Design an Ads Ranking System with Calibrated pCTR/pCVR and Delayed Conversions
A Staff-level AI system-design walkthrough of the prediction layer behind an ads auction — the calibrated pCTR/pCVR models and delayed-conversion training that systems at Meta, TikTok, Google Ads, and Amazon Ads depend on. You predict click and conversion probabilities, fold in the advertiser bid to rank by eCPM/expected value, defend calibration as a billing-critical SLO, and handle conversions that land T+1/T+3/T+7 days after the click. The distinguishing skill is treating calibration and delayed labels as primary design constraints rather than model-internals afterthoughts.
Frame the problem — the score is a price, not just a ranking
In an ads auction the model’s output is not just a ranking; it is a price. eCPM = pCTR × pCVR × bid × value, and the platform bills against that probability. A model that ranks perfectly but is miscalibrated by 20% systematically overcharges advertisers or underspends their budgets.
The reframe: A naive read of “design an ads ranker” says predict a score and sort by it. That is the Senior answer, and it is wrong in a way that costs real money. In an ads auction the model’s output is consumed three ways — to rank candidates, to price the winner (what the advertiser is billed), and to pace budget over the day. The last two require the number to be a true probability, not just a correctly-ordered score. So the real task is: produce a calibrated expected value per candidate that the auction can allocate, price, and bill against. Everything downstream — billing trust, pacing fairness, advertiser ROI — is built on the assumption that when the model says “pCVR = 0.03” the realized conversion rate over those impressions is actually 3%.
This is why ranking quality is necessary but not sufficient. AUC is invariant to any strictly increasing transform of the score: multiply every prediction by 0.5, or square it, and the ranking (and the AUC) is unchanged, but the probability is now garbage. A model can have a state-of-the-art AUC and be badly miscalibrated, which makes it perfectly usable for ordering and completely unusable for pricing. The Staff move is to name this in the first two minutes and design the whole system so the auction can trust the probability.
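A minimal sketch of that invariance, on made-up toy data (the labels and scores are hypothetical): halving every score leaves the pairwise AUC untouched while the calibration ratio drops from 1.0 to 0.5.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_ratio(labels, scores):
    """Sum of predictions divided by sum of outcomes; 1.0 means calibrated on aggregate."""
    return sum(scores) / sum(labels)


# Hypothetical toy data: ten impressions, two conversions.
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.05, 0.10, 0.10, 0.15, 0.30, 0.20, 0.10, 0.25, 0.60, 0.15]
halved = [0.5 * s for s in scores]  # a strictly increasing transform

auc_raw, auc_halved = auc(labels, scores), auc(labels, halved)
ratio_raw = calibration_ratio(labels, scores)
ratio_halved = calibration_ratio(labels, halved)
```

The ordering metric cannot see the damage; only a metric that compares predicted mass to observed outcomes can.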
Who asks this and what they probe
Scope of this design
In scope: per-impression pCTR + pCVR + predicted value, the calibration layer, delayed-label training, and the eCPM hand-off into allocation and pricing. Upstream context (not owned here): candidate generation / retrieval, creative ranking, and brand-safety filtering produce the candidate set we score. Downstream (we feed it): the auction service does allocation and pricing on top of our calibrated probabilities.
- Functional: for each candidate impression, emit calibrated pCTR and pCVR (and a predicted value where applicable), combine into eCPM, and hand off to the auction.
- Non-functional: p99 ML scoring slice in the low tens of milliseconds; per-slice calibration ratio (predicted / actual) held within roughly [0.95, 1.05]; label pipeline correct under a T+7 attribution window.
Requirements, scale, and the eCPM objective
Scale envelope
These are defensible order-of-magnitude numbers for a large platform; the exact figures matter less than the shape (huge fan-out, tiny per-request budget, slow labels).
The objective function
We rank candidates by expected value per impression, expressed as eCPM (effective cost per mille, i.e. per thousand impressions):
eCPM = pCTR × pCVR × bid × advertiser_value × pacing_multiplier
- pCTR, pCVR are modeled — predicted per impression.
- bid and advertiser_value come from the campaign (advertiser-supplied).
- pacing_multiplier is a controller output (Step 6), not part of the probability.
The objective specializes by campaign type. CPC campaigns charge per click, so expected revenue is pCTR × bid — pCVR does not enter. CPA / value-optimized campaigns charge for outcomes, so eCPM must fold in pCVR × conversion_value. The same model serves both; what changes is which terms the auction multiplies in.
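One way to sketch the campaign-type split in code. The `Candidate` fields and the `ecpm` function are hypothetical names for illustration, and the × 1000 per-mille scaling is an assumption added here; it is a constant factor and does not change the ranking.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    p_ctr: float             # calibrated click probability from the model
    p_cvr: float             # calibrated conversion-given-click probability
    bid: float               # advertiser bid per click (used for CPC)
    conversion_value: float  # value per conversion (used for CPA / value campaigns)
    campaign_type: str       # "CPC" or "CPA"
    pacing_multiplier: float = 1.0  # controller output, applied outside the model


def ecpm(c: Candidate) -> float:
    """Expected value per thousand impressions; which terms enter depends on campaign type."""
    if c.campaign_type == "CPC":
        value_per_impression = c.p_ctr * c.bid  # pCVR does not enter
    elif c.campaign_type == "CPA":
        value_per_impression = c.p_ctr * c.p_cvr * c.conversion_value
    else:
        raise ValueError(f"unknown campaign type: {c.campaign_type}")
    # Pacing scales the eCPM; it never rewrites p_ctr or p_cvr.
    return 1000.0 * value_per_impression * c.pacing_multiplier
```

Keeping `pacing_multiplier` as a separate factor means the probabilities stay untouched for every other consumer.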
Why calibration is load-bearing
The same probability is reused three ways: ranking, pricing (what the winner pays), and pacing (how fast budget burns). Because the number is reused for pricing and pacing — not just sorting — a systematic error propagates as money, not just rank noise. A 1.2× miscalibration in pCVR systematically overspends CPA campaigns by ~20%: the system thinks each impression is worth 20% more than it is, bids 20% too high, and burns budget on impressions that don’t convert at the assumed rate. Advertisers see a realized CPA 20% above target and lose trust.
Latency and freshness budget
Two clocks are in tension and this whole design exists to reconcile them. Features must be joined within the request (serving-fresh: real-time user engagement, current campaign state). Conversion labels lag by days. You serve on fresh features but learn from stale labels — Step 5 is dedicated to making that not bias the model downward.
Estimation and architecture — retrieval → L1 → L2 → auction
Two-stage ranking funnel
L1/L2 consistency (a Staff concern)
The cheap L1 model exists only to make L2 affordable — but if L1 prunes a candidate that L2 would have ranked at the top, that value is gone before the good model ever sees it, and no downstream metric will tell you, because L2 never had the chance to score it. Measure funnel recall@k: of the candidates L2 ranks in its top-k, what fraction survived L1? Keep L1 and L2 aligned by distilling L2 → L1 (train L1 to mimic L2’s scores) so the cheap model approximates the expensive one’s ordering rather than diverging from it.
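Funnel recall@k can be sketched as below. It needs L2 scores for candidates that L1 would have dropped, so this assumes a small sample of requests is scored by L2 without pruning; the function and argument names are hypothetical.

```python
def funnel_recall_at_k(l2_scores, l1_survivors, k):
    """Fraction of L2's top-k (scored on the unpruned candidate set) that L1 kept.

    l2_scores: dict of ad_id -> L2 score for every candidate in the request.
    l1_survivors: set of ad_ids that passed the L1 cut.
    """
    top_k = sorted(l2_scores, key=l2_scores.get, reverse=True)[:k]
    kept = sum(1 for ad_id in top_k if ad_id in l1_survivors)
    return kept / len(top_k)
```

For example, if L2's top three are a, b, c and L1 dropped b, recall@3 is 2/3: one of the three ads L2 valued most never reached it.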
Feature store and joins
An online store (low-latency KV, e.g. Redis or a RocksDB-backed store) serves user / ad / context features at request time; an offline store generates training data. Use the same transformation code online and offline to avoid train/serve skew — a feature computed one way in training and another in serving silently corrupts both ranking and calibration.
Features fall in four buckets: user (real-time engagement, history), ad (campaign, creative, advertiser), cross (user × ad affinity), and context (placement, device, time of day). The serving risk is tail latency on the join: precompute and cache embeddings for hot entities so the p99 join doesn’t blow the budget.
The scoring-to-auction hand-off
L2 (after the calibration layer) emits a calibrated pCTR and pCVR. A separate auction service multiplies by bid / value / pacing to get eCPM, runs allocation, then prices the winner. Keeping calibration inside the model boundary is the contract: the auction team can treat the number as a true probability without re-deriving it, and the modeling team owns the guarantee that it is one.
The models — pCTR, pCVR, and entire-space multi-task value modeling
Features and embeddings
The base is a standard large-scale deep recommender: large embedding tables for sparse categorical IDs (user, ad, campaign, creative — millions to billions of values) feeding a DNN with explicit feature crossing (a DCN-style cross network, or deep-and-cross experts) into task-specific towers. The embedding tables dominate memory; the cross network captures user × ad interactions that a plain MLP misses.
Why two heads, one backbone
pCTR and pCVR share the embedding and bottom layers. The reason is data, not elegance: CVR data is far sparser than CTR data — every impression can be a click label, but only clicks can be conversion labels, and only a small fraction of clicks convert. Sharing embeddings is transfer learning that lets the data-rich pCTR signal rescue the data-starved pCVR head — the conversion tower inherits representations learned from orders of magnitude more click data.
Entire-space modeling (ESMM)
There is a subtle bias if you train pCVR naively. Naive pCVR trains only on clicked impressions (those are the only ones with a conversion label) but serves on all impressions — a textbook sample-selection bias, because the click population is not the impression population.
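ESMM (Entire Space Multi-Task Model, Alibaba 2018) fixes this by modeling pCTCVR = pCTR × pCVR over the entire impression space, treating pCVR as an intermediate variable that is never trained directly on the biased click-only subset. The training loss is the sum of two terms, both computed over all impressions: a click loss on pCTR and a click-and-convert loss on pCTCVR.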
Because the conversion loss is supervised through pCTCVR over all impressions, pCVR is learned implicitly over the entire space and is valid when applied to any impression — killing the sample-selection bias.
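A minimal sketch of the ESMM loss in plain Python, assuming the two heads have already produced `p_ctr` and `p_cvr` for each impression (the tuple layout is a hypothetical choice for illustration). Note that no term ever compares `p_cvr` alone against a label.

```python
import math


def bce(label, p, eps=1e-12):
    """Binary cross-entropy for one example, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def esmm_loss(batch):
    """batch: iterable of (clicked, converted, p_ctr, p_cvr) over ALL impressions."""
    total, n = 0.0, 0
    for clicked, converted, p_ctr, p_cvr in batch:
        p_ctcvr = p_ctr * p_cvr                     # pCTCVR = pCTR x pCVR
        total += bce(clicked, p_ctr)                # CTR task, label = click
        total += bce(clicked * converted, p_ctcvr)  # CTCVR task, label = click AND convert
        n += 1
    return total / n
```

Because unclicked impressions contribute to the CTCVR term with label 0, the pCVR head receives gradient from the whole impression space.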
MMoE task towers
MMoE (Multi-gate Mixture-of-Experts) replaces a single shared bottom with a set of shared experts plus per-task gating networks. Each task (pCTR, pCVR) learns its own gate, so some experts specialize in click behavior, others in conversion behavior, and the gates mix them per task. This reduces task interference — the harm a single shared bottom does when click and conversion objectives pull representations in different directions. It is used in production ads/recommender stacks (e.g. Etsy, Uber’s heterogeneous MMoE).
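The gating idea can be sketched as a tiny forward pass. This is a toy in plain Python under simplifying assumptions (single-layer ReLU experts, linear gates, no biases, hypothetical names); a production model would use a tensor library.

```python
import math


def linear(x, w):
    """w is a list of rows; returns the matrix-vector product w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mmoe_forward(x, expert_weights, gate_weights):
    """expert_weights: one matrix per shared expert.
    gate_weights: dict of task -> matrix with one row per expert.
    Returns dict of task -> gated mixture of expert outputs (the task tower's input)."""
    expert_out = [[max(0.0, v) for v in linear(x, w)] for w in expert_weights]
    dim = len(expert_out[0])
    task_inputs = {}
    for task, gw in gate_weights.items():
        gate = softmax(linear(x, gw))  # this task's mixture weights over experts
        task_inputs[task] = [
            sum(g * e[d] for g, e in zip(gate, expert_out)) for d in range(dim)
        ]
    return task_inputs
```

Each task sees the same experts but a different mixture, which is what lets click and conversion objectives share capacity without forcing one representation on both.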
Loss
The base loss is cross-entropy (log-loss). This is not a default — it is a deliberate calibration choice. Cross-entropy is a proper scoring rule: minimizing it drives the model toward calibrated probabilities, unlike pure ranking losses (pairwise/listwise) that only care about order and can leave the absolute scores arbitrarily scaled. Since the auction needs a true probability, the loss that rewards calibrated probabilities is the right base.
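A quick numeric check of the proper-scoring-rule property, as a sketch: for a slice whose true conversion rate is 3% (a hypothetical value echoing the pCVR = 0.03 example), the expected log-loss over a grid of candidate predictions is minimized by predicting 0.03.

```python
import math


def expected_log_loss(true_rate, predicted):
    """Expected cross-entropy when outcomes are Bernoulli(true_rate) and we predict `predicted`."""
    return -(true_rate * math.log(predicted) + (1 - true_rate) * math.log(1 - predicted))


true_rate = 0.03
grid = [i / 1000 for i in range(1, 1000)]  # candidate predictions 0.001 .. 0.999
best = min(grid, key=lambda p: expected_log_loss(true_rate, p))
```

A pairwise ranking loss has no such minimum: any order-preserving rescaling of the scores achieves the same loss.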
Data model for labels and calibration — the billing-critical SLO
Metric: ECE and calibration ratio
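You cannot defend calibration as an SLO without a number. Two complementary ones:
- ECE (expected calibration error): bucket predictions into bins and take the impression-weighted average of |mean predicted − observed rate| across bins.
- Calibration ratio: the sum of predicted probabilities divided by the count of actual outcomes, computed per slice; 1.0 is calibrated on aggregate.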
ECE captures shape (over-confident in some ranges, under in others); the calibration ratio captures aggregate over/under-prediction per slice and maps directly to over/under-spend.
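Both metrics fit in a few lines. This sketch uses equal-width bins on [0, 1], which is an assumption made for simplicity; with ads-scale rates of a few percent, equal-frequency or log-spaced bins are likely a better fit, since equal-width bins would put nearly everything in the first bucket.

```python
def calibration_ratio(preds, labels):
    """Predicted / actual: above 1 means over-prediction, below 1 under-prediction."""
    return sum(preds) / sum(labels)


def expected_calibration_error(preds, labels, n_bins=10):
    """Impression-weighted mean absolute gap between predicted and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_pred = sum(p for p, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_pred - observed)
    return ece
```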
Global recalibration
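Calibration is fit on a held-out post-training set, separate from training, by mapping raw model scores to corrected probabilities. Three standard mappings:
- Isotonic regression: a non-parametric, monotone step function fit to observed rates.
- Platt scaling: a two-parameter sigmoid, σ(a·s + b), applied to the raw score s.
- Temperature scaling: a single parameter T that divides the logit before the sigmoid.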
Isotonic is the most flexible but needs data to avoid overfitting the calibration curve; Platt’s two parameters are robust on thin slices; temperature is a one-knob default for DNNs.
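As an illustration of the most flexible of these, here is a sketch of isotonic regression via the pool-adjacent-violators algorithm. It assumes distinct scores (tied scores should be aggregated first) and uses hypothetical function names; a production system would more likely use a library implementation.

```python
def fit_isotonic(scores, labels):
    """Pool-adjacent-violators. Returns (thresholds, values): a non-decreasing step map
    where a score <= thresholds[i] (and above thresholds[i-1]) maps to values[i]."""
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = hi
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def apply_isotonic(model, score):
    """Map a raw score to its calibrated value; scores above the last threshold get the last value."""
    thresholds, values = model
    for t, v in zip(thresholds, values):
        if score <= t:
            return v
    return values[-1]
```

The step map has as many free values as there are blocks, which is why it overfits thin slices where the two-parameter Platt sigmoid stays stable.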
Per-slice calibration is where money leaks
This is the Staff insight. A globally-calibrated model can be badly miscalibrated on specific slices — new advertisers, cold campaigns, rare conversion types, new placements — precisely the slices with little data, and precisely where revenue leaks because nobody is watching them. The fix is to fit segment-level recalibrators and monitor ECE per slice, not just globally. A global ECE of 0.5% can hide a 30% over-prediction on new advertisers that silently overcharges every one of them.
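A sketch of the per-slice check on made-up numbers: the slice names and counts below are hypothetical, chosen so the global ratio is exactly 1.0 while the new-advertiser slice over-predicts by 30%. The band is the [0.95, 1.05] range from the non-functional requirements.

```python
from collections import defaultdict


def calibration_by_slice(rows):
    """rows: iterable of (slice_name, predicted_prob, label). Returns slice -> predicted/actual."""
    pred_sum, label_sum = defaultdict(float), defaultdict(float)
    for name, p, y in rows:
        for key in (name, "__all__"):
            pred_sum[key] += p
            label_sum[key] += y
    return {k: pred_sum[k] / label_sum[k] for k in pred_sum if label_sum[k] > 0}


def out_of_band(ratios, band=(0.95, 1.05)):
    """Slices whose calibration ratio falls outside the SLO band."""
    return sorted(k for k, r in ratios.items() if not band[0] <= r <= band[1])


# Hypothetical traffic: a large, slightly under-predicted slice masks a small over-predicted one.
rows = (
    [("established", 0.0985, 1)] * 200 + [("established", 0.0985, 0)] * 1800
    + [("new_advertiser", 0.13, 1)] * 10 + [("new_advertiser", 0.13, 0)] * 90
)
ratios = calibration_by_slice(rows)
breaches = out_of_band(ratios)
```

Here `ratios["__all__"]` is 1.0 and would pass any global check, while `breaches` contains only the new-advertiser slice.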
Online cadence and guardrails
Calibration drifts with traffic mix and seasonality, and the base model cannot be refit in real time: recalibration needs fresh outcomes. So refresh the recalibrator frequently (hourly to daily) and run a fast online correction layer on top of the slower base model to absorb short-term drift between refits.
Guardrails: when a per-slice calibration ratio breaches the band, auto-rollback to the previous recalibrator (and page). Large systems (e.g. LinkedIn’s LiRank) treat calibration drift as a paging-level SLO, not a dashboard nicety. One trap to watch is maximization bias (the winner’s curse): because the auction picks the max-eCPM candidate out of hundreds, winners are disproportionately candidates whose predictions erred high, so the predicted rate among winners systematically exceeds their realized rate, even when the model is calibrated on the full population. Monitor calibration conditioned on winning, not just on all impressions.
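A small simulation of selection on noisy scores, as a sketch under stated assumptions: true rates are uniform on [0.01, 0.05], predictions carry mean-one multiplicative noise (so the aggregate calibration ratio is about 1.0), and every number here is hypothetical. It compares predicted / true over all candidates against the same ratio restricted to auction winners.

```python
import random


def simulate_selection(n_auctions=5000, n_candidates=20, noise=0.3, seed=0):
    """Returns (ratio over all candidates, ratio among winners), each as predicted / true."""
    rng = random.Random(seed)
    all_pred = all_true = win_pred = win_true = 0.0
    for _ in range(n_auctions):
        best_pred, best_true = -1.0, 0.0
        for _ in range(n_candidates):
            true_rate = rng.uniform(0.01, 0.05)
            # Mean-one multiplicative noise: unbiased in aggregate.
            pred = true_rate * rng.uniform(1 - noise, 1 + noise)
            all_pred += pred
            all_true += true_rate
            if pred > best_pred:  # the auction picks the highest prediction
                best_pred, best_true = pred, true_rate
        win_pred += best_pred
        win_true += best_true
    return all_pred / all_true, win_pred / win_true


all_ratio, winner_ratio = simulate_selection()
```

In this setup the all-candidate ratio stays near 1.0 while the winner ratio sits well above it, because the maximum preferentially selects upward errors.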
High-level training architecture — delayed and sparse conversions
The delayed-feedback bias
This is the part most candidates get wrong. Conversions arrive hours to days after the click. If you label at training time, a recent click whose conversion hasn’t landed yet gets marked negative — but it may convert tomorrow. So pCVR is biased systematically downward, and the bias is worst on the freshest data — the very data you most want for capturing current trends. Naive labeling thus trades away exactly the freshness that makes online retraining worthwhile.
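To put a number on the bias, a sketch assuming conversion delays are exponentially distributed (a modeling assumption made here for illustration, with hypothetical parameter names): the naive label rate for a cohort is the true CVR times the fraction of conversions that have landed so far.

```python
import math


def naive_observed_cvr(true_cvr, hours_since_click, mean_delay_hours):
    """Expected label rate if every not-yet-converted click is marked negative.

    Assumes exponential conversion delay with the given mean.
    """
    landed_fraction = 1.0 - math.exp(-hours_since_click / mean_delay_hours)
    return true_cvr * landed_fraction
```

With a true CVR of 3% and a 24-hour mean delay, clicks that are 24 hours old show about 1.9%, and clicks that are 2 hours old show about 0.24%, so the freshest data is the most biased.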
Strategy A: waiting window
Hold each click for a fixed attribution window (T+1 / T+3 / T+7 days) before assigning its label. By the time you train on it, the label is (almost) settled.
- Pro: low label noise — the negative is a real negative.
- Con: the model is stale by the length of the window — a 7-day window means the model never sees the last week of behavior, missing fast-moving trends, new campaigns, and seasonality shifts.
Strategy B: ingest-then-correct
Train on data immediately, treating not-yet-converted as a tentative negative, then correct the resulting distribution shift:
- DEFER (Delayed Feedback with Real negatives) duplicates samples and ingests real negatives alongside the corrected positives, using importance sampling to reweight the loss and undo the distribution shift.
- PU (positive-unlabeled) loss treats the biased negatives as unlabeled rather than true negatives — a not-yet-converted click is "unknown," not "won't convert."
- Pro: fresh — the model sees today's data immediately.
- Con: the reweighting must be careful or variance blows up (large importance weights).
Attribution windows and snapshots
The attribution window (commonly 1-day-click and 7-day-click) is a business/legal choice that defines when a conversion counts. Critical constraint: the model’s label pipeline must match the billing attribution window. If billing credits 7-day conversions but pCVR is trained against a 1-day label, the model is calibrated to a different target than what advertisers are billed for — calibration looks fine in offline eval and is wrong in dollars.
Operationally, maintain T+1 / T+3 / T+7 conversion snapshots. Continuous training reads the freshest snapshot and label-corrects older samples as their true labels resolve — a click that was a tentative negative in the T+1 snapshot becomes a positive once its conversion lands in the T+3 snapshot.
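The snapshot logic can be sketched as a pure function of three timestamps. The 7-day window constant and the function name are assumptions for illustration; the window is assumed equal to the billing attribution window.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

ATTRIBUTION_WINDOW = timedelta(days=7)  # assumed to match the billing window


def label_at(click_time: datetime,
             conversion_time: Optional[datetime],
             snapshot_time: datetime) -> Tuple[int, bool]:
    """Return (label, is_final) for one click as seen from a given snapshot.

    A conversion counts only if it landed inside the attribution window
    and is already visible at snapshot time.
    """
    deadline = click_time + ATTRIBUTION_WINDOW
    converted = (
        conversion_time is not None
        and conversion_time <= deadline
        and conversion_time <= snapshot_time
    )
    # A positive is final at once; a negative is final only after the window closes.
    is_final = converted or snapshot_time >= deadline
    return (1 if converted else 0), is_final
```

A click that converts on day 2 is a tentative negative in the T+1 snapshot and a final positive in the T+3 snapshot; a conversion on day 9 never counts.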
The Staff answer picks by reasoning about these trade-offs for the specific traffic: short windows and fast trends favor ingest-then-correct; very long, sparse conversions with stable patterns can tolerate a waiting window.
Deep dive — auction integration, budget pacing, and value-based bidding
This is where Staff is won. Everything above produces a calibrated probability; this section is about keeping the boundaries between modeling, allocation, pricing, and pacing clean, because the most expensive failures here are systems that smear those concerns together and corrupt the probability.
Allocation vs pricing are separate
The auction does two distinct things and conflating them is a classic error:
- Allocation (rank / who wins): sort by eCPM = pCTR × pCVR × bid × value. The model feeds this.
- Pricing (what the winner pays): charge based on the competition rather than the winner’s own bid: the next-ranked candidate’s score under GSP, or the externality the winner imposes on others under VCG.
The model feeds allocation; pricing is auction logic layered on top. This separation is why the probability must be calibrated: GSP and VCG both compute the price from the predicted rates, so a miscalibrated pCVR doesn’t just mis-rank — it mis-bills.
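To make the dependence of price on predicted rates concrete, here is a sketch of quality-weighted GSP for a single slot with CPC campaigns. This is a simplification made for illustration (no reserve price, one slot, hypothetical tuple layout): the winner pays the smallest per-click bid that would still keep it ranked first.

```python
def gsp_price_per_click(candidates):
    """candidates: list of (name, p_ctr, bid_per_click), at least two entries.

    Rank by p_ctr * bid; the winner's price per click is the runner-up's
    rank score divided by the winner's own p_ctr.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    price = (runner_up[1] * runner_up[2]) / winner[1]
    return winner[0], price
```

With candidates a (pCTR 0.04, bid 2.0) and b (pCTR 0.02, bid 3.0), a wins and pays 0.06 / 0.04 = 1.5 per click. If a's pCTR were over-predicted as 0.048, the ranking would not change but the price would fall to 1.25: same winner, wrong bill.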
Budget pacing as a controller
A campaign with a daily budget should spend smoothly across the day, not exhaust its budget by noon and miss the afternoon audience. Pacing is a control problem:
- PID controllers are the industry default — compute the error between actual and target spend, and adjust a bid multiplier in real time (proportional + integral + derivative terms damp oscillation).
- Probabilistic throttling sets a pacing_rate ∈ [0, 1] — e.g. 0.34 means a 34% chance the campaign even enters the auction this request.
- MPC (model predictive control) looks ahead when supply is volatile (predictable traffic spikes).
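A minimal PID pacing sketch. The gains, clamp limits, and class name are hypothetical and would need tuning per platform, and there is no anti-windup on the integral term; the point is only the shape of the controller and that its output is a bid multiplier, not a probability.

```python
class PacingPID:
    """Turns spend error into a bid multiplier applied to eCPM downstream of the model."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.0, min_mult=0.1, max_mult=2.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_mult, self.max_mult = min_mult, max_mult
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_spend, actual_spend):
        # Relative error: positive means under-spending, so the multiplier should rise.
        error = (target_spend - actual_spend) / max(target_spend, 1e-9)
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        raw = 1.0 + self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(raw, self.min_mult), self.max_mult)
```

A campaign that has spent 150 against a target of 100 at this point in the day gets a multiplier of 0.70 on the first update, lowering its eCPM without touching pCTR or pCVR.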
Pacing must not corrupt calibration
Here is the trap that separates Staff from Senior. Pacing as a bid multiplier on eCPM is fine. But if pacing logic is folded into the predicted rate — say, someone scales pCVR down to slow a campaign — it corrupts calibration: the model’s pCVR is no longer a true probability, and now billing and every other consumer of that number are wrong. The rule is a hard boundary: pacing multiplies eCPM downstream of the model; it never touches the probability. The pCTR/pCVR coming out of the model stay true probabilities no matter how aggressively a campaign is being throttled.
Value-based / target-CPA bidding
Increasingly advertisers bid for outcomes, not clicks. Under target-CPA the system effectively bids, per click:
effective_bid = target_CPA × pCVR
This makes the calibration SLO a dollar contract: if pCVR is miscalibrated by 20%, the effective bid is off by 20%, and the advertiser’s realized CPA misses target by 20% — the direct dollar consequence of Step 4’s SLO. The calibration band [0.95, 1.05] is not an abstract quality bar; it is the tolerance on every advertiser’s realized CPA.
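The arithmetic behind that claim, as a sketch: the per-click bid is target_CPA × pCVR, and for simplicity this assumes the advertiser is charged exactly that bid per click (an assumption; actual auction pricing would charge at most the bid). Function names are hypothetical.

```python
def target_cpa_bid_per_click(target_cpa, predicted_p_cvr):
    """Per-click bid that would hit the target CPA if the prediction were exact."""
    return target_cpa * predicted_p_cvr


def realized_cpa(target_cpa, predicted_p_cvr, true_p_cvr):
    """Cost per actual conversion when paying the bid per click."""
    cost_per_click = target_cpa_bid_per_click(target_cpa, predicted_p_cvr)
    return cost_per_click / true_p_cvr
```

With a target CPA of 50, a true pCVR of 0.03, and a prediction of 0.036 (1.2× too high), the realized CPA is 60: the 20% calibration error becomes a 20% miss on the advertiser's target.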
Cross-team ownership (the Staff signal)
The distinguishing Staff behavior is naming the interfaces and their guardrails as explicit contracts with other teams:
- Calibration SLO is the contract between the modeling team and the auction/billing team — they consume the probability as a price.
- Funnel recall is the contract with the retrieval team — they guarantee L1 doesn't prune what L2 would value.
- Attribution window is the contract with the billing/measurement team — the label target must equal the billing target.
Naming these interfaces, their owners, and their auto-rollback guardrails is what distinguishes Staff from a strong Senior who optimizes one model in isolation.
Rollout — metrics, experimentation, and online evaluation
Three metric families
A single number cannot tell you if this system is healthy; you need three families, and a change can win one while losing another.
GAUC (AUC computed per user/group then averaged) matters because global AUC can be inflated by easy cross-user ordering while per-user ranking is poor.
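A sketch of GAUC next to global AUC. Weighting each user's AUC by impression count and skipping users with only one label class are common conventions assumed here, and the toy rows are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC; ties count half. Requires at least one positive and one negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))


def gauc(rows):
    """rows: iterable of (user_id, label, score). Impression-weighted mean of per-user AUC."""
    by_user = defaultdict(list)
    for user, y, s in rows:
        by_user[user].append((y, s))
    num = den = 0.0
    for items in by_user.values():
        labels = [y for y, _ in items]
        if 0 < sum(labels) < len(labels):  # skip users with a single label class
            num += len(items) * auc(labels, [s for _, s in items])
            den += len(items)
    return num / den if den else float("nan")


# Hypothetical data: u1 gets high scores overall, u2 low, but each user's own ordering is wrong.
rows = [
    ("u1", 1, 0.80), ("u1", 1, 0.90), ("u1", 1, 0.85), ("u1", 0, 0.95),
    ("u2", 1, 0.05), ("u2", 0, 0.10), ("u2", 0, 0.20), ("u2", 0, 0.15),
]
global_auc = auc([y for _, y, _ in rows], [s for _, _, s in rows])
grouped_auc = gauc(rows)
```

Global AUC earns credit for ranking u1's conversions above u2's non-conversions, a comparison that never happens within a single request.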
A/B with guardrails
Run budget-split or user-split experiments with guardrail metrics on per-slice calibration and pacing fairness, not just topline revenue. The failure to catch: a treatment that lifts revenue by silently overcharging CPA campaigns is a loss — advertisers churn next quarter. Watch advertiser-side harm (realized CPA drift) as a hard guardrail that can block a revenue-positive launch.
Counterfactual / replay evaluation
There is a real offline/online gap: AUC improvements offline routinely fail to move online revenue, because of feedback loops and selection bias (the model only sees outcomes on ads it chose to show). Counterfactual / replay evaluation with inverse-propensity weighting (IPW) estimates online impact before shipping: replay logged auctions under the new model, weighting each logged outcome by the new policy’s probability of taking the logged action divided by the logging policy’s propensity, to get an estimate without a full A/B that is unbiased provided the logged propensities are correct and nonzero wherever the new model would act.
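A minimal IPW estimator, as a sketch. The log tuple layout, the `new_policy_prob` callback, and the clip value are hypothetical; clipping large weights trades a little bias for much lower variance.

```python
def ipw_value(logs, new_policy_prob, clip=10.0):
    """Estimate the new policy's average reward per request from logged data.

    logs: list of (context, shown_ad, logged_propensity, reward), where
          logged_propensity is the logging policy's probability of showing shown_ad.
    new_policy_prob: function (context, ad) -> probability the new policy shows that ad.
    """
    total = 0.0
    for context, ad, propensity, reward in logs:
        weight = min(new_policy_prob(context, ad) / propensity, clip)
        total += weight * reward
    return total / len(logs)
```

For example, if the logging policy showed ads a and b with probability 0.5 each and only a earned reward 1, a new policy that always shows a is estimated at 1.0 per request.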
Drift detection and retrain triggers
Monitor the prediction distribution, per-slice calibration ratio, and feature staleness. Trigger retrain or recalibration when calibration breaches the band or the conversion mix shifts (seasonality, a new advertiser cohort). And keep the maximization-bias guardrail from Step 4: monitor calibration conditioned on winning. Predictions among winners can exceed their realized rates even for a globally-calibrated model, and winners are the population that actually gets billed.
Bottlenecks, edge cases, and failure modes
Cold-start (new advertiser / campaign / creative):
- No click or conversion history → high-variance pCTR/pCVR.
- Mitigate with content/embedding priors (use creative and category features when ID features are empty), an exploration budget (Thompson sampling / UCB) to gather data, and Platt-scaled per-slice recalibration for the sparse slice.
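The exploration budget can be sketched with Beta-Bernoulli Thompson sampling over CTR. This is a toy under stated assumptions: a uniform Beta(1, 1) prior, hypothetical arm names and counts, and no bid or value term, which a real system would fold in.

```python
import random


def thompson_pick(arms, rng):
    """arms: dict of name -> (clicks, impressions).

    Draw one CTR sample per arm from its Beta posterior and pick the largest draw.
    """
    draws = {
        name: rng.betavariate(1 + clicks, 1 + impressions - clicks)
        for name, (clicks, impressions) in arms.items()
    }
    return max(draws, key=draws.get)


rng = random.Random(0)
arms = {"established": (30, 1000), "new_creative": (0, 0)}
picks = [thompson_pick(arms, rng) for _ in range(1000)]
```

The new creative has a wide posterior, so it wins most draws until evidence accumulates; as its counts grow, its posterior tightens and exploration tapers off on its own.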
Feedback loops and selection bias:
- The model only observes clicks/conversions on ads it chose to show, so its training data is shaped by its own past decisions — selection bias that compounds over time.
- Mitigate with exploration, ESMM entire-space modeling, and propensity correction (IPW).
Click / conversion fraud and bots:
- Inflate CTR and corrupt labels — a bot farm makes a creative look great and poisons training.
- Filter invalid traffic upstream, exclude it from training, and monitor anomalous per-slice rates as an early signal.
Infra bottlenecks:
- Embedding-table memory and feature-join tail latency dominate.
- Shard embeddings across hosts, cache hot entities, bound the candidate count entering L2, and degrade gracefully to L1-only scores under load rather than blowing the latency budget.
Delayed-label edge cases:
- Conversions beyond the attribution window are uncounted by design; accept the truncation bias (it's a definitional choice, matched to billing).
- Duplicate-conversion dedup and cross-device attribution are measurement traps that silently bias pCVR — a conversion double-counted or attributed to the wrong device corrupts the label the model is calibrated against.
Summary
Pillar 1 — the score is a price. Rank by eCPM = pCTR × pCVR × bid × value, and own calibration as a billing-critical SLO (ECE + per-slice calibration ratio) with guardrails and auto-rollback. A miscalibrated model bills wrong even when it ranks perfectly.
Pillar 2 — delayed labels are a design constraint, not a footnote. Choose waiting-window vs ingest-then-correct (DEFER / PU loss + importance sampling) by reasoning about the bias/variance/freshness tradeoff, and match the attribution window to billing so pCVR is calibrated to the target advertisers are actually charged against.
Pillar 3 — entire-space, multi-task value modeling. Use ESMM (pCTCVR = pCTR × pCVR over all impressions) to kill sample-selection bias, share embeddings to rescue the data-starved pCVR head, and MMoE towers to reduce task interference — addressing CVR sparsity head-on.
Clean boundaries are the Staff tell: pacing multiplies eCPM, never the probability; allocation (rank) is separate from pricing (GSP/VCG externality); and the calibration SLO, funnel recall, and attribution window are explicit contracts with the billing, retrieval, and measurement teams.
Through the role lenses: the SDE owns the funnel and join latency but must reason about why the output is a probability; the MLE owns calibration and delayed-label correctness and ties each to a revenue consequence; the switcher transfers serving/freshness instincts and builds the new muscle of “score = probability = price.”
If you remember one thing: a perfectly-ranked but miscalibrated ads model is worse than useless — it bills wrong and paces wrong. Calibration plus correct delayed labels are the whole game; the bigger model is secondary.
Rubric — Senior vs Staff