
Design an Ads Ranking System with Calibrated pCTR/pCVR and Delayed Conversions

A Staff-level AI system-design walkthrough of the prediction layer behind an ads auction — the calibrated pCTR/pCVR models and delayed-conversion training that systems at Meta, TikTok, Google Ads, and Amazon Ads depend on. You predict click and conversion probabilities, fold in the advertiser bid to rank by eCPM/expected value, defend calibration as a billing-critical SLO, and handle conversions that land T+1/T+3/T+7 days after the click. The distinguishing skill is treating calibration and delayed labels as primary design constraints rather than model-internals afterthoughts.

Level: Staff
Category: AI System Design
Interview time: 60 min


WHAT THIS QUESTION TESTS

· Can you explain why an ads ranker must output a CALIBRATED probability, not just a correctly-ranked score, and what breaks (billing, pacing) when it isn't?

· Do you handle delayed conversions correctly? Naive labeling biases pCVR downward, and you should be able to name a fix (waiting window, DEFER/fake-negative correction, PU loss + importance sampling).

· Can you separate the eCPM/expected-value formula (pCTR × pCVR × bid × value) from the model, and reason about GSP/VCG pricing on top of it?

· Do you treat label freshness, attribution window, and online recalibration as first-class operational concerns with metrics (ECE, calibration ratio), not afterthoughts?

★ STAFF-LEVEL SIGNALS

★ Defends an SLO on calibration error (ECE / predicted-over-actual ratio per slice) as the contract between the model and the auction, with auto-rollback when it drifts.

★ Picks a delayed-feedback strategy by reasoning about the bias/variance and freshness tradeoff (waiting-window latency vs ingest-then-correct), not by naming one paper.

★ Designs the early-stage (retrieval/L1) and late-stage (L2 ranker) consistency so the cheap model doesn't prune candidates the expensive model would have valued, and monitors recall of the funnel.

★ Treats per-slice calibration (new advertisers, cold campaigns, rare conversion types) as where money leaks, and builds segment-level recalibration + guardrails rather than a single global fit.

Frame the problem — the score is a price, not just a ranking

In an ads auction the model’s output is not a ranking — it’s a price. eCPM = pCTR × pCVR × bid × value, and the platform bills against that probability. A model that ranks perfectly but is miscalibrated by 20% overcharges or underspends every advertiser.

The reframe: A naive read of “design an ads ranker” says predict a score and sort by it. That is the Senior answer, and it is wrong in a way that costs real money. In an ads auction the model’s output is consumed three ways — to rank candidates, to price the winner (what the advertiser is billed), and to pace budget over the day. The last two require the number to be a true probability, not just a correctly-ordered score. So the real task is: produce a calibrated expected value per candidate that the auction can allocate, price, and bill against. Everything downstream — billing trust, pacing fairness, advertiser ROI — is built on the assumption that when the model says “pCVR = 0.03” the realized conversion rate over those impressions is actually 3%.

This is why ranking quality is necessary but not sufficient. AUC is invariant to any strictly increasing transform of the score: multiply every prediction by 0.5, or square it, and the ranking (and the AUC) is unchanged, but the probability is now garbage. A model can have a state-of-the-art AUC and be badly miscalibrated, which makes it perfectly usable for ordering and completely unusable for pricing. The Staff move is to name this in the first two minutes and design the whole system so the auction can trust the probability.
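As a small illustration of that invariance, here is a sketch on made-up scores and labels. The `auc` and `calibration_ratio` helpers and the data are illustrative assumptions, not part of any production stack.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_ratio(scores, labels):
    """Sum of predicted probabilities over sum of observed outcomes."""
    return sum(scores) / sum(labels)


# Illustrative data: predicted probabilities and observed outcomes.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.3, 0.2, 0.65, 0.35]

halved = [0.5 * s for s in scores]   # strictly increasing transform
squared = [s * s for s in scores]    # also strictly increasing on [0, 1]

for name, s in [("original", scores), ("halved", halved), ("squared", squared)]:
    print(f"{name:9s} AUC={auc(s, labels):.3f}  ratio={calibration_ratio(s, labels):.3f}")
```

All three rows report the same AUC, while the calibration ratio moves with each transform: ordering survives, the probability does not.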

Who asks this and what they probe

| Role | What they own | What they probe here |
| --- | --- | --- |
| SDE (serving) | Auction serving path, candidate volume from retrieval, low-latency feature joins, the eCPM loop, L1/L2 funnel consistency | Can you keep p99 scoring inside budget with 10^3–10^4 candidates fanning down to a few hundred scored, and reason about WHY the number must be a probability, not just plumb a black box? |
| MLE (modeling) | pCTR/pCVR models, calibration, delayed-label training, multi-task value modeling | Calibration metrics and methods (ECE, isotonic/Platt), the delayed-feedback / PU training problem, ESMM/MMoE, attribution windows, and tying each to a revenue consequence |
| Switcher (SDE → AI) | Bringing serving/freshness instincts into the ML layer | Serving and data-freshness instincts transfer directly; the new muscle is "score = probability = price" and label correctness under delay |

Scope of this design

In scope: per-impression pCTR + pCVR + predicted value, the calibration layer, delayed-label training, and the eCPM hand-off into allocation and pricing. Upstream context (not owned here): candidate generation / retrieval, creative ranking, and brand-safety filtering produce the candidate set we score. Downstream (we feed it): the auction service does allocation and pricing on top of our calibrated probabilities.

Functional: for each candidate impression, emit calibrated pCTR and pCVR (and a predicted value where applicable), combine into eCPM, and hand off to the auction.
Non-functional: p99 ML scoring slice in the low tens of milliseconds; per-slice calibration ratio (predicted / actual) held within roughly [0.95, 1.05]; label pipeline correct under a T+7 attribution window.

Requirements, scale, and the eCPM objective

Scale envelope

These are defensible order-of-magnitude numbers for a large platform; the exact figures matter less than the shape (huge fan-out, tiny per-request budget, slow labels).

| Dimension | Order of magnitude | Implication |
| --- | --- | --- |
| Ad requests / sec (peak) | 10^5 – 10^6 | Scoring cost is multiplied by candidate fan-out, so the funnel must prune hard |
| Candidates from retrieval | 10^3 – 10^4 per request | Cannot run the heavy DNN on all of them |
| Candidates scored by L2 | ~few hundred | The funnel exists to make this affordable |
| End-to-end ads budget | ~50–150 ms | ML scoring slice ~10–30 ms of that |
| Conversion label lag | hours to T+7 days | Labels arrive long after serving: the core tension |

The objective function

We rank candidates by expected value per impression, expressed as eCPM (effective cost per mille — per thousand impressions):

eCPM = pCTR × pCVR × bid × advertiser_value × pacing_multiplier × 1000
       \_________/   \____________________/   \_______________/
         modeled       given by campaign      controller output

pCTR, pCVR are modeled — predicted per impression.
bid and advertiser_value come from the campaign (advertiser-supplied).
pacing_multiplier is a controller output (see the budget-pacing deep dive), not part of the probability.

The objective specializes by campaign type. CPC campaigns charge per click, so expected revenue is pCTR × bid — pCVR does not enter. CPA / value-optimized campaigns charge for outcomes, so eCPM must fold in pCVR × conversion_value. The same model serves both; what changes is which terms the auction multiplies in.
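The specialization by campaign type can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Candidate` record and two hypothetical campaign-type labels ("CPC", "CPA"); real systems carry more bid types and terms.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    campaign_type: str        # "CPC" or "CPA" (hypothetical labels)
    p_ctr: float              # calibrated P(click | impression)
    p_cvr: float              # calibrated P(conversion | click)
    bid: float                # CPC: bid per click; unused for CPA in this sketch
    conversion_value: float   # CPA: value the advertiser places on one conversion
    pacing_multiplier: float = 1.0


def ecpm(c: Candidate) -> float:
    """Expected value per thousand impressions for one candidate."""
    if c.campaign_type == "CPC":
        per_impression = c.p_ctr * c.bid                         # pCVR does not enter
    elif c.campaign_type == "CPA":
        per_impression = c.p_ctr * c.p_cvr * c.conversion_value  # fold in pCVR x value
    else:
        raise ValueError(f"unknown campaign type: {c.campaign_type}")
    # Pacing is applied to the value, never to the probabilities themselves.
    return per_impression * c.pacing_multiplier * 1000


cpc = Candidate("CPC", p_ctr=0.02, p_cvr=0.0, bid=0.50, conversion_value=0.0)
cpa = Candidate("CPA", p_ctr=0.02, p_cvr=0.05, bid=0.0, conversion_value=12.0)
print(ecpm(cpc), ecpm(cpa))
```

Both candidates are scored by the same model outputs; only the terms multiplied in differ, which is what lets one ranker serve mixed campaign types in a single auction.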

Why calibration is load-bearing

The same probability is reused three ways: ranking, pricing (what the winner pays), and pacing (how fast budget burns). Because the number is reused for pricing and pacing — not just sorting — a systematic error propagates as money, not just rank noise. A 1.2× miscalibration in pCVR systematically overspends CPA campaigns by ~20%: the system thinks each impression is worth 20% more than it is, bids 20% too high, and burns budget on impressions that don’t convert at the assumed rate. Advertisers see a realized CPA 20% above target and lose trust.

Latency and freshness budget

Two clocks are in tension and this whole design exists to reconcile them. Features must be joined within the request (serving-fresh: real-time user engagement, current campaign state). Conversion labels lag by days. You serve on fresh features but learn from stale labels, and the delayed-conversion training section is dedicated to making that not bias the model downward.

Estimation and architecture — retrieval → L1 → L2 → auction

Two-stage ranking funnel

┌───────────────────────────────────────────────┐
│             AD REQUEST (user+ctx)             │
└───────────────────────┬───────────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │      RETRIEVAL (ANN + rules)      │
      │       10^3–10^4 candidates        │
      └─────────────────┬─────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │   L1 LIGHT RANKER (cheap model)   │
      │        prune to ~hundreds         │
      └─────────────────┬─────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │  L2 HEAVY RANKER (pCTR/pCVR DNN)  │
      │       full multi-task model       │
      └─────────────────┬─────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │   CALIBRATION LAYER (per-slice)   │
      │     emits TRUE probabilities      │
      └─────────────────┬─────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │     eCPM + PACING multiplier      │
      └─────────────────┬─────────────────┘
                        │
      ┌─────────────────▼─────────────────┐
      │   AUCTION: allocation + pricing   │
      │            (GSP / VCG)            │
      └─────────────────┬─────────────────┘
                        │
         impression / click / conv logs
                        │
      (delayed conversion join → training)

| Component | Job | Latency / consistency concern |
| --- | --- | --- |
| Retrieval | ANN + rules → 10^3–10^4 candidates | Recall of the candidate pool; brand-safety filters |
| L1 light ranker | Cheap prune to ~hundreds | Must not drop candidates L2 would value (funnel recall) |
| L2 heavy ranker | Full pCTR/pCVR DNN | Embedding memory, scoring p99 |
| Calibration layer | Map raw scores to true probabilities, per slice | Refit cadence; per-slice drift |
| eCPM + pacing | Combine probability × bid × value × pace | Pacing must not touch the probability |
| Auction | Allocate by eCPM, price winner | Allocation vs pricing separation |

L1/L2 consistency (a Staff concern)

The cheap L1 model exists only to make L2 affordable — but if L1 prunes a candidate that L2 would have ranked at the top, that value is gone before the good model ever sees it, and no downstream metric will tell you, because L2 never had the chance to score it. Measure funnel recall@k: of the candidates L2 ranks in its top-k, what fraction survived L1? Keep L1 and L2 aligned by distilling L2 → L1 (train L1 to mimic L2’s scores) so the cheap model approximates the expensive one’s ordering rather than diverging from it.
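Funnel recall@k as described above can be computed with a few lines. This sketch assumes L2 scores are available for the full retrieval set on a sampled slice of traffic (a shadow evaluation, which is an assumption here, since production L2 only scores what L1 lets through); the function name and inputs are hypothetical.

```python
def funnel_recall_at_k(l2_scores, l1_survivors, k):
    """Fraction of L2's top-k candidates that survived the L1 prune.

    l2_scores: dict of candidate_id -> L2 score over the FULL retrieval set
               (from an offline / shadow run, not the production path).
    l1_survivors: set of candidate ids that L1 kept.
    """
    top_k = sorted(l2_scores, key=l2_scores.get, reverse=True)[:k]
    if not top_k:
        return 1.0  # nothing to lose when there are no candidates
    kept = sum(1 for cid in top_k if cid in l1_survivors)
    return kept / len(top_k)


# Illustrative request: L1 dropped "b", which L2 would have ranked second.
l2_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
l1_survivors = {"a", "c", "d"}
print(funnel_recall_at_k(l2_scores, l1_survivors, k=3))
```

Averaging this value over sampled requests gives a number that can be alerted on, which is what turns funnel recall into a contract rather than a hope.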

Feature store and joins

An online store (low-latency KV, e.g. Redis or a RocksDB-backed store) serves user / ad / context features at request time; an offline store generates training data. Use the same transformation code online and offline to avoid train/serve skew — a feature computed one way in training and another in serving silently corrupts both ranking and calibration.

Features fall in four buckets: user (real-time engagement, history), ad (campaign, creative, advertiser), cross (user × ad affinity), and context (placement, device, time of day). The serving risk is tail latency on the join: precompute and cache embeddings for hot entities so the p99 join doesn’t blow the budget.

The scoring-to-auction hand-off

L2 (after the calibration layer) emits a calibrated pCTR and pCVR. A separate auction service multiplies by bid / value / pacing to get eCPM, runs allocation, then prices the winner. Keeping calibration inside the model boundary is the contract: the auction team can treat the number as a true probability without re-deriving it, and the modeling team owns the guarantee that it is one.

The models — pCTR, pCVR, and entire-space multi-task value modeling

Features and embeddings

The base is a standard large-scale deep recommender: large embedding tables for sparse categorical IDs (user, ad, campaign, creative — millions to billions of values) feeding a DNN with explicit feature crossing (a DCN-style cross network, or deep-and-cross experts) into task-specific towers. The embedding tables dominate memory; the cross network captures user × ad interactions that a plain MLP misses.

Why two heads, one backbone

pCTR and pCVR share the embedding and bottom layers. The reason is data, not elegance: CVR data is far sparser than CTR data — every impression can be a click label, but only clicks can be conversion labels, and only a small fraction of clicks convert. Sharing embeddings is transfer learning that lets the data-rich pCTR signal rescue the data-starved pCVR head — the conversion tower inherits representations learned from orders of magnitude more click data.

Entire-space modeling (ESMM)

There is a subtle bias if you train pCVR naively. Naive pCVR trains only on clicked impressions (those are the only ones with a conversion label) but serves on all impressions — a textbook sample-selection bias, because the click population is not the impression population.

ESMM (Entire Space Multi-Task Model, Alibaba 2018) fixes this by modeling pCTCVR = pCTR × pCVR over the entire impression space, treating pCVR as an intermediate variable that is never trained directly on the biased click-only subset:

# Trained over ALL impressions (not just clicks):
pCTR   = f_ctr(x)        # P(click | impression)
pCVR   = f_cvr(x)        # P(convert | click), latent: never given a direct loss
pCTCVR = pCTR * pCVR     # P(click & convert | impression)

# Losses are computed on pCTR and pCTCVR over impressions:
L = CE(click, pCTR)
  + CE(conversion, pCTCVR)   # conversion = 1 only if clicked & converted

# At serving time pCVR is read directly from the f_cvr tower.
# Dividing pCTCVR / pCTR is avoided: it is numerically unstable when pCTR is tiny.

Because the conversion loss is supervised through pCTCVR over all impressions, pCVR is learned implicitly over the entire space and is valid when applied to any impression — killing the sample-selection bias.
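The two-term loss can be written out concretely. The following is a framework-free sketch of the loss arithmetic only, assuming the two tower outputs are already computed per impression; `bce` and `esmm_loss` are hypothetical helper names, and a real implementation would use an autodiff framework over batches of tensors.

```python
import math


def bce(label, p, eps=1e-12):
    """Binary cross-entropy for one example, with clipping for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def esmm_loss(batch):
    """Mean ESMM loss over a batch of impressions.

    Each item is (p_ctr, p_cvr, clicked, converted), where p_ctr and p_cvr are
    the two tower outputs for that impression and clicked / converted are 0 or 1.
    """
    total = 0.0
    for p_ctr, p_cvr, clicked, converted in batch:
        p_ctcvr = p_ctr * p_cvr                          # click AND convert | impression
        ctcvr_label = 1 if (clicked and converted) else 0
        # Supervision lands on pCTR and pCTCVR only; pCVR gets no direct loss.
        total += bce(clicked, p_ctr) + bce(ctcvr_label, p_ctcvr)
    return total / len(batch)


batch = [
    (0.10, 0.20, 0, 0),   # not clicked: still contributes to both loss terms
    (0.50, 0.50, 1, 1),   # clicked and converted
    (0.30, 0.10, 1, 0),   # clicked, no conversion
]
print(esmm_loss(batch))
```

Note that the non-clicked impression still contributes a conversion-side term, which is exactly how the CVR tower receives gradient over the entire impression space.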

MMoE task towers

MMoE (Multi-gate Mixture-of-Experts) replaces a single shared bottom with a set of shared experts plus per-task gating networks. Each task (pCTR, pCVR) learns its own gate, so some experts specialize in click behavior, others in conversion behavior, and the gates mix them per task. This reduces task interference — the harm a single shared bottom does when click and conversion objectives pull representations in different directions. It is used in production ads/recommender stacks (e.g. Etsy, Uber’s heterogeneous MMoE).

| Approach | Pros | Cons |
| --- | --- | --- |
| Two separate models | Simple, independent | No transfer; pCVR starves on sparse data; sample-selection bias |
| ESMM | Kills selection bias; pCVR valid over full space; shared embeddings | Still one shared bottom, so task interference is possible |
| MMoE (+ ESMM) | Per-task gating reduces interference; specialization | More params, more tuning |

Loss

The base loss is cross-entropy (log-loss). This is not a default — it is a deliberate calibration choice. Cross-entropy is a proper scoring rule: minimizing it drives the model toward calibrated probabilities, unlike pure ranking losses (pairwise/listwise) that only care about order and can leave the absolute scores arbitrarily scaled. Since the auction needs a true probability, the loss that rewards calibrated probabilities is the right base.

Data model for labels and calibration — the billing-critical SLO

Metric: ECE and calibration ratio

You cannot defend calibration as an SLO without a number. Two complementary ones:

ECE = Σ_b (n_b / N) · | mean_confidence(b) − observed_rate(b) |
  Bin predictions into B buckets; per bucket b, compare the average
  predicted probability to the actual outcome rate; weight by bin
  size (n_b of the N predictions). Lower is better; 0 = perfectly calibrated.

calibration_ratio(slice) = Σ predicted(slice) / Σ actual(slice)
  target ≈ 1.0; guardrail band ≈ [0.95, 1.05]

ECE captures shape (over-confident in some ranges, under in others); the calibration ratio captures aggregate over/under-prediction per slice and maps directly to over/under-spend.
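Both metrics are short to compute. The sketch below uses equal-width bins for ECE, which is an assumption made for simplicity: ad probabilities cluster near zero, so equal-frequency (quantile) bins are often a better fit in practice. Function names and the row layout are hypothetical.

```python
def expected_calibration_error(preds, labels, n_bins=10):
    """Equal-width-bin ECE: size-weighted |mean prediction - observed rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_pred - observed)
    return ece


def calibration_ratio_by_slice(rows):
    """rows: iterable of (slice_name, predicted_prob, outcome). Returns slice -> ratio."""
    pred_sum, actual_sum = {}, {}
    for name, p, y in rows:
        pred_sum[name] = pred_sum.get(name, 0.0) + p
        actual_sum[name] = actual_sum.get(name, 0.0) + y
    return {
        # A slice with zero observed conversions has an undefined ratio; report inf.
        name: (pred_sum[name] / actual_sum[name]) if actual_sum[name] > 0 else float("inf")
        for name in pred_sum
    }


rows = [("new_adv", 0.3, 0), ("new_adv", 0.3, 1), ("established", 0.1, 1)]
print(calibration_ratio_by_slice(rows))
print(expected_calibration_error([0.8, 0.8], [0, 0]))
```

Thin slices make the ratio noisy, so a production guardrail would normally require a minimum event count per slice before acting on a breach.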

Global recalibration

Calibration is fit on a held-out post-training set, separate from training, by mapping raw model scores to corrected probabilities:

| Method | Form | Best when |
| --- | --- | --- |
| Isotonic regression | Non-parametric, any monotonic shape | Calibration data plentiful (global, high-traffic) |
| Platt scaling | 2-parameter sigmoid | Sparse slices (new advertisers): data-efficient |
| Temperature scaling | Single parameter | DNN logits, quick global correction |

Isotonic is the most flexible but needs data to avoid overfitting the calibration curve; Platt’s two parameters are robust on thin slices; temperature is a one-knob default for DNNs.
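To make the two-parameter case concrete, here is a minimal Platt-scaling fit written with plain gradient descent. It is a sketch under stated assumptions: the learning rate, step count, and toy data are illustrative, and a production fit would typically use a library optimizer and a held-out calibration set per slice.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(logits, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * logit + b) by gradient descent on mean log-loss."""
    a, b = 1.0, 0.0
    n = len(logits)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y   # derivative of log-loss w.r.t. (a*z + b)
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out set: the raw model is over-confident at both score levels.
logits = [-1.0] * 10 + [1.0] * 10
labels = [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4   # 30% and 60% observed rates
a, b = fit_platt(logits, labels)
print(sigmoid(a * -1.0 + b), sigmoid(a * 1.0 + b))
```

With only two parameters the fit cannot chase noise in a thin slice, which is the reason it is preferred over isotonic regression when a slice has few outcomes.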

Per-slice calibration is where money leaks

This is the Staff insight. A globally-calibrated model can be badly miscalibrated on specific slices — new advertisers, cold campaigns, rare conversion types, new placements — precisely the slices with little data, and precisely where revenue leaks because nobody is watching them. The fix is to fit segment-level recalibrators and monitor ECE per slice, not just globally. A global ECE of 0.5% can hide a 30% over-prediction on new advertisers that silently overcharges every one of them.

Online cadence and guardrails

Calibration drifts with traffic mix and seasonality, and it cannot be corrected inside a single request: any fix needs a refit on fresh realized outcomes. So refresh the recalibrator frequently (hourly to daily), and run a fast online correction layer on top of the slower base model to absorb short-term drift between refits.

Guardrails: when a per-slice calibration ratio breaches the band, auto-rollback to the previous recalibrator (and page). Large systems (e.g. LinkedIn’s LiRank) treat calibration drift as a paging-level SLO, not a dashboard nicety. One trap to watch is maximization bias: because the auction picks the max-eCPM candidate out of hundreds, it preferentially selects candidates whose predictions happen to be too high, so the prediction among winners systematically exceeds their realized rate, even when the model is calibrated on the full population. Monitor calibration conditioned on winning, not just on all impressions.
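A small simulation shows why calibration has to be checked on winners separately. It is a toy model under loud assumptions: equal bids, true rates drawn uniformly, and predictions equal to the true rate times mean-one lognormal noise, so the model is unbiased over all candidates. It compares predictions with the true rates (the expected realized rates) rather than sampled outcomes.

```python
import math
import random


def simulate_winner_calibration(n_auctions=5000, n_candidates=10, noise_sd=0.3, seed=0):
    """Return (ratio over all candidates, ratio over auction winners).

    Each ratio is sum(predicted) / sum(true rate). Noise is multiplicative with
    mean exactly 1, so predictions are unbiased in aggregate.
    """
    rng = random.Random(seed)
    all_pred = all_true = win_pred = win_true = 0.0
    for _ in range(n_auctions):
        cands = []
        for _ in range(n_candidates):
            true_rate = rng.uniform(0.01, 0.05)
            noise = math.exp(rng.gauss(0.0, noise_sd) - noise_sd * noise_sd / 2.0)
            pred = true_rate * noise
            cands.append((pred, true_rate))
            all_pred += pred
            all_true += true_rate
        best_pred, best_true = max(cands)   # equal bids: highest prediction wins
        win_pred += best_pred
        win_true += best_true
    return all_pred / all_true, win_pred / win_true


ratio_all, ratio_winners = simulate_winner_calibration()
print(f"all candidates: {ratio_all:.3f}   winners only: {ratio_winners:.3f}")
```

In this toy setup the all-candidate ratio sits near 1 while the winners-only ratio is clearly above 1: selecting the maximum also selects the luckiest noise, and winners are the population that gets billed.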

High-level training architecture — delayed and sparse conversions

The delayed-feedback bias

This is the part most candidates get wrong. Conversions arrive hours to days after the click. If you label at training time, a recent click whose conversion hasn’t landed yet gets marked negative — but it may convert tomorrow. So pCVR is biased systematically downward, and the bias is worst on the freshest data — the very data you most want for capturing current trends. Naive labeling thus trades away exactly the freshness that makes online retraining worthwhile.

Strategy A: waiting window

Hold each click for a fixed attribution window (T+1 / T+3 / T+7 days) before assigning its label. By the time you train on it, the label is (almost) settled.

Pro: low label noise — the negative is a real negative.
Con: the model is stale by the length of the window — a 7-day window means the model never sees the last week of behavior, missing fast-moving trends, new campaigns, and seasonality shifts.

Strategy B: ingest-then-correct

Train on data immediately, treating not-yet-converted as a tentative negative, then correct the resulting distribution shift:

# Fake-negative / DEFER-style correction:
# 1. Ingest every click immediately as a (tentative) negative.
# 2. When a conversion later lands, ingest it as a positive
#    (a duplicate of the same click: a "real" delayed positive).
# 3. Reweight with importance sampling to undo the bias from
#    counting eventual-positives as initial-negatives:
L = Σ_i w_i · CE(y_i, p_i)
# where w_i corrects p(observed label) vs p(true label)

DEFER (DElayed FEedback modeling with Real negatives) duplicates samples and ingests real negatives alongside the corrected positives, using importance sampling to reweight the loss and undo the distribution shift.
PU (positive-unlabeled) loss treats the biased negatives as unlabeled rather than true negatives — a not-yet-converted click is "unknown," not "won't convert."
Pro: fresh — the model sees today's data immediately.
Con: the reweighting must be careful or variance blows up (large importance weights).
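One concrete form of the importance weights is the fake-negative weighted loss from the delayed-feedback literature, sketched below. This is the simpler correction, not DEFER's exact weighting (which also accounts for the waiting window and the duplicated real negatives). It assumes every click is ingested once as a negative and every converter is ingested again as a positive; the helper names are hypothetical.

```python
import math


def fnw_weight(label, p_cvr):
    """Importance weight for one row of the fake-negative stream.

    If a click's true conversion rate is p, it shows up as a positive with
    probability p / (1 + p) and as a negative with probability 1 / (1 + p).
    Dividing the true label distribution by that observed one gives:
      positives: 1 + p          negatives: (1 - p) * (1 + p)
    p_cvr is the model's current estimate, treated as a constant (no gradient).
    """
    if label == 1:
        return 1.0 + p_cvr
    return (1.0 - p_cvr) * (1.0 + p_cvr)


def fnw_loss(rows, eps=1e-12):
    """Mean weighted log-loss; rows are (label, model_prediction) pairs."""
    total = 0.0
    for label, p_hat in rows:
        p = min(max(p_hat, eps), 1.0 - eps)
        ce = -math.log(p) if label == 1 else -math.log(1.0 - p)
        total += fnw_weight(label, p) * ce
    return total / len(rows)


# Sanity check of the weights at a true rate of 20%.
p = 0.2
observed_pos, observed_neg = p / (1 + p), 1 / (1 + p)
print(observed_pos * fnw_weight(1, p), observed_neg * fnw_weight(0, p))
```

The check shows the reweighted label mass returning to p and 1 - p. In an autodiff framework the weight would be wrapped in a stop-gradient so that only the cross-entropy term is differentiated.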

Attribution windows and snapshots

The attribution window (commonly 1-day-click and 7-day-click) is a business/legal choice that defines when a conversion counts. Critical constraint: the model’s label pipeline must match the billing attribution window. If billing credits 7-day conversions but pCVR is trained against a 1-day label, the model is calibrated to a different target than what advertisers are billed for — calibration looks fine in offline eval and is wrong in dollars.

Operationally, maintain T+1 / T+3 / T+7 conversion snapshots. Continuous training reads the freshest snapshot and label-corrects older samples as their true labels resolve — a click that was a tentative negative in the T+1 snapshot becomes a positive once its conversion lands in the T+3 snapshot.

| Strategy | Bias | Variance | Freshness |
| --- | --- | --- | --- |
| Waiting window | Low | Low | Poor (stale by window length) |
| Ingest-then-correct (DEFER / PU) | Low if reweighted right | Higher (importance weights) | High (sees data now) |

The Staff answer picks by reasoning about this table for the specific traffic — short windows and fast trends favor ingest-then-correct; very long, sparse conversions with stable patterns can tolerate a waiting window.

Deep dive — auction integration, budget pacing, and value-based bidding

WHERE STAFF IS WON

This is where Staff is won. Everything above produces a calibrated probability; this section is about keeping the boundaries between modeling, allocation, pricing, and pacing clean — because the most expensive failures here are systems that smear those concerns together and corrupt the probability.

Allocation vs pricing are separate

The auction does two distinct things and conflating them is a classic error:

Allocation (rank / who wins): sort by eCPM = pCTR × pCVR × bid × value. The model feeds this.
Pricing (what the winner pays): charge based on the externality the winner imposes on others — not its own bid.

| Mechanism | Winner pays | Properties |
| --- | --- | --- |
| GSP (generalized second price) | Just enough to hold its position over the next competitor | Simple, industry-standard; not strictly truthful |
| VCG (Vickrey-Clarke-Groves) | The welfare loss it imposes on all others | Truthful in theory; more complex, noise-sensitive |

The model feeds allocation; pricing is auction logic layered on top. This separation is why the probability must be calibrated: GSP and VCG both compute the price from the predicted rates, so a miscalibrated pCVR doesn’t just mis-rank — it mis-bills.
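A minimal GSP sketch for per-click bidding makes the dependence of price on the predicted rate explicit. Assumptions: ranking by pCTR × bid, no reserve price, no position effects, and a hypothetical tuple layout for candidates.

```python
def gsp_prices(candidates, slots):
    """Quality-weighted GSP. candidates: list of (ad_id, p_ctr, bid_per_click).

    Rank by p_ctr * bid. Each winner pays, per click, the smallest bid that
    would still keep its rank score at or above the next candidate's score.
    A winner with nobody ranked below it pays 0 here (no reserve in this sketch).
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    results = []
    for i, (ad_id, p_ctr, _bid) in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            next_score = ranked[i + 1][1] * ranked[i + 1][2]
            price = next_score / p_ctr
        else:
            price = 0.0
        results.append((ad_id, price))
    return results


ads = [("A", 0.05, 1.00), ("B", 0.02, 2.00), ("C", 0.03, 1.00)]
print(gsp_prices(ads, slots=2))

# Same auction, but A's pCTR is over-predicted by 1.2x: A's price per click drops.
inflated = [("A", 0.06, 1.00), ("B", 0.02, 2.00), ("C", 0.03, 1.00)]
print(gsp_prices(inflated, slots=2))
```

The second call keeps the same ranking but changes what A is charged, which is the mis-billing effect: the winner's own predicted rate sits in the denominator of its price.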

Budget pacing as a controller

A campaign with a daily budget should spend smoothly across the day, not exhaust its budget by noon and miss the afternoon audience. Pacing is a control problem:

PID controllers are the industry default — compute the error between actual and target spend, and adjust a bid multiplier in real time (proportional + integral + derivative terms damp oscillation).
Probabilistic throttling sets a pacing_rate ∈ [0, 1] — e.g. 0.34 means a 34% chance the campaign even enters the auction this request.
MPC (model predictive control) looks ahead when supply is volatile (predictable traffic spikes).
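The first two options can be sketched briefly. The gains, clamp range, and class name below are hypothetical choices for illustration, and the controller works on normalized spend error; it outputs a multiplier for the bid side of eCPM and never sees the predicted probabilities.

```python
import random


class PacingPID:
    """PID controller that turns spend error into a bid multiplier around 1.0."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.0, min_mult=0.0, max_mult=2.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_mult, self.max_mult = min_mult, max_mult
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_spend, actual_spend):
        """Call once per control interval with cumulative target and actual spend."""
        error = (target_spend - actual_spend) / max(target_spend, 1e-9)
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        raw = 1.0 + self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(raw, self.min_mult), self.max_mult)


def enters_auction(pacing_rate, rng=random):
    """Probabilistic throttling: participate with probability pacing_rate."""
    return rng.random() < pacing_rate


pid = PacingPID()
print(pid.update(target_spend=100.0, actual_spend=150.0))   # overspending: multiplier < 1
```

Overspend produces a multiplier below 1 and underspend a multiplier above 1; the integral term is what removes a persistent offset that the proportional term alone would leave.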

Pacing must not corrupt calibration

Here is the trap that separates Staff from Senior. Pacing as a bid multiplier on eCPM is fine. But if pacing logic is folded into the predicted rate — say, someone scales pCVR down to slow a campaign — it corrupts calibration: the model’s pCVR is no longer a true probability, and now billing and every other consumer of that number are wrong. The rule is a hard boundary: pacing multiplies eCPM downstream of the model; it never touches the probability. The pCTR/pCVR coming out of the model stay true probabilities no matter how aggressively a campaign is being throttled.

Value-based / target-CPA bidding

Increasingly advertisers bid for outcomes, not clicks. Under target-CPA the system effectively bids:

bid_effective = target_CPA × pCVR

This makes the calibration SLO a dollar contract: if pCVR is miscalibrated by 20%, the effective bid is off by 20%, and the advertiser’s realized CPA misses target by 20%. That is the direct dollar consequence of the calibration SLO defined earlier. The calibration band [0.95, 1.05] is not an abstract quality bar; it is the tolerance on every advertiser’s realized CPA.
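The arithmetic behind that claim is short enough to write down. This sketch assumes, for simplicity, that the campaign pays its effective bid on every click (no second-price discount); the function name and numbers are illustrative.

```python
def realized_cpa(target_cpa, predicted_cvr, true_cvr):
    """Realized cost per conversion when bidding target_cpa * predicted_cvr per click.

    Simplifying assumption: spend per click equals the effective bid.
    """
    bid_per_click = target_cpa * predicted_cvr   # bid_effective from the text
    conversions_per_click = true_cvr
    return bid_per_click / conversions_per_click


# Target CPA of 50 with pCVR over-predicted by 1.2x (0.036 predicted vs 0.03 true).
print(realized_cpa(target_cpa=50.0, predicted_cvr=0.036, true_cvr=0.03))
```

Under this assumption the realized CPA is the target multiplied by the calibration ratio (predicted over actual), so the guardrail band on the ratio is literally the band on advertiser CPA.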

Cross-team ownership (the Staff signal)

The distinguishing Staff behavior is naming the interfaces and their guardrails as explicit contracts with other teams:

Calibration SLO is the contract between the modeling team and the auction/billing team — they consume the probability as a price.
Funnel recall is the contract with the retrieval team — they guarantee L1 doesn't prune what L2 would value.
Attribution window is the contract with the billing/measurement team — the label target must equal the billing target.

Naming these interfaces, their owners, and their auto-rollback guardrails is what distinguishes Staff from a strong Senior who optimizes one model in isolation.

Rollout — metrics, experimentation, and online evaluation

Three metric families

A single number cannot tell you if this system is healthy; you need three families, and a change can win one while losing another.

| Family | Metrics | What it protects |
| --- | --- | --- |
| Ranking quality | AUC, log-loss, GAUC (group-AUC) | Does it order ads correctly |
| Calibration | ECE, per-slice calibration ratio | Is the probability a true price |
| Business | Revenue, advertiser ROI / realized-CPA, fill rate, retention | Does it make money without harming advertisers |

GAUC (AUC computed per user/group then averaged) matters because global AUC can be inflated by easy cross-user ordering while per-user ranking is poor.
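The gap between the two can be reproduced on a toy log. The sketch below uses an impression-weighted average of per-user AUC and skips users whose labels are all one class (AUC is undefined there); both choices are common conventions assumed here, and the data is made up.

```python
from collections import defaultdict


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(rows):
    """rows: (user_id, score, label). Impression-weighted mean of per-user AUC."""
    by_user = defaultdict(list)
    for user, score, label in rows:
        by_user[user].append((score, label))
    weighted, weight = 0.0, 0
    for items in by_user.values():
        labels = [y for _, y in items]
        if len(set(labels)) < 2:
            continue   # AUC undefined for a user with only one class
        scores = [s for s, _ in items]
        weighted += len(items) * auc(scores, labels)
        weight += len(items)
    return weighted / weight if weight else float("nan")


# u1 is a heavy clicker with high scores overall; u2 rarely clicks.
# Within each user the model orders ads exactly backwards.
rows = [
    ("u1", 0.80, 1), ("u1", 0.82, 1), ("u1", 0.84, 1), ("u1", 0.86, 1), ("u1", 0.90, 0),
    ("u2", 0.10, 1), ("u2", 0.20, 0), ("u2", 0.22, 0), ("u2", 0.24, 0), ("u2", 0.26, 0),
]
global_auc = auc([s for _, s, _ in rows], [y for _, _, y in rows])
print(global_auc, gauc(rows))
```

Global AUC gets credit for knowing that u1 clicks more than u2, even though the within-user ordering (the thing the auction actually depends on per request) is wrong for both users.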

A/B with guardrails

Run budget-split or user-split experiments with guardrail metrics on per-slice calibration and pacing fairness, not just topline revenue. The failure to catch: a treatment that lifts revenue by silently overcharging CPA campaigns is a loss — advertisers churn next quarter. Watch advertiser-side harm (realized CPA drift) as a hard guardrail that can block a revenue-positive launch.

Counterfactual / replay evaluation

There is a real offline/online gap: AUC improvements offline routinely fail to move online revenue, because of feedback loops and selection bias (the model only sees outcomes on ads it chose to show). Counterfactual / replay evaluation and inverse-propensity weighting (IPW) estimate online impact before shipping — replay logged auctions under the new model, reweighting by the propensity of the logged action, to get an unbiased estimate without a full A/B.
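A minimal inverse-propensity estimator looks like the following. It is a sketch, assuming each log row records the propensity with which the logging policy took the logged action and the probability the new policy would take the same action; the weight clip value is a hypothetical choice. The estimate is only unbiased when every action the new policy can take had non-zero logging propensity, which in practice requires some randomization in serving.

```python
def ipw_value(logs, clip=10.0):
    """Inverse-propensity estimate of a new policy's mean reward from logged data.

    logs: iterable of (reward, logging_propensity, new_policy_propensity), all
    for the action that was actually logged. Weights are clipped to limit
    variance, which trades a little bias for stability.
    """
    total, n = 0.0, 0
    for reward, p_log, p_new in logs:
        if p_log <= 0.0:
            raise ValueError("logged action must have positive logging propensity")
        weight = min(p_new / p_log, clip)
        total += weight * reward
        n += 1
    return total / n


# Logging policy showed ad X or ad Y with probability 0.5 each.
# New policy always shows X; X earned reward 1, Y earned reward 0.
logs = [(1.0, 0.5, 1.0), (0.0, 0.5, 0.0)]
print(ipw_value(logs))
```

Rows where the new policy would not have taken the logged action get weight 0, and rows where it would are up-weighted by how rarely the logging policy took that action.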

Drift detection and retrain triggers

Monitor the prediction distribution, per-slice calibration ratio, and feature staleness. Trigger retrain or recalibration when calibration breaches the band or the conversion mix shifts (seasonality, a new advertiser cohort). And keep the maximization-bias guardrail from the calibration section: monitor calibration conditioned on winning, because predictions among winners can exceed their realized rates even for a globally-calibrated model, and winners are the population that actually gets billed.

Bottlenecks, edge cases, and failure modes

Cold-start (new advertiser / campaign / creative):

No click or conversion history → high-variance pCTR/pCVR.
Mitigate with content/embedding priors (use creative and category features when ID features are empty), an exploration budget (Thompson sampling / UCB) to gather data, and Platt-scaled per-slice recalibration for the sparse slice.

Feedback loops and selection bias:

The model only observes clicks/conversions on ads it chose to show, so its training data is shaped by its own past decisions — selection bias that compounds over time.
Mitigate with exploration, ESMM entire-space modeling, and propensity correction (IPW).

Click / conversion fraud and bots:

Inflate CTR and corrupt labels — a bot farm makes a creative look great and poisons training.
Filter invalid traffic upstream, exclude it from training, and monitor anomalous per-slice rates as an early signal.

Infra bottlenecks:

Embedding-table memory and feature-join tail latency dominate.
Shard embeddings across hosts, cache hot entities, bound the candidate count entering L2, and degrade gracefully to L1-only scores under load rather than blowing the latency budget.

Delayed-label edge cases:

Conversions beyond the attribution window are uncountable by design — accept the truncation bias (it's a definitional choice, matched to billing).
Duplicate-conversion dedup and cross-device attribution are measurement traps that silently bias pCVR — a conversion double-counted or attributed to the wrong device corrupts the label the model is calibrated against.


Summary

Pillar 1 — the score is a price. Rank by eCPM = pCTR × pCVR × bid × value, and own calibration as a billing-critical SLO (ECE + per-slice calibration ratio) with guardrails and auto-rollback. A miscalibrated model bills wrong even when it ranks perfectly.

Pillar 2 — delayed labels are a design constraint, not a footnote. Choose waiting-window vs ingest-then-correct (DEFER / PU loss + importance sampling) by reasoning about the bias/variance/freshness tradeoff, and match the attribution window to billing so pCVR is calibrated to the target advertisers are actually charged against.

Pillar 3 — entire-space, multi-task value modeling. Use ESMM (pCTCVR = pCTR × pCVR over all impressions) to kill sample-selection bias, share embeddings to rescue the data-starved pCVR head, and MMoE towers to reduce task interference — addressing CVR sparsity head-on.

Clean boundaries are the Staff tell: pacing multiplies eCPM, never the probability; allocation (rank) is separate from pricing (GSP/VCG externality); and the calibration SLO, funnel recall, and attribution window are explicit contracts with the billing, retrieval, and measurement teams.

Through the role lenses: the SDE owns the funnel and join latency but must reason about why the output is a probability; the MLE owns calibration and delayed-label correctness and ties each to a revenue consequence; the switcher transfers serving/freshness instincts and builds the new muscle of “score = probability = price.”

If you remember one thing: a perfectly-ranked but miscalibrated ads model is worse than useless — it bills wrong and paces wrong. Calibration plus correct delayed labels are the whole game; the bigger model is secondary.


Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing | Builds a pCTR model and ranks by predicted score; treats the bid as an input multiplier. | Frames the score as a PRICE: ranks by eCPM = pCTR × pCVR × bid × value, and states upfront that the probability must be calibrated because billing and pacing consume it directly. |
| Calibration | Mentions Platt/isotonic recalibration and that cross-entropy loss encourages calibration. | Owns calibration as an SLO: per-slice ECE / predicted-over-actual ratio, isotonic for global + Platt for sparse new-advertiser slices, online recalibration cadence, and auto-rollback when calibration drifts. |
| Delayed conversions | Notes conversions are delayed and proposes a fixed waiting window before labeling. | Quantifies the bias/variance/freshness tradeoff: waiting-window latency vs ingest-then-correct (DEFER/fake-negative + importance sampling, PU loss), duplicated real-negatives, and T+1/T+3/T+7 snapshot retraining. |
| Multi-task / value modeling | Trains pCTR and pCVR as two separate models. | Uses entire-space modeling (ESMM: pCTCVR = pCTR × pCVR over all impressions) to kill sample-selection bias, shares embeddings, and uses MMoE task towers; reasons about CVR data sparsity. |
| Serving & funnel consistency | Describes retrieval → ranking and a feature store join under a latency budget. | Designs L1/L2 consistency so the cheap retrieval model doesn't prune candidates the ranker would value; monitors funnel recall; bounds feature-join tail latency and staleness. |
| Auction & pacing | Ranks by eCPM and charges the bid or a second price. | Separates allocation (rank) from pricing (GSP/VCG externality), and integrates budget pacing (PID / probabilistic throttling) as a multiplier that must not corrupt calibration. |
| Metrics & ops | Tracks AUC/log-loss offline and CTR online. | Distinguishes ranking metrics (AUC) from calibration metrics (ECE, calib ratio) from business metrics (revenue, advertiser ROI); runs A/B with guardrails on per-slice calibration and pacing fairness; plans drift detection + retrain triggers. |
