Design a Multi-Stage Short-Video Feed Ranking System
A Staff-level walk through the four-stage feed funnel — billions of items narrowed to ~10 per page through retrieval, pre-rank, late-rank, and re-rank — the kind of system Instagram Reels, TikTok, YouTube Shorts, and Snap Spotlight build and interview AI/ML roles on. The core tension is serving (sub-50ms, scatter-gather ANN over sharded billions) against modeling (an MMoE multi-task value model blending engagement heads, plus correcting the feedback loop where the ranker only ever sees its own past choices). Staff is won on calibrated value blending, pre-rank/late-rank consistency, and explicit exposure-bias correction with logged propensities.
Scope — Frame the problem & who's asking
This isn’t “build a recommender” — it’s “narrow billions of items to ~10 in under 50ms while learning an objective you never directly observe.” The funnel is the systems half; the value model and its feedback loop are the ML half.
Short-video feeds at Instagram Reels, TikTok, YouTube Shorts, and Snap Spotlight are the canonical setting for this question, and the AI/ML roles behind them get interviewed on exactly this funnel. Before drawing boxes, pin down what is actually being tested and by whom.
Who asks this & what they probe
Different interviewers stress different halves of the system. Knowing which lens you’re under tells you where to spend your 60 minutes.
The real problem
The ask: an infinite, personalized short-video feed that serves roughly 10 items per page from a corpus of billions, ranked by predicted value to the user, under a hard interactive latency budget. The hard parts are not “use a recommender.” They are (1) compute — you cannot score billions of items with a heavy model per request, so you need a funnel; and (2) learning — there is no single observed label that equals “value,” and the only training data you ever collect is engagement on items your own ranker already chose to show.
Functional vs non-functional requirements
Functional:
- Infinite personalized feed, paginated ~10 items at a time.
- Rank candidates by a predicted value score blending multiple engagement signals.
- Enforce integrity/safety filters (a hard gate, not a soft weight).
- Diversity within a page; freshness so new items and new creators are reachable.
Non-functional:
- p99 end-to-end under 50ms for ranking compute (excludes video CDN delivery, which is a separate streaming concern).
- Billions of DAU; corpus on the order of 10^9 videos; tens of thousands of requests/sec/region.
- Graceful degradation under overload — never blow the page budget.
Scale & SLO assumptions
The load-bearing tension
Everything below is a negotiation between two forces: the serving budget (sub-50ms, scatter-gather ANN over sharded billions) and a learned value objective that is biased by the model’s own past exposures — a closed loop where today’s model trains on data that yesterday’s model selected. I’ll return to this tension repeatedly; it’s where Staff is won.
Requirements — The funnel: fan-out & latency budget
Why a funnel exists at all
The funnel is a compute-allocation decision, not an accident of history. Per-request cost is roughly candidates × per-item model cost. Scoring 10^9 items with a heavy DNN per request is seven to eight orders of magnitude over any interactive budget. So the system spends each stage’s compute by trading model cost against candidate count: cheap models over many candidates early, expensive models over few candidates late. Predictive power is prioritized over scalability precisely as the candidate set shrinks.
The four stages
Retrieval (sourcing): turn billions into ~thousands. Two-tower embeddings + ANN, plus non-embedding sources. Cheapest per item.
Pre-rank (early-rank): ~thousands to ~hundreds. A vector-product or small MLP scorer — must be cheap because it runs over thousands of items.
Late-rank: ~hundreds to ~tens. The heavy MMoE multi-task DNN. This is where most predictive power and most latency live.
Re-rank: ~tens to ~10. List-wise diversity (DPP/MMR) plus integrity gating. Turns pointwise scores into a coherent page.
A latency budget that adds up
As a defensible external anchor, published two-tower retrieval systems report roughly 40ms p99 on CPU at ~20k QPS for first-stage retrieval; with a precomputed item-embedding cache and a single ANN call, an 8ms retrieval slice within a tighter overall budget is realistic.
Degradation path
On overload, shed work in priority order, never the SLO: reduce the pre-rank candidate count (fewer items into late-rank), serve a cached candidate list for the request, or fall back to a cheaper scorer for late-rank. The page renders on time with slightly worse ranking — graceful, not catastrophic.
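The shedding order above can be sketched as a small tier selector. This is a minimal illustration: the `RankPlan` type, the `plan_for_load` function, the load thresholds, and the candidate counts are all hypothetical and would be tuned per system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankPlan:
    late_rank_candidates: int    # items passed from pre-rank into late-rank
    use_cached_candidates: bool  # reuse a cached candidate list instead of fresh retrieval
    use_cheap_late_ranker: bool  # swap the heavy late-rank model for a cheaper scorer


def plan_for_load(load: float) -> RankPlan:
    """Pick a degradation tier from a load factor (1.0 = at capacity).

    Thresholds and counts are illustrative assumptions, not measured values.
    """
    if load < 0.8:
        return RankPlan(500, False, False)   # normal operation
    if load < 1.0:
        return RankPlan(200, False, False)   # step 1: fewer items into late-rank
    if load < 1.2:
        return RankPlan(200, True, False)    # step 2: serve cached candidates
    return RankPlan(100, True, True)         # step 3: cheaper late-rank scorer
```

Each tier removes work while leaving the page deadline untouched, which is the property the degradation path is meant to protect.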
Estimation — Retrieval & candidate sourcing
Two-tower model
The retrieval workhorse is a dual-encoder: a user tower and an item tower trained to put a user near the items they’ll engage with in a shared embedding space. The item tower is computed offline and the embeddings cached; the user tower is computed online from request features. So per request, retrieval is one user-embedding forward pass plus an ANN lookup — that’s what keeps it sub-10ms. Training uses in-batch negatives / sampled softmax so each positive is contrasted against many cheap negatives.
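As a rough sketch of the in-batch negatives objective, the function below computes a softmax cross-entropy where row i's positive is item i and every other item in the batch serves as a negative. It uses plain Python lists for clarity; the function names and the `temperature` parameter are illustrative assumptions, and a production trainer would use a tensor library.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs, item_embs, temperature=1.0):
    """Mean cross-entropy over a batch of (user, positive item) pairs.

    user_embs[i] is paired with item_embs[i]; the remaining items in the
    batch act as that user's negatives.
    """
    total = 0.0
    for i, u in enumerate(user_embs):
        logits = [dot(u, v) / temperature for v in item_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / len(user_embs)
```

Because popular items appear more often as in-batch negatives, published systems commonly add a sampling-bias correction to the logits; that step is omitted here.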
Don’t rely on a single ANN source
A single personalized ANN both narrows diversity and feeds the closed loop. Blend multiple sources, then dedup and union:
- Embedding ANN — personalized two-tower retrieval.
- Follow / social-graph — items from accounts the user follows or close-network signals.
- Trending / popular — globally or regionally hot content.
- Fresh / cold-start — newly uploaded items and new-creator pools with little interaction history.
Source diversity is itself a hedge against exposure bias: if every candidate comes from one personalized model, the system can only reinforce what it already believes.
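One way to blend, dedup, and union the sources is a quota-capped round-robin merge. The sketch below assumes each source returns a ranked list of item IDs; the `union_candidates` name, the source names, and the quota scheme are hypothetical.

```python
def union_candidates(sources, per_source_quota):
    """Round-robin merge of ranked ID lists from several sources, deduplicated.

    sources: dict of source name -> ranked list of item IDs.
    per_source_quota: dict of source name -> max items that source may contribute.
    Returns a list of (item_id, source_name) in merge order.
    """
    seen, merged = set(), []
    taken = {name: 0 for name in sources}
    iters = {name: iter(ids) for name, ids in sources.items()}
    active = list(sources)
    while active:
        for name in list(active):
            if taken[name] >= per_source_quota.get(name, 0):
                active.remove(name)      # quota exhausted
                continue
            item = next(iters[name], None)
            if item is None:
                active.remove(name)      # source exhausted
                continue
            if item not in seen:         # first source to surface an item keeps it
                seen.add(item)
                merged.append((item, name))
                taken[name] += 1
    return merged
```

Keeping the source name alongside each item makes it possible to log which source produced a candidate, which helps later when auditing how much exposure each source receives.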
ANN index choice is a Staff-level trade-off
Pick the index by corpus size, RAM budget, and the recall the downstream ranker actually needs — not by default.
Sharding & scatter-gather
A billion-item index doesn’t fit on one machine, so shard it (random/hash sharding) and scatter-gather: fan the query to all shards, search in parallel, merge top-K at a query service.
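The scatter-gather pattern can be sketched as follows. The per-shard search here is a brute-force dot product standing in for a real ANN index, and the shard layout (a dict of item ID to vector) is an assumption made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Stand-in for a per-shard ANN search: exact top-k by dot product."""
    scored = ((sum(q * x for q, x in zip(query, vec)), item_id)
              for item_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def scatter_gather(shards, query, k):
    """Fan the query out to every shard in parallel, then merge a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, query, k), shards)
        # Each shard returns its own top-k, so the merged top-k is exact
        # with respect to what the shards found.
        return heapq.nlargest(k, (hit for part in partials for hit in part))
```

End-to-end latency is set by the slowest shard, so a real query service would typically add per-shard timeouts and merge whatever has arrived by the deadline.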
The item-embedding cache makes this cheap
Sub-10ms retrieval is only possible because item embeddings are precomputed and cached — the hot path never re-embeds items. A backfill pipeline re-embeds only new or updated items, so the request-time work is a single user embedding plus the ANN call. State the recall-vs-latency-vs-RAM trade explicitly: that’s the engineering judgment the interviewer is listening for.
API design — The multi-task value model
This is the ML core: how raw, partial engagement signals become one number the funnel can sort on.
Heads & labels
The model predicts several correlated engagement targets, such as like, share, watch-through, and skip, plus watch-time.
Labels must be defined precisely, not hand-waved. Watch-through needs an explicit threshold with duration normalization so a long video isn’t penalized just for being long. Negatives come from in-feed skips plus sampled non-impressions; dedup repeated impressions of the same item so they don’t dominate.
Architecture: MMoE / PLE
Use MMoE (Multi-gate Mixture-of-Experts: a shared expert pool with a per-task gating network) or PLE (Progressive Layered Extraction, which adds task-specific experts on top of shared ones). Both let tasks share representation where objectives agree and specialize where they conflict — the standard industrial pattern for exactly this multi-objective feed problem. Multi-task learning here beats training each head alone because the objectives are correlated, so a shared representation generalizes better and amortizes serving cost into one forward pass.
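To make the gating idea concrete, here is a minimal single-layer MMoE forward pass in plain Python. The weight shapes, the ReLU experts, and the single-vector sigmoid towers are simplifying assumptions; real models stack deeper experts and towers.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mmoe_forward(x, expert_ws, gate_ws, tower_ws):
    """Shared ReLU experts, one softmax gate and one sigmoid tower per task.

    expert_ws: list of [hidden x input] matrices, one per expert.
    gate_ws:   list of [num_experts x input] matrices, one per task.
    tower_ws:  list of [hidden] vectors, one per task.
    Returns one probability per task.
    """
    experts = [[max(0.0, h) for h in matvec(W, x)] for W in expert_ws]
    outputs = []
    for gate_w, tower_w in zip(gate_ws, tower_ws):
        gate = softmax(matvec(gate_w, x))           # task-specific weights over experts
        mixed = [sum(g * e[d] for g, e in zip(gate, experts))
                 for d in range(len(experts[0]))]   # gated mixture of expert outputs
        logit = sum(w * h for w, h in zip(tower_w, mixed))
        outputs.append(1.0 / (1.0 + math.exp(-logit)))
    return outputs
```

All tasks share the expert computation, so adding a head costs one extra gate and tower rather than a second full model.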
Watch-time via weighted logistic regression
Don’t regress raw seconds: watch-time is heavy-tailed and duration-biased, so a regression chases outliers and rewards long videos. Instead use weighted logistic regression (YouTube’s method, also used at Kuaishou): positive impressions are weighted by observed watch time, negatives carry weight 1. The learned odds (the exponentiated logit at serving time) then approximate expected watch time, giving a well-behaved head that trains with a classification loss rather than a squared error on raw seconds.
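The reason the odds track expected watch time can be seen from the weighted counts. Assume N impressions, of which k are positives with watch times T_i, and treat negatives as contributing zero watch time; the symbols N, k, T_i, and P are introduced here for illustration.

```latex
\text{odds}
  = \frac{\sum_{i=1}^{k} T_i}{N - k}
  = \frac{\mathbb{E}[T]}{1 - P}
  \approx \mathbb{E}[T]\,(1 + P)
  \approx \mathbb{E}[T],
\qquad
\mathbb{E}[T] = \frac{1}{N}\sum_{i=1}^{k} T_i,
\quad
P = \frac{k}{N}.
```

The approximation is tightest when positives are rare (small P); if the positive label fires often, the odds overstate expected watch time by the factor 1/(1 - P).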
The value blend
Combine the calibrated heads into one score:
It’s a weighted sum of calibrated head outputs. The negative term explicitly penalizes predicted skips so the score isn’t purely additive optimism.
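One form consistent with this description is shown below. The exact head set is an assumption: it uses like, share, and skip probabilities plus an expected watch-time magnitude, with a per-head weight.

```latex
V(u, i) = w_{\text{like}}\,\hat{p}_{\text{like}}(u, i)
        + w_{\text{share}}\,\hat{p}_{\text{share}}(u, i)
        + w_{\text{wt}}\,\widehat{T}(u, i)
        - w_{\text{skip}}\,\hat{p}_{\text{skip}}(u, i)
```

Here the \hat{p} terms are calibrated probabilities, \widehat{T} is the expected watch-time output, and all weights are non-negative so the skip term can only lower the score.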
Why blend instead of ranking on one label
No single observable label equals “value.” Likes are sparse and demographically biased; raw watch-time over-rewards long videos; clicks over-reward clickbait. Blending hedges across these failure modes — and, crucially, lets the product tune the trade-off.
Staff move: the weights w_i are a product/policy lever, not a fixed hyperparameter. They encode the platform’s definition of value (engagement vs. creator growth vs. session length) and are tuned via online experiments along a Pareto front — raising w_share for virality trades against watch-time, and that trade is a business decision surfaced through the weights.
Data model — Calibration & the blended score
Why calibration matters for a weighted sum
A weighted sum only means something if the heads live on a comparable, calibrated scale. If P(like) and P(share) aren’t true probabilities, the weights w_i can’t be interpreted, and an over-confident head silently dominates the blend regardless of its real value. This is the key difference from a generic single-task ranker, where only monotonicity of the score matters — here absolute calibration matters because you’re summing across heads and comparing scores across surfaces.
Per-head calibration
Calibrate each head on a held-out set with Platt scaling (a logistic fit on the scores) or isotonic regression (a monotonic non-parametric fit) so that, e.g., predicted P(like) = 0.1 actually fires ~10% of the time. The watch-time head is a magnitude rather than a probability, so the weights w_i also absorb the scale gap between it and the probability heads.
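As an illustration of the isotonic option, the sketch below fits a monotonic step function with the pool-adjacent-violators algorithm on held-out (score, label) pairs. The function names are hypothetical, and a library implementation would normally be used in practice.

```python
import bisect


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit.

    Returns (sorted_scores, fitted) where fitted is non-decreasing in score
    and each value is the mean label of its pooled block.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [x for x, _ in pairs], fitted


def isotonic_predict(score, sorted_scores, fitted):
    """Step-function lookup: calibrated value of the nearest score at or below."""
    i = bisect.bisect_right(sorted_scores, score) - 1
    return fitted[max(i, 0)]
```

Isotonic regression needs more held-out data than Platt scaling to avoid overfitting, so sparse heads such as share may be better served by the parametric fit.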
Composite calibration
Calibrating the parts is necessary but not sufficient. Recent work flags that combining individually-calibrated models does not guarantee a calibrated composite, so calibrate the blended score against the composite outcome as well — not just each head in isolation.
Worked intuition
Suppose P(share) is systematically over-predicted by 2x. Then every unit of w_share silently double-counts shares: raising it to “value sharing more” actually injects twice the intended weight, and the ranking tilts toward share-baity content the team never meant to favor. Calibration is precisely what makes the weights mean what you think they mean.
Calibration drifts — monitor it
The population and the feedback loop both shift the predicted-vs-actual relationship over time, so recalibrate on a rolling window and treat calibration as a first-class metric: track reliability curves and Expected Calibration Error (ECE) alongside AUC, not as an afterthought.
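A minimal ECE computation, assuming equal-width probability bins (the bin count of 10 is a common default, not something fixed by the text):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    positive rate, over equal-width bins on [0, 1]."""
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum_prob, sum_label, count]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    n = len(probs)
    return sum(abs(sp / c - sy / c) * c / n for sp, sy, c in bins if c)
```

Computing this per head on a rolling window, and alerting when it crosses a threshold, is one way to turn calibration into a monitored metric.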
High-level architecture — Re-rank: diversity, freshness, integrity
Why a list-wise re-rank
Pointwise value scores rank each item in isolation and ignore the page as a whole — so the top-N by V might be five near-duplicate clips from one creator. Re-rank takes the top-N and optimizes a list-wise objective: diversity, freshness, fatigue, and business rules. This is where pointwise scores become an actual list.
DPP vs MMR
DPP (Determinantal Point Processes) is YouTube’s CIKM’18 production approach for feed diversity: a kernel encodes item relevance and pairwise similarity, served with approximate greedy inference, and it drove both short- and long-term engagement gains on the homepage. MMR (Maximal Marginal Relevance) is the simpler greedy alternative — cheaper, but it only looks at similarity to the running selection rather than the set as a whole.
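The MMR alternative is short enough to sketch directly. The `relevance` map, the `similarity` callable, and the trade-off parameter `lam` are assumptions supplied by the caller.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    At each step pick the item maximizing
        lam * relevance - (1 - lam) * max similarity to already-selected items.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this reduces to sorting by relevance; lowering `lam` pushes near-duplicates, such as two clips from the same creator, further apart in the list.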
Integrity is a hard gate, not a soft weight
Borderline or violating content is removed or demoted before final selection — this is a non-negotiable filter sitting directly in the serving path, not a term in the value blend. You cannot let a high V “buy back” a safety violation.
Other re-rank constraints
- Creator/source de-duplication — avoid three clips from one creator in a row.
- Freshness boosts — give new items a chance to accumulate signal.
- Per-user fatigue — suppress already-seen items and recently-shown topics.
Exploration slots live here
Re-rank is also where exploration slots are injected: a small fraction of positions reserved for under-exposed or cold-start candidates. That’s the hook for closing the feedback loop — covered next.
Deep dive — Closing the loop: exposure bias
Where Staff is won. This is the Staff differentiator. Everything above is a competent funnel; this section is what separates a Staff answer from a Senior one. The core realization: the model partly chooses its own training data. Tomorrow’s model trains on engagement collected from items today’s model decided to show, so the deliverable isn’t just a better scorer, it’s a data-collection-and-correction policy.
Name the biases precisely
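Vague gestures at “users only see recommended items” aren’t enough. Name the mechanisms: exposure bias (feedback exists only for items the ranker chose to show), position bias (higher slots draw more engagement regardless of quality), and popularity bias (already-popular items accumulate more data and so get shown even more).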
All three are amplified by the feedback loop: each retrain bakes yesterday’s selection into today’s training distribution, and left uncorrected the system narrows toward a self-reinforcing filter bubble.
The logging contract comes first
Debiasing is impossible to retrofit without the right logs, so design the data contract before the model. At serving time, log:
- Serving propensity — P(item shown | request), the probability the policy assigned to showing this item.
- The full, un-truncated candidate set — not just the ~10 shown, so you can reason about items that could have been shown but weren't.
- Position — the slot each shown item occupied.
If you don’t log propensities and the un-truncated candidates, no amount of later modeling can recover unbiased estimates. This is the Staff “design the contract first” instinct.
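A sketch of what such a log record might look like. The class and field names (`CandidateLog`, `RequestLog`, `policy_version`, `source`) are hypothetical; only the three logged quantities listed above come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class CandidateLog:
    item_id: str
    source: str                      # which retrieval source produced the item (assumed field)
    propensity: float                # P(item shown | request) under the serving policy
    position: Optional[int] = None   # page slot, or None if the item was not shown

    def __post_init__(self):
        # A zero propensity cannot be inverted later, so reject it at write time.
        if not 0.0 < self.propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")


@dataclass
class RequestLog:
    request_id: str
    policy_version: str              # which ranker produced the propensities (assumed field)
    candidates: List[CandidateLog] = field(default_factory=list)  # full, un-truncated set

    def shown(self) -> List[CandidateLog]:
        """Items that actually reached the page, in slot order."""
        shown = [c for c in self.candidates if c.position is not None]
        return sorted(shown, key=lambda c: c.position)
```

Validating propensities when the record is written catches a broken logging path at serving time instead of at training time, when the data is already lost.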
Correction methods
- Inverse Propensity Scoring (IPS): weight each logged interaction by 1 / propensity, so over-exposed items are down-weighted and under-exposed interactions are up-weighted, yielding a less biased estimate of true value. Clip extreme weights to control variance.
- Position / debias tower: train a small auxiliary tower on position (and other exposure features) jointly with the main model, then drop it at serving so ranking reflects relevance rather than slot. This factorizes "got engagement because it was good" from "got engagement because it was on top."
- Counterfactual / off-policy evaluation: use logged propensities to estimate offline how a new policy would have performed, so you can vet changes before exposing users.
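The IPS and off-policy bullets combine into a clipped IPS estimator. In this sketch, `target_probs` are the probabilities a candidate new policy would assign to the logged actions; the function name and the default clip value are assumptions.

```python
def clipped_ips_estimate(rewards, logged_propensities, target_probs, clip=10.0):
    """Estimate a new policy's average reward from logs collected by an old policy.

    Each logged reward is reweighted by min(target_prob / logged_propensity, clip).
    Clipping trades a small bias for much lower variance when propensities are tiny.
    """
    if not rewards:
        return 0.0
    total = 0.0
    for r, p_log, p_new in zip(rewards, logged_propensities, target_probs):
        weight = min(p_new / p_log, clip)
        total += weight * r
    return total / len(rewards)
```

The estimate is only as good as the logged propensities: if the serving policy was deterministic for some items, their propensity is effectively 0 or 1 and no reweighting can recover what was never explored.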
Exploration is mandatory
Correction alone can’t fix data you never collected. Reserve a small slice of impressions, via epsilon-greedy or bandit slots (the exploration slots injected at re-rank), for under-exposed and cold-start items. This keeps the system collecting roughly unbiased signal on the tail and stops the corpus from ossifying into a filter bubble. The cost is a small, bounded engagement tax now in exchange for a healthier loop later.
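A minimal epsilon-greedy slot filler that also returns the propensity to log. It assumes the ranker's top item is not itself in the exploration pool; the function name and signature are illustrative.

```python
import random


def pick_with_exploration(top_item, explore_pool, epsilon, rng=random):
    """Fill one slot: with probability epsilon show a uniformly random item
    from the exploration pool, otherwise show the ranker's top item.

    Returns (item, propensity), where propensity is the probability this
    policy assigned to showing that item in this slot.
    """
    if explore_pool and rng.random() < epsilon:
        return rng.choice(explore_pool), epsilon / len(explore_pool)
    # With an empty pool the top item is shown with certainty.
    return top_item, (1.0 - epsilon) if explore_pool else 1.0
```

Because the propensity is known by construction, impressions from these slots are the cleanest data available for IPS-style correction and off-policy evaluation.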
Drift detection
Watch for the loop collapsing before it shows up as a metric regression:
- Offline-up / online-down divergence — offline AUC rising while online engagement falls is a classic feedback-loop tell (and a sample-selection-bias tell).
- Exposure concentration — track a Gini coefficient over creators/items; a rising Gini means the system is concentrating exposure.
- New-creator reach — monitor as an explicit fairness guardrail so new creators stay enterable.
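The exposure-concentration signal can be computed with a standard Gini coefficient over per-creator (or per-item) impression counts:

```python
def gini(exposures):
    """Gini coefficient of non-negative exposure counts.

    0.0 means perfectly even exposure; values approaching 1.0 mean a few
    creators or items receive nearly all impressions.
    """
    xs = sorted(exposures)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n
```

Tracking this over a fixed population and window matters, since the maximum attainable value is (n - 1) / n and therefore shifts with the number of creators counted.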
The one line that wins Staff
The closed loop means the model partly chooses its own training data — so the real deliverable is a data-collection-and-correction policy, not just a better scorer. Lead with that framing and the rest of this section reads as the implementation.
Rollout strategy — Funnel consistency & the feature store
Pre-rank to late-rank consistency
Two independently-trained rankers fight each other. If the cheap pre-ranker orders items differently from the heavy late-ranker, the pre-ranker filters out items the late-ranker would have loved — recall wasted before the good model ever sees it. Fix it with distillation: train the pre-ranker (student) to mimic the late-ranker’s (teacher) ordering, so their rankings align and the funnel stops discarding gold. The goal isn’t two accurate models; it’s two consistent ones.
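One common way to express "mimic the teacher's ordering" is a listwise distillation loss: the KL divergence between softmax distributions over the same candidate list. This is a sketch of that idea, not necessarily the loss any particular platform uses; the `temperature` parameter is an assumption.

```python
import math


def log_softmax(scores, temperature):
    z = [s / temperature for s in scores]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_norm for v in z]


def listwise_distill_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over softmax distributions on one candidate list.

    Zero when the student reproduces the teacher's relative scores; grows as
    the student moves probability mass away from items the teacher favors.
    """
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss depends only on score differences within a list, so the student is free to live on a different absolute scale as long as its ordering agrees with the teacher's.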
Sample-selection bias in the funnel
Each stage trains only on items that passed the previous stage (impressions), but at serving each stage must score all recalled candidates — including long-tail and cold items it never saw in training. Those items are out-of-distribution, so naive training over-trusts the head. Mitigate by adding entire-chain / cross-stage training samples (items sampled from earlier stages, not only impressions) so each model sees the distribution it actually scores. ESMM-style entire-space modeling is a concrete, known pattern for the impression-to-watch-through analog of click-to-conversion SSB.
Feature store in the hot path
Late-rank needs fresh user, item, and cross features for every candidate, and those reads sit inside the ~6ms feature-fetch budget. A single slow read silently blows the page SLO, so:
- Batch feature reads across candidates into few round-trips.
- Cache hot user/item features near compute.
Train/serve feature parity
The same feature transforms must run offline and online, computed point-in-time correct (using only data available at request time). Otherwise you get training/serving skew — a top cause of “great offline, bad online.”
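Point-in-time correctness amounts to an "as of" lookup: when building a training row for a request at time t, read the latest feature value written at or before t, never a later one. A minimal sketch with a hypothetical `FeatureHistory` class:

```python
import bisect


class FeatureHistory:
    """Timestamped values for one feature of one entity, kept sorted by time."""

    def __init__(self):
        self._ts = []
        self._values = []

    def write(self, ts, value):
        i = bisect.bisect_right(self._ts, ts)
        self._ts.insert(i, ts)
        self._values.insert(i, value)

    def as_of(self, ts):
        """Latest value written at or before ts, or None if none existed yet."""
        i = bisect.bisect_right(self._ts, ts)
        return self._values[i - 1] if i else None
```

Joining training labels against `as_of(request_time)` rather than the current value is what prevents future engagement from leaking into offline features.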
Feature freshness vs cost
Be explicit about which features are real-time vs batch:
- Real-time — last-N watched, current-session signals; updated per event.
- Batch — heavy item features, aggregate creator stats; updated on a backfill cadence.
Bottlenecks, observability & evolution — Eval, retraining & guardrails
Offline eval & splits
Use time-based splits, never random: train on the past, validate on the future, with point-in-time feature snapshots. Random splits leak future engagement into training and inflate every offline metric. Treat offline as a filter, not a decision — the closed loop makes offline/online divergence common.
Online A/B + guardrails
Ship via A/B with engagement as the primary metric (watch-time, retention, DAU) but gate on guardrails — integrity violation rate, diversity/exposure concentration, new-creator reach, report/skip rates. Ship only if the primary metric wins and every guardrail holds. A watch-time win that tanks new-creator reach is not a ship.
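The ship rule can be written down as a simple gate. In this sketch every delta is oriented so that positive means better, and `max_regression` holds the tolerated drop per guardrail; both conventions and the metric names are assumptions. A real decision would also require statistical significance, which is omitted here.

```python
def should_ship(primary_lift, guardrail_deltas, max_regression):
    """Ship only if the primary metric improves and no guardrail regresses
    beyond its tolerance.

    guardrail_deltas: metric name -> observed change (positive = better).
    max_regression:   metric name -> largest tolerated drop (non-negative).
    """
    if primary_lift <= 0.0:
        return False
    return all(guardrail_deltas[name] >= -limit
               for name, limit in max_regression.items())
```

For example, a 2% watch-time lift paired with a 5% drop in new-creator reach fails the gate when the tolerance on reach is 1%.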
Retrain cadence
State a defensible number rather than “frequently.” Run incremental/continual training on a short window — hours to a day — to track fast preference and content drift, with full retrains less often (e.g., weekly) to reset accumulated drift. The short window is what keeps the model current with what’s trending today.
Monitor the loop itself
Feed the exposure-bias drift signals back into retraining decisions: exposure Gini, the fraction of impressions coming from exploration slots, and offline-up / online-down alerts. These tell you whether the loop is healthy or quietly collapsing.
Cold-start & freshness as ongoing ops
Keep dedicated fresh-content pools and exploration running continuously so new items and creators stay enterable. Without them the corpus ossifies and the feed becomes a filter bubble — the failure mode this whole design exists to prevent.
Summary
A Staff answer nails six load-bearing moves. Each maps a Senior instinct onto its Staff upgrade.
Funnel with real numbers. Senior: “retrieve then rank.” Staff: billions to thousands to hundreds to tens to ~10, with a latency budget that sums to under 50ms and model complexity rising as candidate count falls — because cost is candidates × per-item model cost.
One calibrated value score. Senior: predict a few probabilities and sum them. Staff: an MMoE/PLE multi-task model with watch-time via weighted logistic regression, heads calibrated (Platt/isotonic) so a weighted sum is meaningful, and the weights framed as a tunable Pareto/product lever — not a fixed hyperparameter.
The closed-loop exposure-bias story, end to end. Senior: “users only see recommended items.” Staff: name exposure, position, and popularity bias; design the propensity-logging contract first; correct with IPS plus a drop-at-serving debias tower; add exploration slots; and detect drift via exposure Gini and offline-up/online-down divergence. This is the deep-dive differentiator.
Funnel consistency. Senior: two separate rankers. Staff: distill late into early for rank consistency, fix sample-selection bias with entire-chain samples, and hold train/serve feature parity through a feature store kept inside the hot-path budget.
Disciplined ops. Senior: track AUC, ship if it rises. Staff: time-based splits with point-in-time snapshots, guardrailed A/B (integrity, diversity, creator fairness), and a defensible incremental-retrain cadence (hours-to-a-day) that keeps the loop from collapsing into a filter bubble.
The one-sentence Staff signal: The model partly chooses its own training data, so the deliverable is a data-collection-and-correction policy, not just a better scorer.
Rubric — Senior vs Staff