AI System Design · Staff · Ranking Funnel · Multi-Task Learning

Design a Multi-Stage Short-Video Feed Ranking System

A Staff-level walk through the four-stage feed funnel — billions of items narrowed to ~10 per page through retrieval, pre-rank, late-rank, and re-rank — the kind of system Instagram Reels, TikTok, YouTube Shorts, and Snap Spotlight build and interview AI/ML roles on. The core tension is serving (sub-50ms, scatter-gather ANN over sharded billions) against modeling (an MMoE multi-task value model blending engagement heads, plus correcting the feedback loop where the ranker only ever sees its own past choices). Staff is won on calibrated value blending, pre-rank/late-rank consistency, and explicit exposure-bias correction with logged propensities.

Level: Staff
Category: AI System Design
Interview time: 60 min
WHAT THIS QUESTION TESTS

  • Can you draw the four-stage funnel with concrete fan-out numbers and a latency budget that adds up to <50ms?
  • Do you blend engagement heads into one calibrated value score, and justify the weights, rather than ranking on a single label?
  • Can you name the closed-loop exposure bias and a concrete correction (logged propensities, IPS, exploration)?
  • Do you keep early-rank and late-rank consistent, and explain sample-selection bias from training only on impressions?

★ STAFF-LEVEL SIGNALS

  • Treats the value-model weights as a product/policy lever (Pareto trade-offs, calibration), not a fixed hyperparameter
  • Designs the data-logging contract first (propensities, un-truncated candidates) because debiasing is impossible without it
  • Reasons about pre-rank↔late-rank rank consistency and distillation, not just two independent models
  • Quantifies the retrain cadence and drift detection that keep the closed loop from collapsing into a filter bubble

0

Scope — Frame the problem & who's asking

This isn’t “build a recommender” — it’s “narrow billions of items to ~10 in under 50ms while learning an objective you never directly observe.” The funnel is the systems half; the value model and its feedback loop are the ML half.

Short-video feeds at Instagram Reels, TikTok, YouTube Shorts, and Snap Spotlight are the canonical setting for this question, and the AI/ML roles behind them get interviewed on exactly this funnel. Before drawing boxes, pin down what is actually being tested and by whom.

Who asks this & what they probe

Different interviewers stress different halves of the system. Knowing which lens you’re under tells you where to spend your 60 minutes.

| Interviewer | What they probe | What "good" looks like |
|---|---|---|
| SDE | Serving funnel, fan-out math, latency budget, sharded ANN scatter-gather, feature store in the hot path | A budget that sums to under 50ms; tail latency bounded; degradation path |
| MLE | Multi-task value model, label definition, watch-time modeling, calibration, feedback-loop correction | One calibrated value score; weighted-LR watch-time; named biases + fixes |
| Switcher (SDE to AI) | Whether familiar infra (caches, p99, sharding) maps onto an unfamiliar ML objective; why one label isn't enough | Anchors ML in known systems; explains why training data is biased by the model's own choices |

The real problem

The ask: an infinite, personalized short-video feed that serves roughly 10 items per page from a corpus of billions, ranked by predicted value to the user, under a hard interactive latency budget. The hard parts are not “use a recommender.” They are (1) compute — you cannot score billions of items with a heavy model per request, so you need a funnel; and (2) learning — there is no single observed label that equals “value,” and the only training data you ever collect is engagement on items your own ranker already chose to show.

Functional vs non-functional requirements

Functional:

  • Infinite personalized feed, paginated ~10 items at a time.
  • Rank candidates by a predicted value score blending multiple engagement signals.
  • Enforce integrity/safety filters (a hard gate, not a soft weight).
  • Diversity within a page; freshness so new items and new creators are reachable.

Non-functional:

  • p99 end-to-end under 50ms for ranking compute (excludes video CDN delivery, which is a separate streaming concern).
  • Hundreds of millions of DAU; corpus on the order of 10^9 videos; tens of thousands of requests/sec/region.
  • Graceful degradation under overload — never blow the page budget.

Scale & SLO assumptions

| Quantity | Assumption |
|---|---|
| Corpus | ~10^9 videos |
| Per-request funnel | billions to ~thousands to ~hundreds to ~tens to ~10 shown |
| Latency SLO | p99 ranking compute under 50ms |
| Throughput | tens of thousands QPS per region |
| Real-world anchor | Kuaishou is publicly ~400M DAU as a scale reference |

The load-bearing tension

Everything below is a negotiation between two forces: the serving budget (sub-50ms, scatter-gather ANN over sharded billions) and a learned value objective that is biased by the model’s own past exposures — a closed loop where today’s model trains on data that yesterday’s model selected. I’ll return to this tension repeatedly; it’s where Staff is won.

1

Requirements — The funnel: fan-out & latency budget

Why a funnel exists at all

The funnel is a compute-allocation decision, not an accident of history. Per-request cost is roughly candidates × per-item model cost. Scoring 10^9 items with a heavy DNN per request is seven to eight orders of magnitude over any interactive budget. So the system spends each stage’s compute by trading model cost against candidate count: cheap models over many candidates early, expensive models over few candidates late. Predictive power is prioritized over scalability precisely as the candidate set shrinks.

The four stages

Retrieval (sourcing): turn billions into ~thousands. Two-tower embeddings + ANN, plus non-embedding sources. Cheapest per item.

Pre-rank (early-rank): ~thousands to ~hundreds. A vector-product or small MLP scorer — must be cheap because it runs over thousands of items.

Late-rank: ~hundreds to ~tens. The heavy MMoE multi-task DNN. This is where most predictive power and most latency live.

Re-rank: ~tens to ~10. List-wise diversity (DPP/MMR) plus integrity gating. Turns pointwise scores into a coherent page.

| Stage | In | Out | Model class | Latency budget |
|---|---|---|---|---|
| Retrieval | ~10^9 | ~thousands | Two-tower + ANN | ~8ms |
| Pre-rank | ~thousands | ~hundreds | Vector-product / small MLP | ~5ms |
| Late-rank | ~hundreds | ~tens | Heavy MMoE DNN | ~18ms |
| Re-rank | ~tens | ~10 | DPP/MMR + integrity | ~5ms |

A latency budget that adds up

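| Component | Budget |
|---|---|
| Retrieval (user embed + ANN) | 8ms |
| Feature fetch (store reads) | 6ms |
| Pre-rank | 5ms |
| Late-rank | 18ms |
| Re-rank | 5ms |
| Overhead (RPC, merge, serde) | ~8ms |
| Total | ~50ms p99 (8 + 6 + 5 + 18 + 5 + ~8): right at the SLO, so any headroom has to come out of overhead or a stage slice |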
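A quick sanity check on that arithmetic, as a minimal sketch. It assumes the stages run strictly in sequence, so the slices simply add (per-stage p99s do not add exactly in practice). The dictionary keys are illustrative names; the numbers are copied from the table.

```python
# Per-component latency slices in milliseconds, copied from the budget table.
BUDGET_MS = {
    "retrieval": 8,
    "feature_fetch": 6,
    "pre_rank": 5,
    "late_rank": 18,
    "re_rank": 5,
    "overhead": 8,  # the table lists this slice as "~8ms"
}
SLO_MS = 50


def total_ms(budget):
    """Sum of all slices, assuming stages run one after another."""
    return sum(budget.values())


def headroom_ms(budget, slo_ms=SLO_MS):
    """Milliseconds left under the SLO; zero or negative means no slack."""
    return slo_ms - total_ms(budget)


print(total_ms(BUDGET_MS), headroom_ms(BUDGET_MS))
```

The slices as listed consume the whole budget, which is worth saying out loud in the interview: the degradation path is what buys back slack when a slice overruns.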

As a defensible external anchor, published two-tower retrieval systems report roughly 40ms p99 on CPU at ~20k QPS for first-stage retrieval; with a precomputed item-embedding cache and a single ANN call, an 8ms retrieval slice within a tighter overall budget is realistic.

Degradation path

On overload, shed work in priority order, never the SLO: reduce the pre-rank candidate count (fewer items into late-rank), serve a cached candidate list for the request, or fall back to a cheaper scorer for late-rank. The page renders on time with slightly worse ranking — graceful, not catastrophic.

2

Estimation — Retrieval & candidate sourcing

Two-tower model

The retrieval workhorse is a dual-encoder: a user tower and an item tower trained to put a user near the items they’ll engage with in a shared embedding space. The item tower is computed offline and the embeddings cached; the user tower is computed online from request features. So per request, retrieval is one user-embedding forward pass plus an ANN lookup — that’s what keeps it sub-10ms. Training uses in-batch negatives / sampled softmax so each positive is contrasted against many cheap negatives.
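To make the in-batch negatives idea concrete, here is a minimal pure-Python sketch of the loss, not a production training loop. It assumes row i of the user batch is paired with row i of the item batch, uses a plain dot product as the similarity, and omits the sampling-frequency correction that real systems commonly add. The function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs, item_embs, temperature=1.0):
    """Mean cross-entropy where user i's positive is item i and every
    other item in the batch acts as a negative."""
    losses = []
    for i, user in enumerate(user_embs):
        logits = [dot(user, item) / temperature for item in item_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


users = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_softmax_loss(users, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_softmax_loss(users, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # matching pairs give the lower loss
```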

Don’t rely on a single ANN source

A single personalized ANN both narrows diversity and feeds the closed loop. Blend multiple sources, then dedup and union:

  • Embedding ANN — personalized two-tower retrieval.
  • Follow / social-graph — items from accounts the user follows or close-network signals.
  • Trending / popular — globally or regionally hot content.
  • Fresh / cold-start — newly uploaded items and new-creator pools with little interaction history.

Source diversity is itself a hedge against exposure bias: if every candidate comes from one personalized model, the system can only reinforce what it already believes.

ANN index choice is a Staff-level trade-off

Pick the index by corpus size, RAM budget, and the recall the downstream ranker actually needs — not by default.

| Index | Mechanism | Strength | Cost / caveat |
|---|---|---|---|
| HNSW | In-memory proximity graph | Latency-optimal, high recall | RAM-heavy; hard at many billions |
| IVF-PQ | Inverted lists + product quantization | Memory-efficient, billion-scale on one box | Some recall loss from quantization |
| ScaNN | Anisotropic PQ tuned for inner product | Strong recall/latency for MIPS | Tuning complexity |
| DiskANN (Vamana) | Graph in RAM, codes on SSD | Fits billion-scale that won't fit in RAM | SSD I/O adds latency |

Sharding & scatter-gather

A billion-item index doesn’t fit on one machine, so shard it (random/hash sharding) and scatter-gather: fan the query to all shards, search in parallel, merge top-K at a query service.

query_emb = user_tower(request_features)            # online
parallel for shard in index_shards:                 # scatter
    shard_topk[shard] = shard.ann_search(query_emb, k=K_shard)
candidates = merge_topk(shard_topk, k=K)            # gather + merge
candidates = dedup(union(candidates,
                         graph_src, trending_src, fresh_src))
return candidates                                   # ~thousands
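The same flow as a small runnable sketch. It assumes each shard is an in-memory dict of item_id to embedding and uses an exact inner-product scan as a stand-in for the ANN search. The thread pool stands in for the RPC fan-out, and all names here are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def shard_search(shard, query, k):
    """Exact top-k by inner product inside one shard (stand-in for ANN)."""
    scored = ((dot(query, emb), item_id) for item_id, emb in shard.items())
    return heapq.nlargest(k, scored)


def scatter_gather(shards, query, k_shard, k):
    # Scatter: search every shard in parallel.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: shard_search(s, query, k_shard), shards))
    # Gather: merge the per-shard top-k lists into a global top-k.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [item_id for _, item_id in merged]


shards = [
    {"a": [1.0, 0.0], "b": [0.2, 0.1]},
    {"c": [0.9, 0.3], "d": [-1.0, 0.0]},
]
top = scatter_gather(shards, query=[1.0, 0.0], k_shard=2, k=3)
print(top)
```

Because every shard must answer before the merge, the slowest shard sets the latency, which is why tail latency per shard matters more than the mean.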

The item-embedding cache makes this cheap

Sub-10ms retrieval is only possible because item embeddings are precomputed and cached — the hot path never re-embeds items. A backfill pipeline re-embeds only new or updated items, so the request-time work is a single user embedding plus the ANN call. State the recall-vs-latency-vs-RAM trade explicitly: that’s the engineering judgment the interviewer is listening for.

3

API design — The multi-task value model

This is the ML core: how raw, partial engagement signals become one number the funnel can sort on.

Heads & labels

The model predicts several correlated engagement targets plus watch-time:

| Head | Label definition | Loss |
|---|---|---|
| P(watch-through) | Watched past a threshold (% of video or absolute seconds, duration-normalized) | Binary cross-entropy |
| P(like) | Explicit like within session | Binary cross-entropy |
| P(share) | Share / send action | Binary cross-entropy |
| P(comment) | Comment posted | Binary cross-entropy |
| P(follow) | Followed creator from item | Binary cross-entropy |
| E[watch-time] | Observed seconds watched | Weighted logistic regression (below) |

Labels must be defined precisely, not hand-waved. Watch-through needs an explicit threshold with duration normalization so a long video isn’t penalized just for being long. Negatives come from in-feed skips plus sampled non-impressions; dedup repeated impressions of the same item so they don’t dominate.

Architecture: MMoE / PLE

Use MMoE (Multi-gate Mixture-of-Experts: a shared expert pool with a per-task gating network) or PLE (Progressive Layered Extraction, which adds task-specific experts on top of shared ones). Both let tasks share representation where objectives agree and specialize where they conflict — the standard industrial pattern for exactly this multi-objective feed problem. Multi-task learning here beats training each head alone because the objectives are correlated, so a shared representation generalizes better and amortizes serving cost into one forward pass.

Watch-time via weighted logistic regression

Don’t regress raw seconds: watch-time is heavy-tailed and duration-biased, so a regression chases outliers and rewards long videos. Instead use weighted logistic regression (YouTube’s method, also used at Kuaishou): positive impressions are weighted by observed watch time, negatives carry weight 1. The learned odds (the exponential of the logit) then approximate expected watch time, and the approximation is tight when the positive rate is small. The result is a well-behaved head trained with an ordinary classification loss.
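A small numeric sketch of why the odds track expected watch time. It assumes the simplest possible model, an intercept-only logistic regression, where the weighted-loss optimum has a closed form: odds = (sum of positive weights) / (number of negatives). The function names and the toy data are illustrative.

```python
import math


def weighted_lr_odds(watch_seconds):
    """Closed-form odds for an intercept-only weighted logistic regression.
    Positives (watch time > 0) are weighted by seconds watched; negatives
    carry weight 1. Requires at least one negative."""
    total_watch = sum(t for t in watch_seconds if t > 0)
    num_negatives = sum(1 for t in watch_seconds if t <= 0)
    return total_watch / num_negatives


def loss_gradient(bias, watch_seconds):
    """Derivative of the weighted log loss w.r.t. the bias; zero at the optimum."""
    p = 1.0 / (1.0 + math.exp(-bias))
    grad = 0.0
    for t in watch_seconds:
        grad += -t * (1.0 - p) if t > 0 else p
    return grad


impressions = [30.0, 10.0] + [0.0] * 98            # 2 watched, 98 skipped
odds = weighted_lr_odds(impressions)                # 40 / 98
mean_watch = sum(impressions) / len(impressions)    # 40 / 100
print(odds, mean_watch)
```

The two numbers differ only by the factor N / (N - k), where k is the number of positives, so the gap shrinks as positives become rarer.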

The value blend

Combine the calibrated heads into one score:

V = w1 * P(watch_through)
  + w2 * E[watch_time]
  + w3 * P(like)
  + w4 * P(share)
  + w5 * P(comment)
  - w6 * P(skip / negative)

It’s a weighted sum of calibrated head outputs. The negative term explicitly penalizes predicted skips so the score isn’t purely additive optimism.
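The blend written as a function, as a minimal sketch. The weight values are made up for illustration and would in practice come from online experiments. The field names are hypothetical, and the inputs are assumed to be calibrated already.

```python
from dataclasses import dataclass


@dataclass
class HeadOutputs:
    p_watch_through: float
    e_watch_time: float  # expected seconds, a magnitude rather than a probability
    p_like: float
    p_share: float
    p_comment: float
    p_skip: float


# Hypothetical weights; the watch-time weight is small because it also
# absorbs the scale gap between seconds and probabilities.
WEIGHTS = {
    "watch_through": 1.0,
    "watch_time": 0.05,
    "like": 2.0,
    "share": 4.0,
    "comment": 3.0,
    "skip": 1.5,
}


def value_score(h, w=WEIGHTS):
    """Weighted sum of head outputs with an explicit skip penalty."""
    return (
        w["watch_through"] * h.p_watch_through
        + w["watch_time"] * h.e_watch_time
        + w["like"] * h.p_like
        + w["share"] * h.p_share
        + w["comment"] * h.p_comment
        - w["skip"] * h.p_skip
    )


item = HeadOutputs(0.6, 12.0, 0.05, 0.01, 0.02, 0.3)
print(round(value_score(item), 4))
```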

Why blend instead of ranking on one label

No single observable label equals “value.” Likes are sparse and demographically biased; raw watch-time over-rewards long videos; clicks over-reward clickbait. Blending hedges across these failure modes — and, crucially, lets the product tune the trade-off.

Staff move: the weights w_i are a product/policy lever, not a fixed hyperparameter. They encode the platform’s definition of value (engagement vs. creator growth vs. session length) and are tuned via online experiments along a Pareto front — raising w_share for virality trades against watch-time, and that trade is a business decision surfaced through the weights.

4

Data model — Calibration & the blended score

Why calibration matters for a weighted sum

A weighted sum only means something if the heads live on a comparable, calibrated scale. If P(like) and P(share) aren’t true probabilities, the weights w_i can’t be interpreted, and an over-confident head silently dominates the blend regardless of its real value. This is the key difference from a generic single-task ranker, where only monotonicity of the score matters — here absolute calibration matters because you’re summing across heads and comparing scores across surfaces.

Per-head calibration

Calibrate each head on a held-out set with Platt scaling (a logistic fit on the scores) or isotonic regression (a monotonic non-parametric fit) so that, e.g., predicted P(like) = 0.1 actually fires ~10% of the time. The watch-time head is a magnitude rather than a probability, so the weights w_i also absorb the scale gap between it and the probability heads.
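For intuition on the isotonic option, here is a minimal pool-adjacent-violators sketch in pure Python. It assumes the held-out labels have already been sorted by the head's raw score, and it returns one calibrated probability per example. A real pipeline would more likely use a library implementation and interpolate between fitted points.

```python
def isotonic_fit(labels_sorted_by_score):
    """Pool-adjacent-violators: returns a non-decreasing sequence of
    calibrated probabilities, one per input label."""
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels_sorted_by_score:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


calibrated = isotonic_fit([0, 1, 0, 1, 1])
print(calibrated)
```

The out-of-order pair in the middle (a positive scored below a negative) gets pooled to their shared rate, which is exactly the monotonic repair isotonic regression performs.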

Composite calibration

Calibrating the parts is necessary but not sufficient. Recent work flags that combining individually-calibrated models does not guarantee a calibrated composite, so calibrate the blended score against the composite outcome as well — not just each head in isolation.

Worked intuition

Suppose P(share) is systematically over-predicted by 2x. Then every unit of w_share silently double-counts shares: raising it to “value sharing more” actually injects twice the intended weight, and the ranking tilts toward share-baity content the team never meant to favor. Calibration is precisely what makes the weights mean what you think they mean.

Calibration drifts — monitor it

The population and the feedback loop both shift the predicted-vs-actual relationship over time, so recalibrate on a rolling window and treat calibration as a first-class metric: track reliability curves and Expected Calibration Error (ECE) alongside AUC, not as an afterthought.
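ECE is simple enough to sketch directly. This version assumes equal-width probability bins and binary labels. The bin count and the function name are illustrative choices, not a fixed standard.

```python
def expected_calibration_error(probs, labels, num_bins=10):
    """Equal-width-bin ECE: sum over bins of
    (share of examples in bin) x |mean prediction - observed positive rate|."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_p - rate)
    return ece


well_calibrated = expected_calibration_error([0.25] * 4, [1, 0, 0, 0])
over_confident = expected_calibration_error([0.75] * 4, [1, 0, 0, 0])
print(well_calibrated, over_confident)
```

Both toy predictors rank the four examples identically, so AUC cannot tell them apart; only the calibration metric separates them.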

5

High-level architecture — Re-rank: diversity, freshness, integrity

Why a list-wise re-rank

Pointwise value scores rank each item in isolation and ignore the page as a whole — so the top-N by V might be five near-duplicate clips from one creator. Re-rank takes the top-N and optimizes a list-wise objective: diversity, freshness, fatigue, and business rules. This is where pointwise scores become an actual list.

DPP vs MMR

| Method | How it works | Trade-off |
|---|---|---|
| DPP | Kernel of relevance x pairwise similarity; selects a diverse high-value subset via greedy approx inference | List-aware, principled; heavier to tune/serve |
| MMR | Greedy: each pick maximizes relevance minus max similarity to already-selected | Cheap and simple; less globally list-aware |

DPP (Determinantal Point Processes) is YouTube’s CIKM’18 production approach for feed diversity: a kernel encodes item relevance and pairwise similarity, served with approximate greedy inference, and it drove both short- and long-term engagement gains on the homepage. MMR (Maximal Marginal Relevance) is the simpler greedy alternative — cheaper, but it only looks at similarity to the running selection rather than the set as a whole.
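MMR is short enough to sketch in full. This is a minimal version assuming a similarity function that returns values in [0, 1]. The same-creator similarity and the lam trade-off value are placeholders chosen for the example.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy MMR. `relevance` maps item -> score, `similarity(a, b)` returns
    a value in [0, 1], and `lam` trades relevance against redundancy."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * max_sim

        best = max(sorted(remaining), key=mmr_score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical similarity: 1.0 for the same creator, 0.0 otherwise.
creator = {"a1": "alice", "a2": "alice", "b1": "bob"}


def same_creator(x, y):
    return 1.0 if creator[x] == creator[y] else 0.0


page = mmr_rerank({"a1": 0.9, "a2": 0.85, "b1": 0.6}, same_creator, k=2)
print(page)
```

The second-highest value score (a2) loses its slot to b1 because it duplicates the creator already on the page, which is the pointwise-to-list-wise shift this stage exists for.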

Integrity is a hard gate, not a soft weight

Borderline or violating content is removed or demoted before final selection — this is a non-negotiable filter sitting directly in the serving path, not a term in the value blend. You cannot let a high V “buy back” a safety violation.

Other re-rank constraints

  • Creator/source de-duplication — avoid three clips from one creator in a row.
  • Freshness boosts — give new items a chance to accumulate signal.
  • Per-user fatigue — suppress already-seen items and recently-shown topics.

Exploration slots live here

Re-rank is also where exploration slots are injected: a small fraction of positions reserved for under-exposed or cold-start candidates. That’s the hook for closing the feedback loop — covered next.

6

Deep dive — Closing the loop: exposure bias

WHERE STAFF IS WON

This is the Staff differentiator. Everything above is a competent funnel; this section is what separates a Staff answer from a Senior one. The core realization: the model partly chooses its own training data. Tomorrow’s model trains on engagement collected from items today’s model decided to show — so the deliverable isn’t just a better scorer, it’s a data-collection-and-correction policy.

Name the biases precisely

Vague gestures at “users only see recommended items” aren’t enough. Name the mechanisms:

| Bias | Mechanism | Correction |
|---|---|---|
| Exposure / selection | Users only engage with items the ranker chose; unobserved is not disliked | IPS reweighting; log full candidate set |
| Position | Higher slots get more engagement regardless of relevance | Position/debias tower; log slot index |
| Popularity | Popular items over-exposed, accruing yet more engagement | IPS down-weighting; exploration of tail |

All three are amplified by the feedback loop: each retrain bakes yesterday’s selection into today’s training distribution, and left uncorrected the system narrows toward a self-reinforcing filter bubble.

The logging contract comes first

Debiasing is impossible to retrofit without the right logs, so design the data contract before the model. At serving time, log:

  • Serving propensity — P(item shown | request), the probability the policy assigned to showing this item.
  • The full, un-truncated candidate set — not just the ~10 shown, so you can reason about items that could have been shown but weren't.
  • Position — the slot each shown item occupied.

If you don’t log propensities and the un-truncated candidates, no amount of later modeling can recover unbiased estimates. This is the Staff “design the contract first” instinct.
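One possible shape for that contract, sketched as plain dataclasses. Every field name here is hypothetical, and a real schema would carry more (timestamps, feature snapshot ids, per-stage scores). The point is that un-shown candidates and propensities are first-class fields.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CandidateLog:
    item_id: str
    source: str                     # e.g. "ann", "follow", "trending", "fresh"
    propensity: float               # P(item shown | request) under the serving policy
    position: Optional[int] = None  # slot index if shown, None if not shown


@dataclass
class ImpressionLog:
    request_id: str
    policy_version: str
    candidates: List[CandidateLog] = field(default_factory=list)

    def shown(self):
        """Shown items in slot order."""
        return sorted(
            (c for c in self.candidates if c.position is not None),
            key=lambda c: c.position,
        )


log = ImpressionLog(
    "req-1",
    "ranker-v1",
    [
        CandidateLog("v1", "ann", 0.9, position=0),
        CandidateLog("v2", "fresh", 0.05, position=1),
        CandidateLog("v3", "trending", 0.4),  # retrieved but not shown
    ],
)
record = json.dumps(asdict(log))
print(record)
```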

Correction methods

  • Inverse Propensity Scoring (IPS): weight each logged interaction by 1 / propensity, so over-exposed items are down-weighted and under-exposed interactions are up-weighted, yielding a less biased estimate of true value. Clip extreme weights to control variance.
  • Position / debias tower: train a small auxiliary tower on position (and other exposure features) jointly with the main model, then drop it at serving so ranking reflects relevance rather than slot. This factorizes "got engagement because it was good" from "got engagement because it was on top."
  • Counterfactual / off-policy evaluation: use logged propensities to estimate offline how a new policy would have performed, so you can vet changes before exposing users.
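A minimal sketch of the IPS idea with weight clipping. It uses the self-normalized variant (dividing by the sum of weights rather than the sample count), which is a swapped-in choice for stability and not something the text above prescribes. The clip value is an arbitrary illustration.

```python
def clipped_ips_estimate(rewards, propensities, max_weight=10.0):
    """Self-normalized IPS: weighted mean reward with weights 1/propensity,
    clipped at `max_weight` to bound variance (at the cost of some bias)."""
    weights = [min(1.0 / p, max_weight) for p in propensities]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


rewards = [1.0, 1.0, 0.0, 1.0]
propensities = [0.5, 0.5, 0.5, 0.1]  # the last item was rarely shown
naive = sum(rewards) / len(rewards)
ips = clipped_ips_estimate(rewards, propensities)
print(naive, ips)
```

The engagement on the rarely shown item counts five times as much as each frequently shown one, so the estimate moves from 0.75 to 0.875. Lowering `max_weight` pulls it back toward the naive mean.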

Exploration is mandatory

Correction alone can’t fix data you never collected. Reserve a small slice of impressions — epsilon-greedy or bandit slots (the re-rank slots from Step 5) — for under-exposed and cold-start items. This keeps the system collecting roughly unbiased signal on the tail and stops the corpus from ossifying into a filter bubble. The cost is a small, bounded engagement tax now in exchange for a healthier loop later.
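A sketch of a single epsilon-greedy exploration slot that also returns the propensity to log. It assumes the ranker is otherwise deterministic, that exploration candidates are drawn uniformly, and that none of them already appear on the ranked page. The function name and the choice of the last slot are illustrative.

```python
import random


def fill_exploration_slot(ranked, explore_pool, epsilon, rng):
    """With probability epsilon, replace the last slot with a uniformly drawn
    exploration candidate. Returns (page, slot_propensity), where
    slot_propensity is the probability the shown item got that slot."""
    page = list(ranked)
    if explore_pool and rng.random() < epsilon:
        page[-1] = rng.choice(explore_pool)
        return page, epsilon / len(explore_pool)
    exploit_propensity = (1.0 - epsilon) if explore_pool else 1.0
    return page, exploit_propensity


rng = random.Random(7)
page, propensity = fill_exploration_slot(
    ["v1", "v2", "v3"], ["new1", "new2"], epsilon=0.1, rng=rng
)
print(page, propensity)
```

Returning the propensity alongside the page is the design point: the exploration policy is the one place where the serving probability is known exactly, so it should be logged there rather than reconstructed later.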

Drift detection

Watch for the loop collapsing before it shows up as a metric regression:

  • Offline-up / online-down divergence — offline AUC rising while online engagement falls is a classic feedback-loop tell (and a sample-selection-bias tell).
  • Exposure concentration — track a Gini coefficient over creators/items; a rising Gini means the system is concentrating exposure.
  • New-creator reach — monitor as an explicit fairness guardrail so new creators stay enterable.

The one line that wins Staff

The closed loop means the model partly chooses its own training data — so the real deliverable is a data-collection-and-correction policy, not just a better scorer. Lead with that framing and the rest of this section reads as the implementation.

7

Rollout strategy — Funnel consistency & the feature store

Pre-rank to late-rank consistency

Two independently-trained rankers fight each other. If the cheap pre-ranker orders items differently from the heavy late-ranker, the pre-ranker filters out items the late-ranker would have loved — recall wasted before the good model ever sees it. Fix it with distillation: train the pre-ranker (student) to mimic the late-ranker’s (teacher) ordering, so their rankings align and the funnel stops discarding gold. The goal isn’t two accurate models; it’s two consistent ones.
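One way to put a number on that consistency is to measure how much of the late-ranker's top-k survives the pre-ranker's cut. This is a sketch of an evaluation metric, not of the distillation loss itself. It assumes both models have scored the same candidate set offline, and the function name is hypothetical.

```python
def topk_recall(pre_scores, late_scores, k):
    """Fraction of the late-ranker's top-k that the pre-ranker also places
    in its own top-k. Both arguments map item_id -> score."""

    def topk(scores):
        # Sort by score descending, breaking ties by item id for determinism.
        return set(sorted(scores, key=lambda i: (-scores[i], i))[:k])

    return len(topk(pre_scores) & topk(late_scores)) / k


late = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
pre_aligned = {"a": 0.7, "b": 0.6, "c": 0.2, "d": 0.1}
pre_misaligned = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}
print(topk_recall(pre_aligned, late, k=2), topk_recall(pre_misaligned, late, k=2))
```

Note that the aligned pre-ranker's scores are not close to the teacher's in absolute terms; only the ordering matches, which is all the funnel needs from it.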

Sample-selection bias in the funnel

Each stage trains only on items that passed the previous stage (impressions), but at serving each stage must score all recalled candidates — including long-tail and cold items it never saw in training. Those items are out-of-distribution, so naive training over-trusts the head. Mitigate by adding entire-chain / cross-stage training samples (items sampled from earlier stages, not only impressions) so each model sees the distribution it actually scores. ESMM-style entire-space modeling is a concrete, known pattern for the impression-to-watch-through analog of click-to-conversion SSB.

Feature store in the hot path

Late-rank needs fresh user, item, and cross features for every candidate, and those reads sit inside the ~6ms feature-fetch budget. A single slow read silently blows the page SLO, so:

  • Batch feature reads across candidates into few round-trips.
  • Cache hot user/item features near compute.

Train/serve feature parity

The same feature transforms must run offline and online, computed point-in-time correct (using only data available at request time). Otherwise you get training/serving skew — a top cause of “great offline, bad online.”
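Point-in-time correctness boils down to an "as of" lookup when building training rows. A minimal sketch, assuming each feature keeps a timestamp-sorted history of (timestamp, value) pairs. The names are illustrative.

```python
from bisect import bisect_right


def point_in_time_value(history, request_ts, default=None):
    """`history` is a list of (timestamp, value) sorted by timestamp.
    Returns the latest value with timestamp <= request_ts, so a training row
    never sees a feature value written after the request it belongs to."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, request_ts)
    return history[idx - 1][1] if idx > 0 else default


watch_count_history = [(100, 3), (200, 5), (300, 9)]
print(point_in_time_value(watch_count_history, 250))  # value as of t=250
```

Joining on the latest value instead (9 here) is the classic leak: the offline model gets to see engagement that had not happened yet at request time.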

Feature freshness vs cost

Be explicit about which features are real-time vs batch:

  • Real-time — last-N watched, current-session signals; updated per event.
  • Batch — heavy item features, aggregate creator stats; updated on a backfill cadence.
8

Bottlenecks, observability & evolution — Eval, retraining & guardrails

Offline eval & splits

Use time-based splits, never random: train on the past, validate on the future, with point-in-time feature snapshots. Random splits leak future engagement into training and inflate every offline metric. Treat offline as a filter, not a decision — the closed loop makes offline/online divergence common.

| Metric | Offline proxy | Online guardrail |
|---|---|---|
| Engagement quality | Per-head AUC, watch-time XAUC, NDCG | Watch-time, retention, DAU (primary) |
| Calibration | Reliability curves, ECE | Score-vs-outcome drift |
| Integrity | Classifier offline precision/recall | Violation rate (must not regress) |
| Diversity / fairness | Exposure Gini on holdout | Exposure concentration, new-creator reach |
| Negative signal | Predicted-skip AUC | Report/skip rates |

Online A/B + guardrails

Ship via A/B with engagement as the primary metric (watch-time, retention, DAU) but gate on guardrails — integrity violation rate, diversity/exposure concentration, new-creator reach, report/skip rates. Ship only if the primary metric wins and every guardrail holds. A watch-time win that tanks new-creator reach is not a ship.

Retrain cadence

State a defensible number rather than “frequently.” Run incremental/continual training on a short window — hours to a day — to track fast preference and content drift, with full retrains less often (e.g., weekly) to reset accumulated drift. The short window is what keeps the model current with what’s trending today.

Monitor the loop itself

Feed Step 6’s signals back into retraining decisions: exposure Gini, the fraction of impressions coming from exploration slots, and offline-up / online-down alerts. These tell you whether the loop is healthy or quietly collapsing.

Cold-start & freshness as ongoing ops

Keep dedicated fresh-content pools and exploration running continuously so new items and creators stay enterable. Without them the corpus ossifies and the feed becomes a filter bubble — the failure mode this whole design exists to prevent.

Summary

A Staff answer nails six load-bearing moves. Each maps a Senior instinct onto its Staff upgrade.

Funnel with real numbers. Senior: “retrieve then rank.” Staff: billions to thousands to hundreds to tens to ~10, with a latency budget that sums to under 50ms and model complexity rising as candidate count falls — because cost is candidates × per-item model cost.

One calibrated value score. Senior: predict a few probabilities and sum them. Staff: an MMoE/PLE multi-task model with watch-time via weighted logistic regression, heads calibrated (Platt/isotonic) so a weighted sum is meaningful, and the weights framed as a tunable Pareto/product lever — not a fixed hyperparameter.

The closed-loop exposure-bias story, end to end. Senior: “users only see recommended items.” Staff: name exposure, position, and popularity bias; design the propensity-logging contract first; correct with IPS plus a drop-at-serving debias tower; add exploration slots; and detect drift via exposure Gini and offline-up/online-down divergence. This is the Step 6 differentiator.

Funnel consistency. Senior: two separate rankers. Staff: distill late into early for rank consistency, fix sample-selection bias with entire-chain samples, and hold train/serve feature parity through a feature store kept inside the hot-path budget.

Disciplined ops. Senior: track AUC, ship if it rises. Staff: time-based splits with point-in-time snapshots, guardrailed A/B (integrity, diversity, creator fairness), and a defensible incremental-retrain cadence (hours-to-a-day) that keeps the loop from collapsing into a filter bubble.

The one-sentence Staff signal: The model partly chooses its own training data, so the deliverable is a data-collection-and-correction policy, not just a better scorer.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
|---|---|---|
| Problem framing & funnel design | Lists retrieval/rank/re-rank stages and roughly correct fan-out | Justifies why a funnel exists (compute ∝ candidates × model cost), sets explicit per-stage fan-out and latency budgets, and ties each stage's model complexity to its candidate count |
| Retrieval & sourcing | Two-tower + ANN to get candidates; mentions FAISS/HNSW | Blends multiple sources (embedding ANN, follow-graph, trending, fresh), picks IVF-PQ/ScaNN vs HNSW by RAM/recall trade-off, and shards billion-item indexes scatter-gather with a precomputed item-embedding cache |
| Value model & multi-task blending | Predicts a few engagement probabilities and sums them | Uses MMoE/PLE shared experts + task gates, models watch-time via weighted logistic regression, blends calibrated heads into one value score, and treats weights as a tunable Pareto/product lever |
| Calibration & label definition | Uses clicks/likes as labels | Defines each label precisely (watch-through threshold, dedups, negative sampling), calibrates per-head probabilities (Platt/isotonic), and calibrates the composite so the blended score is meaningful across surfaces |
| Closed-loop exposure bias | Notes users only see recommended items | Names selection/exposure/position bias explicitly, logs serving propensities, applies IPS or a debiasing tower, adds exploration, and detects feedback-loop drift before it becomes a filter bubble |
| Funnel consistency & serving | Two separate models for early and late rank | Distills the late ranker into the early ranker for rank consistency, addresses sample-selection bias (train on impressions, serve on all recalled), and keeps the feature store reads inside budget with caching/batching |
| Eval, retraining & guardrails | Tracks AUC and ships if it improves | Uses time-based splits with point-in-time feature snapshots, watches for offline/online metric divergence, runs A/B with guardrails (integrity, diversity, creator fairness), and sets a defensible incremental-retrain cadence |