
Design an Autonomous-Driving Data Engine (Active-Learning + Auto-Labeling Loop)

A closed-loop "data engine" is how companies like Tesla, Waymo, and Cruise turn a fleet of millions of vehicles into a self-improving training-data factory: the on-vehicle model flags moments it's likely wrong, uploads short sensor clips, a heavy offline model auto-labels them, only the genuinely hard cases reach humans, and the model retrains and redeploys. This question probes whether you can co-design the active-learning acquisition function, the offline auto-labeler (including non-causal use of future frames), and the petabyte-scale, idempotent orchestration around them — without regressing the deployed model.

Level: Staff
Category: AI System Design
Interview time: 60 min


WHAT THIS QUESTION TESTS

·Can you design an active-learning acquisition function (uncertainty / ensemble disagreement) instead of random sampling?

·Do you understand why an OFFLINE auto-labeler can beat the online model — non-causal future frames, no latency budget, ensembles?

·Can you make a petabyte-scale, idempotent retrain/redeploy loop with regression gates?

·Do you handle catastrophic forgetting and label noise rather than assuming clean data?

★ STAFF-LEVEL SIGNALS

★Closes the loop end-to-end and reasons about the flywheel's feedback dynamics (drift, bias toward triggered cases)

★Quantifies the loop: TB/hour per car, trigger fire-rate budget, human-label throughput, retrain cadence

★Designs the human-in-the-loop as a router (auto-label confidence → consensus → escalate) not a binary

★Treats 'is the new model actually better?' as a shadow-eval + regression-suite problem, not a single offline metric

Scope — frame the loop & who's asking

A self-driving model is only as good as the rarest situation in its training set — the data engine exists to manufacture that rare data on purpose.

A “data engine” is not a pipeline that runs once; it is a closed feedback loop that turns a deployed fleet into a self-improving training-data factory. The on-vehicle model flags moments it is likely wrong, the car uploads short sensor clips, a heavy offline model auto-labels them, only the genuinely hard cases reach humans, and the model retrains and redeploys — and the next version of the model decides what the fleet collects after that. This is Tesla’s “Data Engine” (Karpathy, Autonomy Day 2019 / AI Day 2021), Waymo’s offboard auto-labeling, and the general industry “data flywheel.”

The interviewer is testing whether you can co-design three things that a generic ETL pipeline does not have: an active-learning acquisition function (what to capture), an offline auto-labeler that exploits non-causal future frames, and an idempotent retrain/redeploy loop with regression gates — all at petabyte scale and without silently regressing the deployed model.

What a “data engine” is

It is RLHF for perception. The acquisition function is your reward-model-as-data-selector, the offline auto-labeler is a stronger teacher distilling into a latency-bound student, and the human QA queue is the same hard-case routing problem as LLM preference annotation. If you have built an LLM eval-and-retrain flywheel, this is the sensor-clip analog: same loop, different modality.

The closed-loop diagram

  +-------------------------------------------------------------------+
  |                                                                   |
  v                                                                   |
(1) on-vehicle    --> (2) bandwidth-aware  --> (3) offline            |
    triggers +            dedup'd upload           auto-labeler       |
    acquisition                                    (future frames)    |
                                                     |                |
                                                     v                |
(6) regression +  <-- (5) anti-forgetting  <-- (4) human QA           |
    shadow-gated          retrain                  (hard cases only)  |
    staged redeploy                                                   |
        |                                                             |
        +--> updated trigger definitions fan back out to fleet ------+

Who asks this & what they probe

| Role | What they own | Hardest questions |
|---|---|---|
| SDE | Trigger fan-out to millions of cars, bandwidth-aware upload, petabyte storage + indexing, idempotent orchestration | Scale, cost per PB, exactly-once semantics |
| MLE | Acquisition function, offline auto-labeler, label-noise/escalation, catastrophic forgetting | What to label, how to trust labels, how to prove the new model is better |
| Switcher (SDE → AI) | Lead with the orchestration/storage you know, then the ML layer | Why uncertainty drives selection, why offline out-labels online |

Scoping the loop

The six stages are: (1) trigger on-vehicle, (2) bandwidth-aware upload, (3) offline auto-label, (4) human QA on hard cases only, (5) retrain with anti-forgetting, (6) regression-gated staged redeploy. The driving constraint behind all six is the long tail: a single car generates 1–5 TB/hour of raw sensor data, you cannot label everything, and the model fails on rare corner cases — so selection is the whole game.

Requirements — triggers & the acquisition function

The first requirement is deciding what to capture. Capture everything and you drown; sample randomly and you spend your entire human-label budget re-confirming highway scenes the model already aces. So the design is two-tier selection.

Cheap on-device triggers

Cheap, rule-based triggers fire in real time on the vehicle and decide what is even eligible for upload. They are deliberately dumb because they run under the on-vehicle compute and latency budget.

| Trigger type | Fires when | Signal it captures |
|---|---|---|
| Planner-vs-human | Planned trajectory diverges from the human driver's actual path | Behavioral disagreement |
| Hard event | Hard brake, swerve, high jerk | Safety-relevant moment |
| Disengagement | Safety driver takes over | Strong failure label |
| Rare-object hit | Detector fires on a rare class (debris, animal) | Long-tail object |
| Low-confidence | Per-frame detector confidence drops | Cheap uncertainty proxy |

The real acquisition function

Triggers gate eligibility; the acquisition function ranks and mines the eligible pool to decide what is actually worth labeling. It runs offline where compute is cheap. The signals:

Predictive entropy — the entropy of the mean (model-averaged) prediction — measures total predictive uncertainty: aleatoric (irreducible, data) plus epistemic (model) uncertainty mixed together. A high value alone cannot tell a genuinely ambiguous scene from one the model simply hasn't learned yet.
Epistemic (model) uncertainty is what you actually want for active learning, because it is reducible by more data. Estimate it via an ensemble or MC-dropout: where members disagree, the model doesn't know.
BALD (Bayesian Active Learning by Disagreement) isolates epistemic uncertainty cleanly: it is the entropy of the mean prediction (total) minus the mean of per-member entropies (the aleatoric part). High BALD = the members are individually confident but disagree with each other — the highest-value-to-label signal.

    import numpy as np

    def bald_score(member_probs, eps=1e-12):  # member_probs: [M members, C classes]
        p = np.clip(member_probs, eps, 1.0)   # guard against log(0)
        mean_p = p.mean(axis=0)               # [C] ensemble-mean prediction
        H_mean = -(mean_p * np.log(mean_p)).sum()     # total uncertainty
        mean_H = -(p * np.log(p)).sum(axis=1).mean()  # expected per-member entropy (aleatoric)
        return H_mean - mean_H                # epistemic only

NVIDIA’s production active-learning system used an ensemble of roughly 8 models to compute disagreement and select frames from an unlabeled pool — and beat random sampling in a controlled A/B test.

Diversity: Add a coverage term (core-set / feature-space distance) so you don’t select 10,000 near-identical clips of the same uncertain highway scene. Uncertainty tells you which scenes are hard; diversity stops you from buying the same hard scene 10,000 times.
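One way to combine the two signals is a greedy farthest-point (k-center style) pass weighted by the acquisition score. The sketch below is illustrative rather than a description of any production system: `select_diverse`, the embedding input, and the choice to multiply distance by score are all assumptions, and it presumes non-negative scores such as BALD.

```python
import numpy as np

def select_diverse(embeddings, scores, k):
    """Pick up to k clips that are both uncertain and spread out in feature space.

    embeddings: [N, D] per-clip feature vectors (hypothetical input)
    scores:     [N] non-negative acquisition scores, e.g. BALD
    """
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float)
    chosen = [int(np.argmax(scores))]  # seed with the most uncertain clip
    # Distance from every clip to its nearest already-chosen clip.
    min_dist = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    while len(chosen) < min(k, len(scores)):
        gain = min_dist * scores       # far from the chosen set AND hard
        gain[chosen] = -np.inf         # never pick the same clip twice
        nxt = int(np.argmax(gain))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen

# Three near-identical uncertain clips and one distant, less uncertain clip.
emb = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [5.0, 5.0]]
print(select_diverse(emb, [0.9, 0.8, 0.85, 0.5], k=2))  # [0, 3]
```

With a budget of two, the second pick is the distant clip rather than another copy of the first scene, even though the copies score higher on uncertainty alone.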

The fire-rate budget

Budget the trigger fire-rate explicitly. At 1–5 TB/hr raw, you can only afford to upload a tiny fraction, so triggers are tuned to a target clips/car/day and WiFi-gated. Watch the flywheel bias: selecting only what the model is unsure about over-represents its current failure modes and starves it of easy-but-shifting cases. Mix in a small random baseline stream so the training distribution doesn’t collapse onto the model’s own mistakes.
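A minimal sketch of how a per-car daily budget might be split between ranked uncertain clips and the random baseline stream. The function name, the 10% default, and the `(clip_id, score)` input shape are assumptions for illustration; the text only says the baseline should be small.

```python
import random

def pick_uploads(candidates, daily_budget, random_fraction=0.1, seed=0):
    """Split one car's daily clip budget into a ranked part and a random part.

    candidates: list of (clip_id, acquisition_score) eligible for upload today
    Returns (ranked_ids, baseline_ids); the two lists never overlap.
    """
    rng = random.Random(seed)
    n_random = int(round(daily_budget * random_fraction))
    n_ranked = daily_budget - n_random
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top, rest = ranked[:n_ranked], ranked[n_ranked:]
    # Baseline is drawn uniformly from what the ranking did NOT select,
    # so it samples the part of the distribution the model feels sure about.
    baseline = rng.sample(rest, min(n_random, len(rest)))
    return [c[0] for c in top], [c[0] for c in baseline]
```

Tagging each uploaded clip with which stream it came from keeps the two separable later, which matters when re-balancing the training mix.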

Estimation — bandwidth-aware edge-to-cloud upload

Now get the selected data home — the SDE-heavy half — without blowing the cellular bill or the cloud ingest tier.

Clip sizing (the numbers block)

A snapshot clip = all major sensors for a short window (Tesla uploads roughly 1-minute multi-sensor snapshots on a corner case).

A 128-beam LiDAR emits ~2.6M points/s, on the order of ~1 Gbps.
Cameras stream well over 30 GB/hr per surround rig.
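A multi-sensor clip of a few seconds to a minute is therefore hundreds of MB to a few GB once pre-filtered and compressed (at 1–5 TB/hr raw, an uncompressed minute would be roughly 17–83 GB).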

On-device pre-filter

Pre-filter and compress before upload: drop redundant frames, downsample where safe, and run the cheap triggers so only eligible clips are even queued. The car ships a curated artifact, not a raw firehose.

WiFi / bandwidth gating

Uploads are gated mostly on whether the car is on WiFi and how much it drives. Cellular is reserved for the highest-priority clips (e.g. a disengagement) under a hard byte cap; the bulk waits for the garage WiFi. Never blow the cellular budget.
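The gating rule can be sketched as a small per-clip decision. Everything named here is hypothetical: the `Clip` fields, the priority scale, and the cap are placeholders for whatever the fleet policy actually defines.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    size_bytes: int
    priority: int  # assumed scale: 2 = disengagement, 1 = hard event, 0 = routine

def choose_link(clip, on_wifi, cellular_used, cellular_cap, min_cell_priority=2):
    """Return 'wifi', 'cellular', or 'defer' for one queued clip."""
    if on_wifi:
        return "wifi"
    fits_cap = cellular_used + clip.size_bytes <= cellular_cap
    if clip.priority >= min_cell_priority and fits_cap:
        return "cellular"
    return "defer"  # stays queued until the car is back on WiFi
```

The byte cap is checked before sending, so a single large high-priority clip cannot push the car over its cellular budget.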

Dedup

Dedup at the edge and in-cloud via perceptual/scene hashing so the fleet doesn’t upload a million copies of the same common intersection. A million cars passing the same on-ramp should yield a handful of clips, not a million.
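As a rough illustration of perceptual hashing, the sketch below uses a simple average hash over one grayscale keyframe and compares hashes by Hamming distance. This is a stand-in technique chosen for brevity; a real scene hash would likely use learned embeddings or multi-sensor signatures, and the 5-bit threshold is an assumption.

```python
import numpy as np

def average_hash(frame, size=8):
    """64-bit average hash of a 2-D grayscale frame.

    Assumes both frame sides are multiples of `size`.
    """
    frame = np.asarray(frame, dtype=float)
    h, w = frame.shape
    # Block-mean downsample to size x size, then threshold at the global mean.
    blocks = frame.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def is_near_duplicate(hash_a, hash_b, max_bits=5):
    """True when the two hashes differ in at most max_bits positions."""
    return int(np.count_nonzero(hash_a != hash_b)) <= max_bits
```

Because the threshold is relative to the frame's own mean, a uniform brightness change leaves the hash unchanged, while a structurally different scene flips many bits.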

Idempotency

Use content-addressed clip IDs (a hash of the payload) so retries and duplicate triggers don’t create duplicate rows. Large clips use resumable multipart upload. Staged commit: edge → regional ingest/buffer → object store, with backpressure when the labeling or storage tier saturates — the fleet slows its own ingest instead of melting the backend.
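A minimal sketch of content-addressed IDs, using SHA-256 from the standard library. The `ClipIndex` class is a hypothetical in-memory stand-in for the metadata table, only there to show that a retried upload maps to the same key and creates no second row.

```python
import hashlib

def clip_id(payload: bytes) -> str:
    """Derive the clip ID from the payload itself, so identical bytes share an ID."""
    return hashlib.sha256(payload).hexdigest()

class ClipIndex:
    """In-memory stand-in for the clip metadata table (illustrative only)."""

    def __init__(self):
        self.rows = {}

    def put(self, payload: bytes, meta: dict):
        """Insert if absent. Returns (clip_id, created) and is safe to retry."""
        cid = clip_id(payload)
        created = cid not in self.rows
        if created:
            self.rows[cid] = dict(meta)
        return cid, created

index = ClipIndex()
first = index.put(b"sensor-bytes", {"trigger": "disengagement"})
retry = index.put(b"sensor-bytes", {"trigger": "disengagement"})
print(first[1], retry[1], len(index.rows))  # True False 1
```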

API design — petabyte clip store & mining index

The storage layer needs two APIs: one to put curated clips durably and cheaply, and one to find them by semantic content. The find API is the actual product.

Storage tiers

Volume reality: a fleet at ~80 TB/day crosses 1 PB in under two weeks; a 5,000-vehicle fleet could generate well over 100 exabytes/year of raw — you store a curated fraction, not everything. The bulk of cost is storage, not compute.
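These volumes are easy to sanity-check with a few lines of arithmetic. The duty-cycle values below (`hours_per_day`) are assumptions, not figures from the text, and they dominate the result: the yearly raw total swings by more than an order of magnitude between a lightly used fleet and one recording around the clock at the top of the 1–5 TB/hr range.

```python
def days_to_reach(target_tb, ingest_tb_per_day):
    """Days of ingest needed to accumulate target_tb."""
    return target_tb / ingest_tb_per_day

def fleet_raw_eb_per_year(vehicles, tb_per_hour, hours_per_day):
    """Raw sensor volume per year in exabytes (decimal: 1 EB = 1,000,000 TB)."""
    tb = vehicles * tb_per_hour * hours_per_day * 365
    return tb / 1_000_000

print(days_to_reach(1000, 80))             # 12.5 days to 1 PB at 80 TB/day
print(fleet_raw_eb_per_year(5000, 5, 24))  # 219.0 (upper bound: 5 TB/hr, always on)
print(fleet_raw_eb_per_year(5000, 1, 8))   # 14.6 (1 TB/hr, 8 driving hours/day)
```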

| Layer | What it holds | Tier | Access pattern |
|---|---|---|---|
| Raw clips | Sensor payloads | Hot object store (active), cold archive (tail) | Streamed into training |
| Metadata | Geo, weather, time, ego-speed, trigger reason, detected objects | Indexed DB | Queried for mining |
| Embeddings | Per-clip learned vectors | Vector index | Nearest-neighbor retrieval |

Use lifecycle policies: clips in active labeling/training stay hot; the long tail rolls to Glacier-class cold storage.

The metadata index is the product

The mining index is the real asset. A query API over rich per-clip metadata plus learned embeddings lets you ask “cyclists at night in rain” and get clips on demand:

    SELECT clip_id FROM clips
    WHERE weather = 'rain' AND tod = 'night'
      AND 'cyclist' = ANY(detected_objects)
    ORDER BY embedding <-> :query_vec  -- vector distance
    LIMIT 5000;

Embedding-based retrieval lets a human describe a failure pattern and pull similar clips fleet-wide — the “search the fleet for more like this” step. Everything is versioned and lineage-tracked: which clips fed which dataset version fed which model version (reproducibility + rollback).

Data model — the offline auto-labeler

This is the MLE core. The same data lands in two model contexts: the online model that drove the car, and an offline auto-labeler that re-derives ground truth far more accurately. The data model that matters is the pseudo-label record: a label plus a calibrated confidence.

Why offline beats online

| Constraint | Online (on-vehicle) | Offline auto-labeler |
|---|---|---|
| Latency budget | Hard real-time | None |
| Compute | On-vehicle SoC | Datacenter, unbounded |
| Causality | Only past + present | Sees the whole clip |
| Model size | Small, distilled | Heavy ensemble |

Non-causal future frames

This is the key unlock. The auto-labeler uses future frames to label the present. An object briefly occluded now is clearly visible 2 seconds later; its track interpolates backward through the occlusion, so the present frame gets a confident box it could never have had online. The online model can never do this — it has no future.
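The backward-fill idea can be shown with a toy track. The sketch below linearly interpolates box centers across occluded frames using observations on both sides of the gap; linear interpolation is a deliberately simple stand-in for the learned refinement a real auto-labeler would use, and the function and argument names are hypothetical.

```python
import numpy as np

def fill_occlusion(times, centers, visible):
    """Fill occluded box centers from observations before AND after the gap.

    times:   [T] frame timestamps, increasing
    centers: [T, 3] box centers; rows where visible is False are ignored
    visible: [T] bool, True where the detector actually saw the object
    """
    times = np.asarray(times, dtype=float)
    centers = np.asarray(centers, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    filled = centers.copy()
    for axis in range(centers.shape[1]):
        # np.interp uses the visible samples on both sides of each gap,
        # which is exactly the information an online model does not have.
        filled[~visible, axis] = np.interp(
            times[~visible], times[visible], centers[visible, axis]
        )
    return filled

nan = float("nan")
track = [[0, 0, 0], [2, 0, 0], [nan, nan, nan], [6, 0, 0], [8, 0, 0]]
print(fill_occlusion([0, 1, 2, 3, 4], track, [True, True, False, True, True])[2])
```

The occluded frame at t=2 receives a center midway between its neighbors at t=1 and t=3, a position that only becomes available once the later frame is known.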

Object-centric track refinement

Waymo’s “3D Auto Labeling” (Offboard 3D Object Detection from Point Cloud Sequences, CVPR 2021) uses multi-frame, object-centric track refinement over 10+ seconds of point clouds, and reports labels on par with expert human labelers. The pipeline:

1. Detect objects per frame with a heavy ensemble.

2. Associate detections into tracks across the full clip.

3. Refine each object’s track using the full temporal context (forward and backward).

4. Emit 3D boxes / lanes / agent trajectories as pseudo-labels, each with a calibrated confidence.

That confidence is load-bearing: it is exactly what routes the clip to auto-accept versus human QA in the next stage.

This is teacher → student distillation in spirit: a strong, non-causal teacher manufactures labels to train the latency-bound causal student that ships on the car.

High-level architecture — human-in-the-loop as a confidence router

Don’t model the human queue as a binary “low-confidence → human.” Model it as a multi-tier router keyed on auto-label confidence, because humans are the scarcest and most expensive resource in the loop and labeling cost is an explicit optimization target.

Tiers, not binary

| Tier | Condition | Action | Cost |
|---|---|---|---|
| Auto-accept | High auto-label confidence | Ship label as-is | Low |
| Consensus | Medium confidence | Multi-annotator vote | Medium |
| Expert escalation | Low confidence / ambiguous | Senior reviewer adjudicates | High |

The whole point of auto-labeling is to send only hard cases to humans. If the auto-labeler is good, the auto-accept tier absorbs the vast majority of clips and humans see only the genuinely ambiguous remainder.
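The router itself is a small piece of logic once the confidence is calibrated. In this sketch the `PseudoLabel` record and both thresholds are assumptions; in practice the cut-offs would be tuned against gold-set audits and the available human throughput.

```python
from dataclasses import dataclass

@dataclass
class PseudoLabel:
    clip_id: str
    label: str
    confidence: float  # assumed calibrated, in [0, 1]

def route(pl, accept_at=0.95, consensus_at=0.70):
    """Map auto-label confidence to one of the three QA tiers."""
    if pl.confidence >= accept_at:
        return "auto_accept"
    if pl.confidence >= consensus_at:
        return "consensus"
    return "expert_escalation"

print(route(PseudoLabel("c1", "cyclist", 0.98)))  # auto_accept
print(route(PseudoLabel("c2", "cyclist", 0.80)))  # consensus
print(route(PseudoLabel("c3", "debris", 0.40)))   # expert_escalation
```

Raising `accept_at` trades human cost for label quality, which is the knob that ties this stage to the labeling budget.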

Trust the labels

You must measure label quality, not assume it:

Inter-annotator agreement (Cohen's / Fleiss' kappa) on the consensus tier: low kappa usually points to an ambiguous guideline rather than bad annotators.
Gold-set audits: seed known-answer clips into the queue to score annotators and the auto-labeler continuously.
Auto-label vs human disagreement is itself a model-error signal.
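For two annotators, Cohen's kappa is observed agreement corrected for the agreement expected by chance. A small standard-library sketch, using made-up labels purely for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators independently pick the same class.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used a single identical class throughout
    return (observed - expected) / (1 - expected)

a = ["car", "car", "ped", "ped"]
b = ["car", "car", "ped", "car"]
print(cohens_kappa(a, b))  # 0.5
```

Here raw agreement is 75%, but half of that would be expected by chance given the class frequencies, so kappa is 0.5. The same function can score the auto-labeler against a human by treating it as the second annotator.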

Disagreement handling: when the auto-labeler and humans disagree systematically, that flags either label-guideline drift or a model blind spot — feed it back into trigger tuning. Active learning closes here too: the clips where auto-labeler confidence is lowest and humans disagree most are exactly the highest-value training examples.

Deep dive — retrain without regressing


This is where Staff is won. The loop is worthless if the retrained model is better on the new corner case but silently worse on the common cases it used to nail. Two forces fight you: catastrophic forgetting and silent regression. Naming the stability–plasticity dilemma explicitly — plasticity to learn the new tail, stability to retain old competence — is the Staff signal.

Catastrophic forgetting is the default failure

The naive trap: fine-tune only on freshly mined corner cases. The model improves on the new hard case and silently degrades on the broad distribution it learned before, because gradient steps on the new data overwrite weights that encoded old competence. By construction your mined data is a biased slice — over-representing triggered failures — so unconstrained fine-tuning overfits the model to its own past mistakes. This is the single most common way a data-engine retrain ships a worse model while every “new corner case” metric goes up.

Mitigations

| Technique | Mechanism | Cost / note |
|---|---|---|
| Replay buffer | Mix 1–5% representative prior data into every retrain | Cheap; even a small fraction sharply cuts forgetting |
| EWC | Penalize changing weights important to old tasks (Fisher-weighted) | Needs old-task importance estimates |
| Distillation anchor | Distill from the previous model as a stability term | Keeps old behavior even without old data |
| Mix re-balancing | Re-weight triggered hard cases against a held-out distribution | Stops overfit to the biased mined slice |

The cheapest high-leverage move is the replay buffer: blending even 1–5% of representative prior data back into every retrain dramatically reduces forgetting. Layer EWC and distillation on top when the tail is large enough to dominate the gradient.
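A sketch of the replay mix, assuming the percentage is measured against the final training set and that `prior` is already a representative sample of earlier data. Both are interpretations for illustration; the function name and defaults are hypothetical.

```python
import random

def build_retrain_mix(mined, prior, replay_fraction=0.05, seed=0):
    """Combine mined clips with enough prior clips to make up replay_fraction of the mix."""
    rng = random.Random(seed)
    # Solve n_prior / (len(mined) + n_prior) = replay_fraction for n_prior.
    n_prior = round(len(mined) * replay_fraction / (1 - replay_fraction))
    n_prior = min(n_prior, len(prior))
    mix = list(mined) + rng.sample(list(prior), n_prior)
    rng.shuffle(mix)  # avoid feeding all replay clips in one contiguous run
    return mix

mined = [f"mined_{i}" for i in range(95)]
prior = [f"prior_{i}" for i in range(1000)]
mix = build_retrain_mix(mined, prior)
print(len(mix), sum(c.startswith("prior_") for c in mix))  # 100 5
```

Fixing the seed keeps the dataset snapshot deterministic, which the orchestration layer relies on for reproducible re-runs.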

Curate the training mix deliberately

Triggered hard cases are over-represented by construction. Re-balance them against a held-out reference distribution so the model learns the new tail without distorting its prior over common scenes. The acquisition function biases what you collect; the mix policy un-biases what you train on.

Is the new model ACTUALLY better?

“Better” is a regression problem, not a single metric. A mean-metric win can hide a tail regression — the exact thing a safety system cannot tolerate.

Frozen regression / scenario suite: a curated set of safety-critical scenarios that must never regress, version-frozen so the bar can't drift. Gate every candidate on it as a hard pass/fail before any aggregate metric is even consulted.
Aggregate metrics on a held-out, distribution-matched eval set — but only as a secondary gate, because the mean hides the tail.
Shadow-mode evaluation: run the candidate on the real fleet in shadow — it predicts but does not actuate — and compare its outputs against the deployed model and against human behavior, on live traffic, before any rollout. This reuses the very same fleet that produces the triggers, closing the loop on itself.

Treating “prove no regression” as a first-class deliverable — with a frozen suite plus shadow eval, not an afterthought offline number — is the differentiator between a senior and a Staff answer here.
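The ordering of the gates matters more than the thresholds. The sketch below encodes that order: the frozen suite is checked first as a hard pass/fail, and aggregate and shadow numbers are consulted only afterwards. All parameter names and threshold values are assumptions, including `shadow_regression_rate`, taken here to mean the fraction of shadow-mode events where the candidate did worse than the deployed model.

```python
def gate_candidate(suite_results, agg_candidate, agg_deployed,
                   shadow_regression_rate,
                   min_agg_gain=0.0, max_shadow_regression=0.02):
    """Return (ship_ok, reason) for a candidate model.

    suite_results: {scenario_id: passed_bool} on the frozen regression suite
    """
    failed = sorted(s for s, ok in suite_results.items() if not ok)
    if failed:
        # Hard gate: no aggregate win can buy back a frozen-suite failure.
        return False, f"frozen suite regressed: {failed}"
    if agg_candidate - agg_deployed < min_agg_gain:
        return False, "aggregate metric did not improve"
    if shadow_regression_rate > max_shadow_regression:
        return False, "shadow-mode regression rate above budget"
    return True, "ok"

# A large aggregate win does not rescue a single safety-scenario failure.
print(gate_candidate({"cut_in_rain": True, "ped_at_night": False}, 0.93, 0.90, 0.0))
```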

Rollout strategy — idempotent orchestration & redeploy

The loop is a DAG

The retrain/redeploy cycle is a DAG: ingest → dataset-build → auto-label → QA → train → eval → deploy. Make every stage idempotent and resumable so a mid-pipeline failure re-runs cleanly without double-counting data.

    ingest -> dataset-build -> auto-label -> QA
           -> train -> eval (regression + shadow) -> deploy

Exactly-once where it matters

The mechanics:

Content-addressed dataset versions and deterministic dataset snapshots, so a re-run on the same inputs yields the same outputs.
Checkpointed stages, so a crash resumes from the last good boundary.
The clip → label → dataset-membership chain must never be duplicated — a double-counted clip skews the training distribution. Reuse the content hashes from upload as stable keys end-to-end, so dedup is enforced by construction rather than by a fragile cleanup job.
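These mechanics can be sketched as a stage runner keyed by a hash of the stage name and its input clip IDs. The checkpoint store here is a plain dict standing in for durable storage, and `run_stage` and `stage_key` are hypothetical names.

```python
import hashlib
import json

def stage_key(stage, inputs):
    """Deterministic key for one stage applied to one set of input IDs."""
    blob = json.dumps({"stage": stage, "inputs": sorted(set(inputs))}).encode()
    return hashlib.sha256(blob).hexdigest()

def run_stage(stage, inputs, fn, checkpoints):
    """Run fn once per (stage, input set); retries return the stored result."""
    key = stage_key(stage, inputs)
    if key not in checkpoints:
        # Dedup and sort before running so a clip listed twice counts once.
        checkpoints[key] = fn(sorted(set(inputs)))
    return checkpoints[key]

calls = []
def fake_auto_label(clip_ids):
    calls.append(clip_ids)
    return {c: "labelled" for c in clip_ids}

store = {}
run_stage("auto-label", ["b", "a", "a"], fake_auto_label, store)
run_stage("auto-label", ["a", "b"], fake_auto_label, store)  # retry: no re-run
print(len(calls), calls[0])  # 1 ['a', 'b']
```

Because the key depends only on content, input order and duplicate triggers do not change it, so a crashed pipeline can be resumed by simply calling every stage again.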

Staged redeploy

Redeploy is a staged OTA rollout: canary on a small fleet slice → monitor real-world + shadow metrics → progressive rollout → fast rollback path keyed to model version. Crucially, trigger definitions are versioned and fanned out the same way (OTA, staged). The loop updates both the model and what the fleet collects next — that dual update is precisely what makes it a flywheel rather than a one-shot pipeline.
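A staged rollout reduces to a tiny state machine: advance one step while health checks pass, drop to zero the moment they fail. The fleet fractions below are placeholders, and `healthy` stands in for whatever combination of real-world and shadow metrics the monitoring stage produces.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # hypothetical fleet fractions

def next_rollout_fraction(current_fraction, healthy):
    """Advance the rollout one stage, or return 0.0 to roll back."""
    if not healthy:
        return 0.0  # fast rollback: previous model version takes over
    for fraction in ROLLOUT_STAGES:
        if fraction > current_fraction:
            return fraction
    return current_fraction  # already at full rollout

print(next_rollout_fraction(0.0, True))    # 0.01
print(next_rollout_fraction(0.05, True))   # 0.25
print(next_rollout_fraction(0.25, False))  # 0.0
```

The same function can drive trigger-definition rollouts, since those are versioned and staged in the same way.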

Bottlenecks & evolution

The system has three structural bottlenecks, each a budget against the long tail or against silent regression.

Bandwidth is the hard ceiling on data in. At 1–5 TB/car/hr raw against a WiFi-gated upload budget, the acquisition function's precision is what determines whether your scarce upload bytes buy high-value clips or noise. Observe: clips/car/day, upload-byte budget utilization, fraction of uploaded clips that survive to training.
Human throughput is the ceiling on label volume. The auto-labeler's auto-accept rate is the lever; if it drops, the human queue backs up and retrain cadence stalls. Observe: auto-accept rate, escalation rate, inter-annotator kappa, auto-label-vs-human disagreement rate.
Silent regression is the ceiling on safe ship velocity. Observe: frozen-suite pass/fail, tail metrics (not just means), shadow-mode delta vs deployed model.

Flywheel feedback dynamics are the subtle long-term risk: the model increasingly trains on data its own uncertainty selected, so it can drift toward over-fitting its past failure modes and under-sampling slowly shifting common cases (seasonal, geographic, new vehicle types). Mitigations: the random baseline stream budgeted alongside the trigger fire-rate, periodic full-distribution re-evaluation, and drift monitors on the input distribution itself.

Evolution: as the auto-labeler improves it absorbs more of the human tier (humans shrink toward pure auditors); as embeddings improve, fleet-wide “find more like this” mining gets cheaper and the acquisition function shifts from per-frame uncertainty toward semantic-coverage targeting of named long-tail scenarios.
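One common way to build a drift monitor on a single input feature is the Population Stability Index, computed between a reference window and the current window. This is offered as one possible choice, not something the design prescribes; the feature, bin count, and any alert threshold are assumptions.

```python
import numpy as np

def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of one scalar feature (e.g. ego-speed).

    Bin edges come from the reference sample; out-of-range current values
    are clipped into the outermost bins.
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    clipped = np.clip(current, edges[0], edges[-1])
    cur_frac = np.histogram(clipped, bins=edges)[0] / len(current)
    # Floor empty bins so the log ratio stays finite.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

Running this on the random baseline stream rather than on triggered clips gives a view of the input distribution that is not distorted by the acquisition function.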


Summary

One breath: triggers + acquisition select the rare data → bandwidth-aware dedup’d upload → petabyte store + mining index → non-causal offline auto-labeler → confidence-routed human QA → anti-forgetting retrain → regression + shadow-gated staged redeploy → updated triggers feed the next loop.

Three differentiators vs a generic ML pipeline:

1. The selection signal is model uncertainty / active learning (epistemic disagreement, BALD, diversity), not data-quality heuristics or random sampling.

2. There is an offline auto-labeling model that exploits non-causality — future frames and object-centric track refinement — to out-label the online model and approach expert humans.

3. A human-in-the-loop QA router (auto-accept → consensus → expert) that an LLM web-text pipeline has no analog for.

Staff throughline: every stage is a budget against two forces — the long tail (you can’t label everything, so select ruthlessly) and silent regression (you can’t ship blind, so gate everything with a frozen suite + shadow eval).

Carry the numbers: 1–5 TB/car/hr raw; a fleet crosses 1 PB in roughly 2 weeks; ~8-model ensemble for disagreement; 1–5% replay to fight forgetting; offline labels matching expert humans.

If asked to cut scope: keep the acquisition function, the offline auto-labeler, and the regression gate — those three are the irreducible core of a data engine. Everything else is plumbing in service of them.


Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
|---|---|---|
| Problem framing & loop closure | Lists the stages (trigger → upload → label → human → retrain → deploy) as a linear pipeline. | Frames it as a closed FEEDBACK loop and reasons about flywheel dynamics: selection bias toward triggered cases, drift, and the risk of the model learning only its own past mistakes. |
| Active-learning / acquisition design | Knows to use model uncertainty (entropy/softmax) to pick interesting frames over random sampling. | Distinguishes aleatoric vs epistemic uncertainty, uses ensemble/MC-dropout disagreement (BALD) plus diversity to avoid redundant near-duplicates, and budgets a trigger fire-rate against fleet bandwidth. |
| Auto-labeler design | Proposes a bigger offline model to pre-label clips and reduce human effort. | Exploits non-causality (future frames), removes the real-time latency/compute budget, uses object-centric multi-frame track refinement, and quantifies that offline labels can match expert humans. |
| Human-in-the-loop / QA routing | Sends low-confidence labels to a human queue. | Designs a multi-tier router (auto-accept high-confidence → consensus/multi-annotator on medium → expert escalation on hard), measures inter-annotator agreement, and treats labeling cost as an explicit optimization target. |
| Infra: fan-out, upload, storage | Uploads clips to cloud storage and stores them in a data lake. | Designs versioned trigger fan-out (OTA, staged rollout), WiFi/bandwidth-gated dedup'd upload with on-device pre-filtering, and a petabyte clip store with rich metadata indexing for mining; reasons about cost per PB. |
| Retrain / redeploy / regression | Retrains on the new data and ships the model. | Makes the DAG idempotent/resumable, mitigates catastrophic forgetting (replay buffer, EWC, distillation), and gates deploy on a frozen regression suite + shadow-mode eval before staged rollout. |
| Quantification & tradeoffs | Gives rough hand-wavy scale ('lots of data'). | Carries real numbers (1–5 TB/car/hr, trigger budget, human throughput, retrain cadence) and reasons explicitly about cost/latency/label-quality tradeoffs at each stage. |
