
Design a Learned ETA Prediction Service

A learned ETA service frames arrival-time prediction as a residual on top of a physical routing engine (à la Uber's DeepETA/DeeprETA), not as raw-time regression. The design centers on the prediction-serving hot path — online feature store freshness, a tight latency budget, calibrated quantile outputs for dispatch, and a guaranteed fallback to the routing-engine ETA. Companies like Uber, Lyft, Grab, and Bolt build and interview on exactly this class of system.

Level: Staff
Category: AI System Design
Interview time: 60 min


WHAT THIS QUESTION TESTS

·Frames ETA as a learned residual on a routing engine, not raw-time regression

·Decomposes a millisecond-scale latency budget (around 15 ms end to end) across feature fetch, model, and fallback

·Designs a deterministic fallback to the routing-engine ETA on model timeout/failure

·Produces calibrated quantiles, not just a point estimate, so dispatch can use uncertainty

★ STAFF-LEVEL SIGNALS

★Argues residual framing as a SAFETY property (bounded blast radius), not just accuracy

★Treats the routing-engine ETA as the floor: the ML can only ever improve or no-op, never break

★Closes the loop: realized-vs-predicted error drives retraining AND online calibration monitoring

★Names the asymmetric-loss business lever (late arrivals cost more) and ties it to a dispatch metric

Framing — ETA as a learned residual, not raw-time regression

“Don’t predict the ETA. Predict how wrong the routing engine is — and never let the model make it worse.”

Ride-hailing platforms — Uber (DeepETA / DeeprETA), Lyft, Grab, Bolt — all run a physical routing engine that produces a base ETA, plus a learned correction on top of it. This is honest industry context, not a leaked or insider question: it is simply the canonical shape this problem takes whenever a learned component sits on the most safety-critical, highest-QPS path in the company. We design to that shape.

The instinct most candidates have — “train a regression model that outputs minutes” — is the wrong frame, and a Staff interviewer is listening for whether you reach for the residual frame instead.

The residual insight

The routing engine already computes a base ETA as the sum of per-segment traversal times along the best path, using map data plus live traffic. It is a well-understood physical model. The ML model predicts only the residual — the difference between that base ETA and the realized arrival time. The served ETA is base_eta + predicted_residual. We learn the correction, never the absolute time.
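To make the sign convention concrete, here is a minimal sketch of the residual as a training label and as a serving-time correction. The function names are hypothetical; the only thing taken from the text is that the served ETA is base_eta + predicted_residual, which implies the residual is realized time minus base ETA.

```python
def residual_label_secs(base_eta_secs: float, realized_secs: float) -> float:
    """Training target: how far the realized trip time was from the base ETA."""
    return realized_secs - base_eta_secs


def served_eta_secs(base_eta_secs: float, predicted_residual_secs: float) -> float:
    """Serving: routing-engine base ETA plus the learned correction."""
    return base_eta_secs + predicted_residual_secs


# A trip the routing engine estimated at 10 minutes that actually took 11.
label = residual_label_secs(base_eta_secs=600.0, realized_secs=660.0)
corrected = served_eta_secs(base_eta_secs=600.0, predicted_residual_secs=label)
```

A positive residual means the routing engine was optimistic; a residual of zero leaves the base ETA untouched.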

Why not raw-time regression

A physical routing model already gets the large majority of trips reasonably right (Google Maps has publicly stated its base predictions are accurate for around 97% of trips). If you regress raw arrival time, a model bug can emit an absurd number — a 4-minute trip predicted as 90 minutes — and there is no floor to catch it. If you instead learn a bounded residual on a trusted base, the worst realistic failure is “no correction applied,” and you fall back to the base ETA. Residual framing turns a potential accuracy disaster into a no-op. That is a safety property, not just an accuracy trick — and it is the through-line of this entire design.

Scope cut

In scope: the online serving hot path — feature freshness, latency budget, the residual model, calibrated quantile outputs, and the deterministic fallback to the routing engine.

Out of scope: the routing / shortest-path engine itself, map matching, and raw GPS ingestion. Those are upstream systems we depend on and treat as given. We consume the base ETA; we do not build it.

Who asks this and what they probe

| Lens | What they probe | What wins |
| --- | --- | --- |
| SDE | Latency budget decomposition, feature-store read path, deterministic fallback | A budget that adds up; fallback as a first-class path, not an afterthought |
| MLE | Residual framing, loss choice, calibration, geospatial generalization | Asymmetric loss tied to business cost; calibrated quantiles; multi-resolution geo embeddings |
| Switcher (SDE to AI) | Whether you can map serving instincts onto ML vocabulary without hand-waving | Naming training-serving skew, calibration, and coverage drift precisely, leaning on serving discipline |

The Switcher angle matters: most of this system is a serving problem you already know (caches, budgets, fallback, freshness). The new vocabulary is residual learning, quantile calibration, and training-serving skew. Lean on the serving instincts and layer the ML reasoning on top; do not try to out-MLE the MLE on architecture.

Requirements, scale, and the latency budget

Functional requirements

Given (origin, destination, request_time, request_type) — where request_type distinguishes a rideshare pickup leg from a delivery dropoff leg — return a corrected ETA plus an uncertainty estimate (quantiles, not just a point). The base ETA from the routing engine is passed in as part of the request, so the residual model and the fallback both have it locally.

Non-functional requirements

Latency: p50 of a few milliseconds; p99 comfortably under the caller's budget. Public writeups put the DeeprETA forward pass at a few milliseconds; a single-digit-millisecond model component (low single digits at the median) is a defensible target for the model alone.
Availability: effectively 100%. A missing ETA breaks both the rider map and dispatch. This is why fallback is a hard requirement rather than a nicety.
Freshness: live segment speeds should be seconds-to-low-minutes fresh; embeddings and static geo features may be hours-to-days stale.
Calibration: quantiles must mean what they say — a P90 should be exceeded about 10% of the time.

Scale envelope

ETA is called on every map render, every dispatch candidate, and every iteration of batch matching. A single dispatch decision may evaluate many driver-rider pairs, each needing an ETA. This fans out to hundreds of thousands to over 1M QPS at peak — the single highest-QPS prediction service in the company. Every microsecond and every byte on the hot path is multiplied by that number.

The latency budget

The budget is the spine of the design. An illustrative end-to-end target around 15 ms decomposes like this:

| Stage | Budget | Notes |
| --- | --- | --- |
| Online feature fetch | 5-7 ms | Live segment speeds + embedding lookups |
| Model forward pass | 3-4 ms | Linear-transformer residual model |
| Pre/post-processing + serialization | ~2 ms | Bucketize, clamp, build response |
| Headroom for fallback | ~2 ms | Must still fit the SLO after a timeout |

The critical move: the model timeout is set strictly below the SLO (for example, an 8 ms model timeout inside a 15 ms budget). When the model is slow, we abandon it and return the base ETA and still meet the SLO. The budget is designed so that failing safely is also failing fast.
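The budget arithmetic can be checked mechanically. The sketch below uses the upper end of each illustrative stage budget and assumes the 8 ms timeout is measured from request arrival; that clock start is an assumption, as are all the constant and function names.

```python
SLO_MS = 15.0
MODEL_TIMEOUT_MS = 8.0  # assumed to be measured from request arrival

# Upper ends of the illustrative per-stage budgets.
STAGE_BUDGET_MS = {
    "feature_fetch": 7.0,
    "model_forward": 4.0,
    "pre_post_serialization": 2.0,
    "fallback_headroom": 2.0,
}


def budget_fits(stages: dict, slo_ms: float) -> bool:
    """The happy path fits only if the stage budgets sum to at most the SLO."""
    return sum(stages.values()) <= slo_ms


def fallback_fits(model_timeout_ms: float, fallback_cost_ms: float, slo_ms: float) -> bool:
    """After abandoning the model at the timeout, the fallback must still fit."""
    return model_timeout_ms + fallback_cost_ms <= slo_ms
```

With these numbers the stages sum to exactly 15 ms, and a timeout at 8 ms leaves 7 ms for a fallback budgeted at roughly 2 ms.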

Estimation and the serving hot path

This step decomposes the request flow and the two data paths that feed it: the read path on the hot critical path, and the write path that keeps features fresh.

The read path

    caller (routing service / dispatch)
      -> ETA service (receives base_eta as input)
      -> online feature store lookup (live speeds + embeddings)
      -> model server (residual prediction)
      -> post-process: clamp, build quantiles
      -> return { p50_eta, p90_eta, fallback_used }

The routing service calls the ETA service with the base ETA already attached. The ETA service fetches features, calls the model server, applies the residual, clamps the result, and returns P50/P90 plus a fallback_used flag so callers and monitoring can see when the model was bypassed.

Because the base ETA arrives as an input, the fallback path needs zero extra network calls. That single design choice is what makes a deterministic floor cheap.

The model server

The residual model, a linear transformer (justified in the deep dive on modeling), is served behind a low-latency RPC, in the spirit of an online model-serving platform like Uber’s Michelangelo online prediction service. The ETA service issues the prediction request and merges the returned residual onto the base ETA it received in the request. Keeping the merge in the model server’s caller means even a totally dead model server degrades trivially to “use base_eta.”

The write path — feature freshness

    GPS probes -> segment-speed pipeline -> Kafka
      -> stream processor -> online feature store (seconds-fresh: live traffic)
    batch jobs -> embedding tables + static geo features
      -> published to online store (hours-fresh)

Two tiers write into the online store. Live traffic flows through Kafka and a stream processor and lands seconds-fresh. Embedding tables and static geo features are computed in batch and published on an hours cadence. The serving path only ever reads the online store.

Caching and interpolation

A shared lookup table caches fresh predictions and features for predefined supersegments, refreshed periodically. Rather than recompute every horizon per request, the server interpolates between adjacent precomputed horizons. This moves work off the hot path and amortizes it across the enormous request volume.
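Horizon interpolation can be sketched as plain linear interpolation over a sorted list of precomputed horizons. What is being interpolated (here, a cached segment speed per horizon) and the function name are assumptions for illustration; the text only says the server interpolates between adjacent precomputed horizons.

```python
from bisect import bisect_right


def interpolate_horizon(horizons_secs, values, query_secs):
    """Linearly interpolate between adjacent precomputed horizons.

    horizons_secs must be sorted ascending; queries outside the range are
    clamped to the nearest precomputed value.
    """
    if not horizons_secs or len(horizons_secs) != len(values):
        raise ValueError("horizons and values must be non-empty and aligned")
    if query_secs <= horizons_secs[0]:
        return values[0]
    if query_secs >= horizons_secs[-1]:
        return values[-1]
    hi = bisect_right(horizons_secs, query_secs)  # first horizon past the query
    lo = hi - 1
    span = horizons_secs[hi] - horizons_secs[lo]
    t = (query_secs - horizons_secs[lo]) / span
    return values[lo] + t * (values[hi] - values[lo])


# Hypothetical cached speeds (m/s) for one supersegment at 0, 10, and 20 minutes out.
speed = interpolate_horizon([0, 600, 1200], [10.0, 8.0, 6.0], query_secs=300)
```

The per-request cost is one binary search and a handful of arithmetic operations, which is the point of precomputing the horizons.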

Parity by construction

The offline training store and the online serving store are physically separate but share a single feature definition. This is non-negotiable: the number-one source of silent ETA regressions is training-serving skew, where a feature is computed one way offline and another way online, and the model looks great in evaluation while quietly biasing production. One definition, two materializations.

API and consumer contract

This step defines the request/response contract and the multi-consumer schema that downstream systems depend on.

Request

    PredictETARequest {
      origin: { lat, lng }
      destination: { lat, lng }
      request_time: timestamp
      request_type: enum { RIDESHARE_PICKUP, DELIVERY_DROPOFF }
      base_eta_secs: float           // from routing engine, REQUIRED
      route_segments: [segment_id]   // for live-speed feature lookup
    }

base_eta_secs is required, not optional. The contract enforces that the floor is always present, so the service can never be in a state where it has no safe value to return.
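One way to enforce "the floor is always present" is to reject the request at construction time. The sketch below mirrors the request fields as a Python dataclass; the field types, the validation rule, and the error message are assumptions, not a published schema.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class RequestType(Enum):
    RIDESHARE_PICKUP = "RIDESHARE_PICKUP"
    DELIVERY_DROPOFF = "DELIVERY_DROPOFF"


@dataclass(frozen=True)
class PredictETARequest:
    origin: Tuple[float, float]        # (lat, lng)
    destination: Tuple[float, float]   # (lat, lng)
    request_time: float                # epoch seconds (assumed representation)
    request_type: RequestType
    base_eta_secs: float               # from the routing engine, REQUIRED
    route_segments: Tuple[str, ...] = ()

    def __post_init__(self):
        eta = self.base_eta_secs
        # Reject missing, non-numeric, NaN, infinite, or negative floors up front.
        if not isinstance(eta, (int, float)) or not math.isfinite(eta) or eta < 0:
            raise ValueError("base_eta_secs is required and must be finite and >= 0")
```

Because validation happens before any feature fetch or model call, every later stage can assume a usable base ETA is in hand.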

Response

    PredictETAResponse {
      p50_eta_secs: float    // stable point for rider UI
      p90_eta_secs: float    // upper quantile for dispatch
      fallback_used: bool    // true when base_eta was returned
      model_version: string  // for schema/version tracking
    }

Multi-consumer contract

Two consumers read different fields from the same response:

·Rider UI wants a stable point estimate (p50_eta_secs). Jitter feels broken to a rider watching the map, so this field is smoothed and stable.
·Dispatch / batch matching wants quantiles (p90_eta_secs) to reason about the worst-case pickup time when matching a batch of riders to drivers.

One model, one versioned schema, different fields consumed. The schema is versioned (model_version) so we can evolve fields — add a P95, change clamping — without breaking either consumer. Defaulting to a single bare number is the classic Senior simplification that quietly starves dispatch of the uncertainty it needs.

Data model — features, encoding, and geospatial generalization

This step covers the feature taxonomy, how features are encoded for the model, and how locations are represented so the model generalizes across geographies.

Feature taxonomy

| Family | Examples | Freshness tier |
| --- | --- | --- |
| Spatial | origin, destination, route segments | hours (embeddings) |
| Temporal | time of day, day of week | static |
| Live traffic | current segment speeds | seconds-minutes |
| Request nature | rideshare pickup vs delivery dropoff | per-request |
| Base ETA | routing engine's base_eta | per-request |

The routing engine’s base ETA is itself a feature: it anchors the residual and lets the model learn “in this context the base tends to run a bit optimistic.”

Feature encoding

Bucketize continuous features and embed all categoricals. In DeepETA’s ablations, bucketing continuous inputs beat feeding them raw. Each feature — continuous-bucketed or categorical — becomes a vector, like a token, except each token represents a feature, not a word. The model then reasons over feature interactions the way a transformer reasons over a sequence.
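Bucketizing reduces a continuous value to an integer bucket id that can index an embedding table. The sketch below uses quantile-style boundaries computed from a training sample; that boundary choice and both function names are assumptions for illustration.

```python
from bisect import bisect_right


def quantile_boundaries(sample, num_buckets):
    """Boundaries that split a training sample into roughly equal-mass buckets."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_buckets] for i in range(1, num_buckets)]


def bucketize(value, boundaries):
    """Map a continuous value to a bucket id in [0, len(boundaries)]."""
    return bisect_right(boundaries, value)


# Hypothetical base-ETA boundaries in seconds: <5 min, 5-10 min, 10-20 min, 20+ min.
eta_boundaries = [300, 600, 1200]
bucket_id = bucketize(450, eta_boundaries)  # falls in the 5-10 minute bucket
```

The bucket id then plays the same role as a categorical value: it selects one learned embedding vector, which becomes that feature's token.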

Geospatial generalization — why raw lat/lng fails

Raw lat/lng or a single city ID overfits dense metros and collapses in sparse regions: the model memorizes downtown and has nothing meaningful to say in a thin suburb. The fix, following DeeprETA:

1. Quantize each location into multiple resolution grids via geohash.

2. Hash each grid cell with an independent hash function per resolution to crush cardinality.

3. Learn an embedding per hash bin.

    location -> geohash @ {coarse, medium, fine}
      -> hash each -> bin ids
      -> embedding lookup per resolution -> concat

Multiple resolutions directly attack sparsity. Coarse grids generalize where data is thin; fine grids specialize where it is dense. The model leans on whichever resolution has signal. Lookup is O(1) — quantize, hash, fetch — which is exactly what makes this serveable inside a few-millisecond budget. It is a deliberate space-time tradeoff: large embedding tables learned offline precompute partial answers and move compute out of the request path.
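The quantize-hash-lookup steps can be sketched as follows. To stay dependency-free this uses a simple lat/lng grid as a stand-in for true geohash cells, and the cell sizes, bin counts, and salting scheme are hypothetical values chosen only to show the shape of the computation.

```python
import hashlib
import math

# (resolution name, cell size in degrees, number of hash bins): all hypothetical.
RESOLUTIONS = (("coarse", 0.1, 1024), ("medium", 0.01, 4096), ("fine", 0.001, 16384))


def grid_cell(lat, lng, cell_deg):
    """Quantize a point to an integer grid cell (stand-in for a geohash cell)."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))


def hash_bin(cell, salt, num_bins):
    """Hash a cell into a fixed number of bins; the salt makes each resolution's hash independent."""
    key = f"{salt}:{cell[0]}:{cell[1]}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_bins


def location_bin_ids(lat, lng):
    """One embedding-table row id per resolution for a single location."""
    return {
        name: hash_bin(grid_cell(lat, lng, cell_deg), name, num_bins)
        for name, cell_deg, num_bins in RESOLUTIONS
    }


bin_ids = location_bin_ids(37.7749, -122.4194)
```

Two nearby points share a coarse cell but land in different fine cells, which is how the coarse embedding carries signal into areas where the fine one has seen little data.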

Freshness tiers and parity

Each feature is tagged with an acceptable staleness. Live segment speeds (seconds-minutes, streamed) are alerted on if the pipeline lags; learned embeddings and static geo features (hours-days, batch) tolerate much more lag. Online and offline features are computed from the same definition so the training distribution matches serving — skew here is the failure where the model dazzles offline and silently biases production ETAs.

Cold start and sparse regions

With no live speeds available, the live-traffic features degrade to coarse priors and the predicted residual shrinks toward zero. The system leans on the routing engine — which is precisely the safe behavior. Sparse data does not produce a wild guess; it produces “trust the physical model.”

High-level architecture — reliability and the deterministic floor

This is the SDE backbone of the question: the fallback contract, the failure taxonomy, and the health signals that keep a learned component safe on the critical path.

The fallback contract

On model timeout, NaN or out-of-range output, feature-store miss, or model-server unavailability, the service returns the raw routing-engine base ETA. Because the base ETA arrived as an input feature, fallback requires no extra dependency or network call — it is a local read of a value we already hold.

The model is a strictly optional enhancer. Worst case, it no-ops and we ship the physical ETA. This is the core safety property residual framing buys: the blast radius of any model bug is bounded by “no correction,” never “absurd ETA.”
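A minimal sketch of the fallback contract as a pure function: any unusable model output collapses to the base ETA. The sanity bounds (a positive ETA no larger than three times the base) and the function name are assumptions; a real system would tune its own range.

```python
import math

MAX_RATIO_TO_BASE = 3.0  # hypothetical upper sanity bound on corrected / base


def apply_residual_or_fallback(base_eta_secs, residual_secs):
    """Return (eta_secs, fallback_used). The base ETA is the floor for every failure."""
    # Missing or non-finite model output: no correction applied.
    if residual_secs is None or not math.isfinite(residual_secs):
        return base_eta_secs, True
    eta_secs = base_eta_secs + residual_secs
    # Out-of-range result is treated as a model failure, not a prediction.
    if eta_secs <= 0 or eta_secs > MAX_RATIO_TO_BASE * base_eta_secs:
        return base_eta_secs, True
    return eta_secs, False


# The 4-minute trip "predicted" as 90 minutes falls back to the base ETA.
eta, fallback_used = apply_residual_or_fallback(240.0, 5160.0)
```

Nothing in the fallback branch touches the network or the feature store, so it stays available when everything else is down.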

Failure-to-response taxonomy

| Failure | Response |
| --- | --- |
| Model timeout | Return base ETA |
| Residual pushes ETA negative or absurdly large | Clamp, or return base ETA |
| Stale live-traffic feature | Use last-good value with staleness flag |
| Feature store down | Return base ETA with degraded flag |
| Model server unavailable | Return base ETA |

Hard timeout below the SLO

The model timeout sits below the SLO (for example, 8 ms model timeout within a 15 ms budget) so that even when we abandon the model, the fallback path returns within SLO. Failing safe and failing fast are the same code path.
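The control flow of "abandon the model at the deadline, return the floor" can be sketched with a thread pool and a timed wait. This only illustrates the shape of the logic: Python threads will not deliver reliable single-digit-millisecond deadlines, a production service would more likely rely on RPC deadlines, and every name here is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)


def eta_with_deadline(predict_residual, base_eta_secs, timeout_secs):
    """Return (eta_secs, fallback_used); never wait on the model past the deadline."""
    future = _POOL.submit(predict_residual)
    try:
        residual_secs = future.result(timeout=timeout_secs)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running call is simply abandoned
        return base_eta_secs, True
    except Exception:
        return base_eta_secs, True  # model error: same floor, same code path
    return base_eta_secs + residual_secs, False


def slow_model():
    time.sleep(0.2)  # stands in for a stalled model server
    return 30.0
```

Timeouts and model errors return through the same branch shape, which is the "failing safe and failing fast are the same code path" property in miniature.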

Health signals

Fallback rate is the headline metric. A rising fallback rate means the model is silently absent — accuracy can look fine in aggregate while the model is barely serving. This is the canary for the whole system.
p99 latency, quantile coverage, and per-region residual bias round out the dashboard.

Deployment safety

Ship in stages: shadow the candidate on live traffic — logging its outputs without serving them — to validate latency and calibration on real distributions; then canary a small traffic slice with auto-rollback if latency, error, or calibration breach thresholds. The deployment pipeline treats a calibration regression as a rollback trigger, not just a latency or error regression.

Deep dive — modeling, calibration, and where Staff is won

WHERE STAFF IS WON

This is the longest section because it is where a Staff answer separates from a Senior one: the model architecture chosen against the latency budget, the loss tied to business cost, calibrated quantiles, and the closed retraining loop connected to dispatch decisions.

Architecture choice, decided by the SLO

DeepETA evaluated seven architectures: MLP, NODE (Neural Oblivious Decision Ensembles), TabNet, Sparsely Gated Mixture-of-Experts, HyperNetworks, full Transformer, and Linear Transformer. The deciding constraint is latency, not raw accuracy.

A full self-attention transformer is O(K²) in the number of feature tokens K — it materializes a K×K attention matrix and blows the millisecond budget. The Linear Transformer uses the kernel trick to approximate attention without ever forming that matrix, collapsing the cost to linear in K while keeping most of the feature-interaction power. The accuracy/latency trade is decided by the SLO, and we say so explicitly: we choose the linear transformer because it hits low single-digit milliseconds, and we accept the small expressiveness loss versus full attention as the price of being serveable at 1M+ QPS.
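The reordering behind linear attention can be shown in a few lines of plain Python. This sketch uses one common kernelized formulation with the feature map elu(x) + 1; whether DeepETA uses exactly this variant is an assumption. Both functions compute the same kernelized attention (not softmax attention): the first builds every pairwise weight, the second summarizes keys and values once and reuses the summary for every query.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(queries, keys, values):
    """Reference form: one weight per (query, key) pair, O(K^2) in token count."""
    out = []
    for q in queries:
        weights = [dot(phi(q), phi(k)) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out


def linear_attention(queries, keys, values):
    """Same output; keys and values are summarized once, O(K) in token count."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    kv = [[0.0] * dim_v for _ in range(dim_k)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * dim_k                       # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        fk = phi(k)
        for i in range(dim_k):
            k_sum[i] += fk[i]
            for d in range(dim_v):
                kv[i][d] += fk[i] * v[d]
    out = []
    for q in queries:
        fq = phi(q)
        norm = dot(fq, k_sum)
        out.append([sum(fq[i] * kv[i][d] for i in range(dim_k)) / norm
                    for d in range(dim_v)])
    return out
```

The K×K matrix never exists in the second form; the only state carried across tokens is a dim_k × dim_v summary, which is why the cost grows linearly with the number of feature tokens.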

Loss — asymmetric Huber tied to business cost

The loss has two jobs:

·Huber (parameter δ) for outlier robustness: GPS noise and rare pathological trips should not dominate the gradient the way squared error would.
·Asymmetry (parameter ω) so that underprediction (arriving late) is penalized more than overprediction (arriving early).

That asymmetry is not a modeling flourish — it encodes a real business cost. A rider told “2 minutes” who waits 6 is a worse outcome than one told “6 minutes” who waits 2. The loss makes the model prefer to be slightly pessimistic, matching the asymmetric cost of lateness.
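A minimal sketch of an asymmetric Huber loss on a single example, assuming ω weights the underprediction side and 1 − ω the overprediction side. The exact parameterization used in production is not given here, so treat this form as one plausible reading of δ and ω.

```python
def huber(error, delta):
    """Quadratic near zero, linear in the tails, so outliers pull less."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


def asymmetric_huber(y_true, y_pred, delta, omega):
    """omega > 0.5 penalizes underprediction (y_pred < y_true, arriving late) more."""
    error = y_true - y_pred
    weight = omega if error > 0 else 1.0 - omega
    return weight * huber(error, delta)


# Told 2 minutes, waited 6 (late) vs told 6 minutes, waited 2 (early).
late_cost = asymmetric_huber(y_true=360.0, y_pred=120.0, delta=60.0, omega=0.7)
early_cost = asymmetric_huber(y_true=120.0, y_pred=360.0, delta=60.0, omega=0.7)
```

Both errors are four minutes in magnitude, but the late case costs more, so gradient descent nudges predictions toward the pessimistic side.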

Calibration — quantiles, not a bare point

The model emits calibrated quantiles (for example P50 and P90) via quantile / asymmetric loss, so downstream consumers receive uncertainty, not a bare point. For delivery especially, the 95th-percentile error matters as much as the mean — a customer cares about the worst plausible wait, not the average. A single point estimate is simply the wrong contract for dispatch and delivery.
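For reference, the standard quantile (pinball) loss that makes a model output a specific quantile looks like this on a single example; the function name is just a label for the sketch.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Quantile loss: minimized in expectation when y_pred is the given quantile."""
    error = y_true - y_pred
    return max(quantile * error, (quantile - 1.0) * error)


# For a P90 head, predicting 20 s too low costs far more than 20 s too high.
under = pinball_loss(y_true=100.0, y_pred=80.0, quantile=0.9)
over = pinball_loss(y_true=100.0, y_pred=120.0, quantile=0.9)
```

Training one output head per quantile with this loss yields the P50 and P90 fields; whether those heads are actually calibrated still has to be checked against realized outcomes.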

Output clamping

The output is clamped to a sane range. A residual that would push the ETA negative or absurdly large is treated as a model failure and triggers fallback to the base ETA. Clamping is the last line of defense before a bad number reaches a rider or a matching decision.

Closing the loop — realized vs predicted

Every completed trip yields a realized arrival time. The realized-vs-predicted residual error is simultaneously the training label and the live quality signal. Retraining is driven by the error distribution — when realized error drifts, we retrain — not by a fixed calendar. The same signal that teaches the next model also tells us the current one is going stale.

Calibration drift — what Staff watches that Senior doesn’t

Monitor whether P90 predictions actually cover about 90% of outcomes. Coverage drift, not just MAE drift, is the early warning that the model has gone stale. A Senior watches accuracy; a Staff watches calibration, because a model can hold its MAE while its uncertainty estimates quietly decay — and dispatch is consuming those uncertainty estimates.
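The coverage check itself is a one-line ratio over completed trips. In the sketch below the 3-point tolerance around the 90% target is an arbitrary placeholder, and both function names are hypothetical.

```python
def empirical_coverage(realized_secs, predicted_quantile_secs):
    """Fraction of trips whose realized time was at or under the predicted quantile."""
    if not realized_secs or len(realized_secs) != len(predicted_quantile_secs):
        raise ValueError("inputs must be non-empty and aligned")
    covered = sum(1 for r, p in zip(realized_secs, predicted_quantile_secs) if r <= p)
    return covered / len(realized_secs)


def coverage_drifted(coverage, target=0.90, tolerance=0.03):
    """Flag drift in either direction: too low is risky, too high is needlessly pessimistic."""
    return abs(coverage - target) > tolerance


# Ten completed trips with a P90 of 150 s each; one trip ran long.
coverage = empirical_coverage([100.0] * 9 + [200.0], [150.0] * 10)
```

Run per region and per time bucket, the same function surfaces local calibration decay that a global number would average away.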

Per-region, per-time bias

Track residual bias per region and per time bucket. This catches the insidious failure where global MAE looks fine but the model systematically under-predicts in one city at rush hour — exactly the skew that quietly corrupts dispatch in that market while the aggregate dashboard stays green.
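Slicing the signed error by region and time bucket is a small aggregation. The tuple layout and the slice keys below are assumptions for illustration; positive bias means trips took longer than predicted.

```python
from collections import defaultdict


def residual_bias_by_slice(trips):
    """Mean signed error (realized - predicted) per (region, time_bucket).

    trips: iterable of (region, time_bucket, predicted_secs, realized_secs).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for region, time_bucket, predicted_secs, realized_secs in trips:
        acc = sums[(region, time_bucket)]
        acc[0] += realized_secs - predicted_secs
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical data: the global mean error is zero, but each slice is off by a minute.
trips = [
    ("city_a", "rush", 300.0, 360.0),
    ("city_a", "rush", 400.0, 460.0),
    ("city_b", "offpeak", 300.0, 240.0),
    ("city_b", "offpeak", 400.0, 340.0),
]
bias = residual_bias_by_slice(trips)
```

The example is built so the aggregate looks perfect while both slices are biased, which is exactly the case the per-slice dashboard exists to catch.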

Staff insight — ETA error propagates into matching cost

The reason lateness is penalized harder is not that MAE looks nicer. ETA error propagates into matching cost: an over-optimistic ETA causes the matcher to assign a driver who then arrives late, producing a bad assignment and a poor rider experience. The asymmetric loss is tied directly to this — we penalize lateness because lateness is what corrupts downstream dispatch decisions. The model’s loss function is, in effect, a dispatch-quality lever.

Staff insight — the routing engine as the contractual floor

The deepest argument for residual framing is organizational. The routing engine is the contractual floor: product, dispatch, and on-call can all reason about a system whose worst case is “the physical ETA we already trusted.” That bounded worst case is what makes it organizationally acceptable to put a learned model on the critical path at all. You are not asking the org to bet the dispatch system on a neural network — you are asking it to let a neural network optionally improve a number it already trusts.

Rollout, scaling, and the QPS reality

This step covers how the system scales horizontally, the cost levers that make a 1M+ QPS fleet affordable, and the rollout posture.

Horizontal scaling

At over 1M QPS the model-server fleet is large. Two properties keep per-request compute low enough to make that fleet affordable: O(1) embedding-table lookups (no per-request graph computation) and a compact linear transformer (small enough to run on CPU). The bulk of the parameters live in offline-learned embedding tables; any one prediction touches only a tiny fraction of them, so per-request compute stays small.

Cost levers

Supersegment caching + horizon interpolation cut redundant computation across overlapping requests.
Request batching within a few milliseconds amortizes the forward pass across many requests without breaking the latency budget.
Quantize / compile the model for CPU serving — per-request compute is small, so CPU is cheap and avoids GPU scheduling overhead at this fan-out.

Hot-path optimizations

Precompute and cache embeddings so the request path only does lookups.
Co-locate the online feature store with the model server to shave the feature-fetch portion of the budget — network hops are the dominant cost at single-digit-millisecond targets.

Regional sharding and rollout

Serve models close to the geography to cut network latency and to allow per-region model variants where traffic patterns differ sharply. Roll out new models region by region behind shadow and canary so a regression is contained to one market.

Backpressure

If the feature store or model server saturates, shed load to the fallback (base ETA) rather than queueing. On this path, degraded-but-fast beats slow-and-correct — a rider gets a slightly-less-precise ETA instantly instead of a perfect one too late to matter. The fallback doubles as the overload valve.
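Shedding to the fallback instead of queueing can be sketched with a bounded in-flight counter. The class, the limit, and the function names are hypothetical; the point is only that a saturated service answers immediately with the base ETA.

```python
import threading


class LoadShedder:
    """Admit at most max_in_flight concurrent model calls; never queue."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def eta_with_shedding(shedder, base_eta_secs, predict_residual):
    """Return (eta_secs, fallback_used); shed to the base ETA when saturated."""
    if not shedder.try_acquire():
        return base_eta_secs, True
    try:
        return base_eta_secs + predict_residual(), False
    finally:
        shedder.release()
```

Shed requests show up in the same fallback_used signal as timeouts, so overload is visible in the fallback-rate metric without extra instrumentation.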

Bottlenecks, tradeoffs, and what you'd cut

This step is honest about the alternatives considered and what gets deferred for v1.

Residual post-processing vs end-to-end learned routing

Residual is safer, far cheaper to serve, and ships incrementally on top of an existing routing engine. A fully learned GNN over the road graph (à la Google Maps / DeepMind, which has reported accuracy gains up to around 50% in some cities) is more powerful but heavier to serve and much harder to bound. That is a longer-horizon bet, not the v1. The residual gives most of the benefit with a fraction of the operational risk.

Point vs quantile output

Quantiles cost a bit more to train and serve and complicate the contract, but they are what dispatch needs to reason about worst-case pickup. Defaulting to a point estimate is a classic Senior simplification that quietly hurts matching quality without ever showing up as a model-accuracy regression.

Full transformer vs linear transformer

Full self-attention is more expressive but O(K²) and breaks the latency budget. The choice is dictated by the SLO — and stating that explicitly is itself a signal of Staff-level judgment: you let the production constraint pick the architecture, rather than picking the fanciest model and hoping it fits.

What you’d cut for v1

Per-region model variants, the GNN, and exotic features all wait. Ship the residual linear transformer with rock-solid fallback and calibration monitoring first, because the reliability of this path matters more than the last point of MAE.

Honest closer

The hard, interesting part of this problem is not the model — it is the serving discipline (freshness, budget, fallback, calibration) that lets a learned component sit on the highest-QPS, most safety-critical path in the company.


Summary

A checklist of the load-bearing decisions.

Residual framing as a safety property: the routing engine is the floor; the model can only improve or no-op, never break the ETA. The blast radius of any model bug is bounded by "no correction."
Latency budget decomposed (feature fetch + ~3-4 ms model + serialization) with a model timeout set below the SLO so the fallback always fits within budget.
Calibrated quantiles via quantile / asymmetric Huber loss, tied to the real cost of lateness and to dispatch / batch-matching decisions, not a bare point estimate.
Closed loop on realized-vs-predicted error driving both retraining and online calibration / fallback-rate monitoring, with shadow then canary then auto-rollback deployment.
Geospatial generalization via multi-resolution geohash + feature hashing + learned embeddings, so the model degrades gracefully into the routing ETA where data is sparse.
The deterministic fallback and the base ETA passed in as a feature, so the floor is always present and free to reach.

The one-liner to leave them with: “I’m not building a model that predicts ETA — I’m building a serving system that lets a model safely correct an ETA we already trust.”


Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing | Predicts ETA directly with a regression model on trip features; treats the routing engine as just another feature. | Frames ETA as a residual ON the routing engine so the physical model is the floor; argues this bounds the blast radius of a bad prediction and makes fallback trivial. |
| Latency budget | States a p99 target (e.g. 'under 50ms') but doesn't decompose where the time goes. | Decomposes the budget: feature fetch, model forward pass (~3-4ms like DeeprETA), serialization, network; sets a hard timeout BELOW the SLO so fallback still fits the budget. |
| Feature freshness | Mentions a feature store and 'real-time features' without specifying staleness or the write path. | Separates streaming live-traffic features (seconds-fresh via Kafka→online store) from batch/embedding features; quantifies acceptable staleness and guarantees online/offline parity to avoid training-serving skew. |
| Fallback & failure | Falls back to 'the last cached prediction' or returns an error on model failure. | Returns the raw routing-engine ETA deterministically on timeout, NaN, or out-of-range output; the model is a strictly-optional enhancer; alerts on fallback rate as a health signal. |
| Calibration & uncertainty | Outputs a single point ETA; equates lower MAE with a better product. | Outputs calibrated quantiles (P50/P90) via quantile/asymmetric-Huber loss; monitors coverage; ties asymmetry to the business cost of being late vs early and to dispatch decisions. |
| Geospatial generalization | Uses raw lat/lng or a single city ID; model overfits dense areas and fails in sparse regions. | Discretizes locations into multi-resolution grids with feature hashing + learned embeddings (DeeprETA-style) so the model generalizes across geographies and degrades gracefully where data is sparse. |
| Retraining & monitoring | Retrains 'periodically'; monitors model accuracy offline only. | Closes the loop on realized-vs-predicted error; runs shadow + canary with auto-rollback on latency/error/calibration breach; monitors fallback rate, quantile coverage, and per-region bias for drift. |
| Consumer contract | Returns one ETA number to whoever calls. | Designs the contract for multiple consumers (rider UI wants a stable point, dispatch wants quantiles for batch matching); versions the schema and reasons about how ETA error propagates into matching cost. |
