Design a Learned ETA Prediction Service
A learned ETA service frames arrival-time prediction as a residual on top of a physical routing engine (à la Uber's DeepETA/DeeprETA), not as raw-time regression. The design centers on the prediction-serving hot path — online feature store freshness, a tight latency budget, calibrated quantile outputs for dispatch, and a guaranteed fallback to the routing-engine ETA. Companies like Uber, Lyft, Grab, and Bolt build and interview on exactly this class of system.
Framing — ETA as a learned residual, not raw-time regression
“Don’t predict the ETA. Predict how wrong the routing engine is — and never let the model make it worse.”
Ride-hailing platforms — Uber (DeepETA / DeeprETA), Lyft, Grab, Bolt — all run a physical routing engine that produces a base ETA, plus a learned correction on top of it. This is honest industry context, not a leaked or insider question: it is simply the canonical shape this problem takes whenever a learned component sits on the most safety-critical, highest-QPS path in the company. We design to that shape.
The instinct most candidates have — “train a regression model that outputs minutes” — is the wrong frame, and a Staff interviewer is listening for whether you reach for the residual frame instead.
The residual insight
The routing engine already computes a base ETA as the sum of per-segment traversal times along the best path, using map data plus live traffic. It is a well-understood physical model. The ML model predicts only the residual — the difference between that base ETA and the realized arrival time. The served ETA is base_eta + predicted_residual. We learn the correction, never the absolute time.
Why not raw-time regression
A physical routing model already gets the large majority of trips reasonably right (Google has publicly stated that Google Maps ETA predictions are consistently accurate for over 97% of trips). If you regress raw arrival time, a model bug can emit an absurd number, such as a 4-minute trip predicted as 90 minutes, and there is no floor to catch it. If you instead learn a bounded residual on a trusted base, the worst realistic failure is “no correction applied,” and you fall back to the base ETA. Residual framing turns a potential accuracy disaster into a no-op. That is a safety property, not just an accuracy trick, and it is the through-line of this entire design.
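A minimal sketch of the residual frame, assuming times in seconds. The function names `residual_label` and `serve_eta` are illustrative, not taken from any published system; a residual of `None` stands for "the model produced nothing usable."

```python
from typing import Optional


def residual_label(base_eta_secs: float, realized_secs: float) -> float:
    """Training target: how wrong the routing engine was.

    Positive means the base ETA was too optimistic (the trip ran long).
    """
    return realized_secs - base_eta_secs


def serve_eta(base_eta_secs: float, predicted_residual_secs: Optional[float]) -> float:
    """Served ETA = base ETA + learned correction.

    With no usable correction the result is exactly the base ETA,
    which is the no-op failure mode the residual frame is chosen for.
    """
    if predicted_residual_secs is None:
        return base_eta_secs
    return base_eta_secs + predicted_residual_secs
```

A perfect residual prediction recovers the realized time, and a missing one leaves the routing engine's answer untouched; the model never has to learn the absolute duration.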
Scope cut
In scope: the online serving hot path — feature freshness, latency budget, the residual model, calibrated quantile outputs, and the deterministic fallback to the routing engine.
Out of scope: the routing / shortest-path engine itself, map matching, and raw GPS ingestion. Those are upstream systems we depend on and treat as given. We consume the base ETA; we do not build it.
Who asks this and what they probe
The Switcher angle matters: most of this system is a serving problem you already know — caches, budgets, fallback, freshness. The new vocabulary is residual learning, quantile calibration, and feature/training skew. Lean on the serving instincts and layer the ML reasoning; do not try to out-MLE the MLE on architecture.
Requirements, scale, and the latency budget
Functional requirements
Given (origin, destination, request_time, request_type) — where request_type distinguishes a rideshare pickup leg from a delivery dropoff leg — return a corrected ETA plus an uncertainty estimate (quantiles, not just a point). The base ETA from the routing engine is passed in as part of the request, so the residual model and the fallback both have it locally.
Non-functional requirements
- Latency: p50 of a few milliseconds; p99 comfortably under the caller's budget. Public writeups put the DeeprETA forward pass at a few milliseconds; a single-digit-millisecond model component (low single digits at the median) is a defensible target for the model alone.
- Availability: effectively 100%. A missing ETA breaks both the rider map and dispatch. This is why fallback is a hard requirement rather than a nicety.
- Freshness: live segment speeds should be seconds-to-low-minutes fresh; embeddings and static geo features may be hours-to-days stale.
- Calibration: quantiles must mean what they say — a P90 should be exceeded about 10% of the time.
Scale envelope
ETA is called on every map render, every dispatch candidate, and every iteration of batch matching. A single dispatch decision may evaluate many driver-rider pairs, each needing an ETA. This fans out to hundreds of thousands to over 1M QPS at peak — the single highest-QPS prediction service in the company. Every microsecond and every byte on the hot path is multiplied by that number.
The latency budget
The budget is the spine of the design. An illustrative end-to-end target of around 15 ms decomposes into feature fetch, the model forward pass (roughly 3-4 ms), and serialization.
The critical move: the model timeout is set strictly below the SLO (for example, an 8 ms model timeout inside a 15 ms budget). When the model is slow, we abandon it and return the base ETA and still meet the SLO. The budget is designed so that failing safely is also failing fast.
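The timeout-below-SLO idea can be sketched with `asyncio.wait_for`. This is an illustration under assumptions: `eta_with_timeout`, `fast_model`, and `slow_model` are hypothetical names, the 8 ms default mirrors the example figure above, and a real service would use its RPC framework's deadline mechanism instead.

```python
import asyncio


async def eta_with_timeout(base_eta_secs, predict_residual, timeout_secs=0.008):
    """Return (eta_secs, fallback_used); abandon the model once the timeout elapses."""
    try:
        residual = await asyncio.wait_for(predict_residual(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        # Model too slow: ship the base ETA, which we already hold locally.
        return base_eta_secs, True
    return base_eta_secs + residual, False


async def fast_model():
    """Stand-in for a healthy model server (hypothetical)."""
    return 30.0


async def slow_model():
    """Stand-in for a degraded model server, far beyond the timeout (hypothetical)."""
    await asyncio.sleep(0.2)
    return 30.0
```

For example, `asyncio.run(eta_with_timeout(240.0, slow_model))` gives up after the timeout and returns the base ETA with `fallback_used` set, so the slow case still finishes inside the overall budget.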
Estimation and the serving hot path
This step decomposes the request flow and the two data paths that feed it: the read path on the hot critical path, and the write path that keeps features fresh.
The read path
The routing service calls the ETA service with the base ETA already attached. The ETA service fetches features, calls the model server, applies the residual, clamps the result, and returns P50/P90 plus a fallback_used flag so callers and monitoring can see when the model was bypassed.
Because the base ETA arrives as an input, the fallback path needs zero extra network calls. That single design choice is what makes a deterministic floor cheap.
The model server
The residual model, a linear transformer (justified in the deep dive on modeling), is served behind a low-latency RPC, in the spirit of an online model-serving platform like Uber’s Michelangelo online prediction service. The ETA service issues the prediction request and merges the returned residual onto the base ETA it received in the request. Keeping the merge in the caller of the model server means even a totally dead model server degrades to “use base_eta” trivially.
The write path — feature freshness
Two tiers write into the online store. Live traffic flows through Kafka and a stream processor and lands seconds-fresh. Embedding tables and static geo features are computed in batch and published on an hourly-to-daily cadence. The serving path only ever reads the online store.
Caching and interpolation
A shared lookup table caches fresh predictions and features for predefined supersegments, refreshed periodically. Rather than recompute every horizon per request, the server interpolates between adjacent precomputed horizons. This moves work off the hot path and amortizes it across the enormous request volume.
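Horizon interpolation can be sketched as a linear blend between the two nearest precomputed horizons. The cached quantity (a per-supersegment traversal time in seconds) and the name `interpolate_horizon` are assumptions for illustration; the source does not specify what exactly is cached per horizon.

```python
from bisect import bisect_right


def interpolate_horizon(horizons_secs, cached_values, query_secs):
    """Linearly interpolate a cached value between adjacent precomputed horizons.

    horizons_secs must be sorted ascending and aligned with cached_values.
    Queries outside the precomputed range clamp to the nearest endpoint.
    """
    if query_secs <= horizons_secs[0]:
        return cached_values[0]
    if query_secs >= horizons_secs[-1]:
        return cached_values[-1]
    hi = bisect_right(horizons_secs, query_secs)  # first horizon strictly after the query
    lo = hi - 1
    t = (query_secs - horizons_secs[lo]) / (horizons_secs[hi] - horizons_secs[lo])
    return cached_values[lo] + t * (cached_values[hi] - cached_values[lo])
```

The per-request cost is one binary search and one multiply-add, while the expensive predictions are refreshed off the hot path.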
Parity by construction
The offline training store and the online serving store are physically separate but share a single feature definition. This is non-negotiable: the number-one source of silent ETA regressions is training-serving skew, where a feature is computed one way offline and another way online, and the model looks great in evaluation while quietly biasing production. One definition, two materializations.
API and consumer contract
This step defines the request/response contract and the multi-consumer schema that downstream systems depend on.
Request
base_eta_secs is required, not optional. The contract enforces that the floor is always present, so the service can never be in a state where it has no safe value to return.
Response
Multi-consumer contract
Two consumers read different fields from the same response:
- Rider UI wants a stable point estimate (p50_eta_secs). Jitter feels broken to a rider watching the map, so this field is smoothed and stable.
- Dispatch / batch matching wants quantiles (p90_eta_secs) to reason about the worst-case pickup time when matching a batch of riders to drivers.
One model, one versioned schema, different fields consumed. The schema is versioned (model_version) so we can evolve fields — add a P95, change clamping — without breaking either consumer. Defaulting to a single bare number is the classic Senior simplification that quietly starves dispatch of the uncertainty it needs.
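A hedged sketch of the contract as Python dataclasses. The field names come from the text above (`base_eta_secs`, `p50_eta_secs`, `p90_eta_secs`, `fallback_used`, `model_version`, and the request tuple); the types, the validation rule, and the `fallback_response` helper are assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EtaRequest:
    origin: Tuple[float, float]       # (lat, lng)
    destination: Tuple[float, float]  # (lat, lng)
    request_time: float               # epoch seconds (assumed representation)
    request_type: str                 # e.g. "pickup" or "dropoff" (assumed values)
    base_eta_secs: float              # required: the floor must always be present

    def __post_init__(self):
        # The comparison is False for NaN, so NaN is rejected too.
        if not (self.base_eta_secs >= 0):
            raise ValueError("base_eta_secs is required and must be non-negative")


@dataclass(frozen=True)
class EtaResponse:
    p50_eta_secs: float   # stable point estimate, read by the rider UI
    p90_eta_secs: float   # upper quantile, read by dispatch / batch matching
    fallback_used: bool   # True when the model was bypassed
    model_version: str    # schema / model version for safe evolution


def fallback_response(req: EtaRequest, model_version: str) -> EtaResponse:
    """With no model output, every quantile collapses to the base ETA."""
    return EtaResponse(req.base_eta_secs, req.base_eta_secs, True, model_version)
```

Because the request cannot be constructed without a valid `base_eta_secs`, the fallback response is always buildable from the request alone.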
Data model — features, encoding, and geospatial generalization
This step covers the feature taxonomy, how features are encoded for the model, and how locations are represented so the model generalizes across geographies.
Feature taxonomy
The routing engine’s base ETA is itself a feature: it anchors the residual and lets the model learn “in this context the base tends to run a bit optimistic.”
Feature encoding
Bucketize continuous features and embed all categoricals. In DeepETA’s ablations, bucketing continuous inputs beat feeding them raw. Each feature — continuous-bucketed or categorical — becomes a vector, like a token, except each token represents a feature, not a word. The model then reasons over feature interactions the way a transformer reasons over a sequence.
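As a rough illustration of "bucketize, then embed": a continuous feature is mapped to a bucket index, and the index selects a row in an embedding table. The boundaries, the table size, and the random initialization below are made up for the sketch; in practice boundaries are often chosen from training-data quantiles and the table is learned.

```python
import random
from bisect import bisect_right


def bucketize(value, boundaries):
    """Map a continuous value to a bucket index in [0, len(boundaries)]."""
    return bisect_right(boundaries, value)


class EmbeddingTable:
    """Toy embedding table: one vector per bucket (randomly initialized here)."""

    def __init__(self, num_rows, dim, seed=0):
        rng = random.Random(seed)
        self.rows = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_rows)]

    def lookup(self, index):
        return self.rows[index]


# Hypothetical boundaries for the base ETA feature, in seconds.
BASE_ETA_BOUNDARIES = [60, 300, 900, 1800]
base_eta_table = EmbeddingTable(num_rows=len(BASE_ETA_BOUNDARIES) + 1, dim=4)

# A 240-second base ETA becomes one feature "token" (a 4-dimensional vector).
token = base_eta_table.lookup(bucketize(240.0, BASE_ETA_BOUNDARIES))
```

Every feature, continuous or categorical, ends up as a row fetched from some table, which is why the model can treat them uniformly as tokens.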
Geospatial generalization — why raw lat/lng fails
Raw lat/lng or a single city ID overfits dense metros and collapses in sparse regions: the model memorizes downtown and has nothing meaningful to say in a thin suburb. The fix, following DeeprETA:
1. Quantize each location into multiple resolution grids via geohash.
2. Hash each grid cell with an independent hash function per resolution to crush cardinality.
3. Learn an embedding per hash bin.
Multiple resolutions directly attack sparsity. Coarse grids generalize where data is thin; fine grids specialize where it is dense. The model leans on whichever resolution has signal. Lookup is O(1) — quantize, hash, fetch — which is exactly what makes this serveable inside a few-millisecond budget. It is a deliberate space-time tradeoff: large embedding tables learned offline precompute partial answers and move compute out of the request path.
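The three steps above can be sketched as follows. The geohash encoder is the standard public algorithm; the choice of precisions, the bin count, and the use of salted BLAKE2b as the "independent hash function per resolution" are assumptions for illustration, not the published DeeprETA configuration.

```python
import hashlib

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat, lng, precision):
    """Standard geohash: alternate longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi, lng_lo, lng_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, n_bits, use_lng = [], 0, 0, True
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = lng >= mid
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = bits * 2 + int(bit)
        n_bits += 1
        use_lng = not use_lng
        if n_bits == 5:
            chars.append(BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)


def location_bins(lat, lng, precisions=(4, 5, 6), num_bins=2 ** 16):
    """One embedding-row index per resolution: quantize, hash, reduce to a bin."""
    bins = []
    for precision in precisions:
        cell = geohash(lat, lng, precision)
        salt = f"res{precision}".encode()  # a different salt per resolution
        digest = hashlib.blake2b(cell.encode(), digest_size=8, salt=salt).digest()
        bins.append(int.from_bytes(digest, "big") % num_bins)
    return bins
```

Two points a few kilometres apart share the coarse cell (and therefore the coarse embedding) while landing in different fine cells, which is the mechanism that lets thin regions borrow signal from their neighbourhood.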
Freshness tiers and parity
Each feature is tagged with an acceptable staleness. Live segment speeds (seconds-minutes, streamed) are alerted on if the pipeline lags; learned embeddings and static geo features (hours-days, batch) tolerate much more lag. Online and offline features are computed from the same definition so the training distribution matches serving — skew here is the failure where the model dazzles offline and silently biases production ETAs.
Cold start and sparse regions
With no live speeds available, the live-traffic features degrade to coarse priors and the predicted residual shrinks toward zero. The system leans on the routing engine — which is precisely the safe behavior. Sparse data does not produce a wild guess; it produces “trust the physical model.”
High-level architecture — reliability and the deterministic floor
This is the SDE backbone of the question: the fallback contract, the failure taxonomy, and the health signals that keep a learned component safe on the critical path.
The fallback contract
On model timeout, NaN or out-of-range output, feature-store miss, or model-server unavailability, the service returns the raw routing-engine base ETA. Because the base ETA arrived as an input feature, fallback requires no extra dependency or network call — it is a local read of a value we already hold.
The model is a strictly optional enhancer. Worst case, it no-ops and we ship the physical ETA. This is the core safety property residual framing buys: the blast radius of any model bug is bounded by “no correction,” never “absurd ETA.”
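A minimal sketch of the fallback contract as a single function. The exception types, the reason strings, and the 900-second residual bound are hypothetical; the point is only that every failure branch returns the same locally held `base_eta_secs`.

```python
import math


class FeatureStoreMiss(Exception):
    """Hypothetical error raised when required features are absent."""


class ModelUnavailable(Exception):
    """Hypothetical error raised when the model server cannot be reached."""


def correct_eta(base_eta_secs, fetch_features, predict_residual,
                max_abs_residual_secs=900.0):
    """Return (eta_secs, fallback_reason); fallback_reason is None when the model served."""
    try:
        features = fetch_features()
        residual = predict_residual(features)
    except FeatureStoreMiss:
        return base_eta_secs, "feature_store_miss"
    except TimeoutError:
        return base_eta_secs, "model_timeout"
    except ModelUnavailable:
        return base_eta_secs, "model_unavailable"
    if not math.isfinite(residual):
        return base_eta_secs, "non_finite_output"
    eta = base_eta_secs + residual
    if abs(residual) > max_abs_residual_secs or eta <= 0:
        return base_eta_secs, "out_of_range"
    return eta, None


def missing_features():
    """Stand-in for a feature-store miss (hypothetical)."""
    raise FeatureStoreMiss("no features for this request")


def dead_model(features):
    """Stand-in for an unreachable model server (hypothetical)."""
    raise ModelUnavailable("model server unreachable")
```

Recording the reason alongside the ETA keeps the fallback observable: the fallback rate can then be broken down by cause rather than reported as a single number.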
Failure-to-response taxonomy
Hard timeout below the SLO
The model timeout sits below the SLO (for example, 8 ms model timeout within a 15 ms budget) so that even when we abandon the model, the fallback path returns within SLO. Failing safe and failing fast are the same code path.
Health signals
- Fallback rate is the headline metric. A rising fallback rate means the model is silently absent — accuracy can look fine in aggregate while the model is barely serving. This is the canary for the whole system.
- p99 latency, quantile coverage, and per-region residual bias round out the dashboard.
Deployment safety
Ship in stages: shadow the candidate on live traffic — logging its outputs without serving them — to validate latency and calibration on real distributions; then canary a small traffic slice with auto-rollback if latency, error, or calibration breach thresholds. The deployment pipeline treats a calibration regression as a rollback trigger, not just a latency or error regression.
Deep dive — modeling, calibration, and where Staff is won
This is the longest section because it is where a Staff answer separates from a Senior one: the model architecture chosen against the latency budget, the loss tied to business cost, calibrated quantiles, and the closed retraining loop connected to dispatch decisions.
Architecture choice, decided by the SLO
DeepETA evaluated seven architectures: MLP, NODE (Neural Oblivious Decision Ensembles), TabNet, Sparsely Gated Mixture-of-Experts, HyperNetworks, full Transformer, and Linear Transformer. The deciding constraint is latency, not raw accuracy.
A full self-attention transformer is O(K²) in the number of feature tokens K — it materializes a K×K attention matrix and blows the millisecond budget. The Linear Transformer uses the kernel trick to approximate attention without ever forming that matrix, collapsing the cost to linear in K while keeping most of the feature-interaction power. The accuracy/latency trade is decided by the SLO, and we say so explicitly: we choose the linear transformer because it hits low single-digit milliseconds, and we accept the small expressiveness loss versus full attention as the price of being serveable at 1M+ QPS.
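To make the cost argument concrete, here is a dependency-free sketch of kernelized (linear) attention next to the same computation done the quadratic way. The feature map `elu(x) + 1` is an assumption borrowed from the general linear-attention literature; the source does not say which map DeepETA uses. Note that this kernel replaces softmax attention rather than reproducing it exactly.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (assumed choice of kernel)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Kernelized attention in time linear in the number of tokens.

    Keys and values are summarized once into S (d x dv) and z (d);
    each query then costs O(d * dv), and no token-by-token matrix is formed.
    """
    d, dv, n = len(keys[0]), len(values[0]), len(keys)
    phi_k = [[phi(x) for x in row] for row in keys]
    S = [[sum(phi_k[j][a] * values[j][b] for j in range(n)) for b in range(dv)]
         for a in range(d)]
    z = [sum(phi_k[j][a] for j in range(n)) for a in range(d)]
    out = []
    for q in queries:
        pq = [phi(x) for x in q]
        denom = sum(pq[a] * z[a] for a in range(d))
        out.append([sum(pq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out


def quadratic_kernel_attention(queries, keys, values):
    """Same kernel, computed by materializing every query-key score (quadratic cost)."""
    out = []
    for q in queries:
        pq = [phi(x) for x in q]
        scores = [sum(a * phi(b) for a, b in zip(pq, k)) for k in keys]
        total = sum(scores)
        out.append([sum(s * v[b] for s, v in zip(scores, values)) / total
                    for b in range(len(values[0]))])
    return out
```

Both functions return the same numbers up to floating-point rounding; the linear version simply reorders the multiplication so the key-value summary is computed once and shared by every query.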
Loss — asymmetric Huber tied to business cost
The loss has two jobs:
- Huber (parameter δ) for outlier robustness — GPS noise and rare pathological trips should not dominate the gradient the way squared error would.
- Asymmetry (parameter ω) so that underprediction (arriving late) is penalized more than overprediction (arriving early).
That asymmetry is not a modeling flourish — it encodes a real business cost. A rider told “2 minutes” who waits 6 is a worse outcome than one told “6 minutes” who waits 2. The loss makes the model prefer to be slightly pessimistic, matching the asymmetric cost of lateness.
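A small sketch of an asymmetric Huber loss. The parameterization here (weight ω on underprediction, 1 − ω on overprediction, with ω above 0.5) is one plausible convention chosen for illustration and may differ from the exact published formula; the default δ and ω values are arbitrary.

```python
def huber(error, delta):
    """Quadratic near zero, linear in the tails, so outliers cannot dominate."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def asymmetric_huber(realized_secs, predicted_secs, delta=60.0, omega=0.7):
    """Weight underprediction (the trip ran longer than promised) more heavily.

    omega > 0.5 makes lateness cost more than earliness (assumed convention).
    """
    error = realized_secs - predicted_secs
    weight = omega if error > 0 else 1.0 - omega
    return weight * huber(error, delta)


# Told 2 minutes, waited 6 (late) versus told 6 minutes, waited 2 (early).
late_cost = asymmetric_huber(realized_secs=360.0, predicted_secs=120.0)
early_cost = asymmetric_huber(realized_secs=120.0, predicted_secs=360.0)
```

The two errors have the same magnitude, but `late_cost` comes out larger than `early_cost`, which nudges the trained model toward slightly pessimistic predictions.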
Calibration — quantiles, not a bare point
The model emits calibrated quantiles (for example P50 and P90) via quantile / asymmetric loss, so downstream consumers receive uncertainty, not a bare point. For delivery especially, the 95th-percentile error matters as much as the mean — a customer cares about the worst plausible wait, not the average. A single point estimate is simply the wrong contract for dispatch and delivery.
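For reference, the standard quantile (pinball) loss can be sketched in a few lines. This is the textbook definition rather than anything specific to the system described here; the helper `mean_pinball_loss` is added only to show that the loss is minimized near the requested quantile of the data.

```python
def pinball_loss(realized, predicted, q):
    """Quantile loss: under-shooting costs q per unit, over-shooting costs (1 - q)."""
    diff = realized - predicted
    return q * diff if diff >= 0 else (q - 1.0) * diff


def mean_pinball_loss(realized_values, predicted, q):
    """Average pinball loss of one constant prediction over a sample."""
    return sum(pinball_loss(y, predicted, q) for y in realized_values) / len(realized_values)
```

With q = 0.9, predicting too low is nine times as expensive as predicting too high, so the minimizing prediction sits where roughly 90% of realized values fall at or below it. Training one output head per quantile with this loss is a common way to get P50 and P90 from a single model.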
Output clamping
The output is clamped to a sane range. A residual that would push the ETA negative or absurdly large is treated as a model failure and triggers fallback to the base ETA. Clamping is the last line of defense before a bad number reaches a rider or a matching decision.
Closing the loop — realized vs predicted
Every completed trip yields a realized arrival time. The realized-vs-predicted residual error is simultaneously the training label and the live quality signal. Retraining is driven by the error distribution — when realized error drifts, we retrain — not by a fixed calendar. The same signal that teaches the next model also tells us the current one is going stale.
Calibration drift — what Staff watches that Senior doesn’t
Monitor whether P90 predictions actually cover about 90% of outcomes. Coverage drift, not just MAE drift, is the early warning that the model has gone stale. A Senior watches accuracy; a Staff watches calibration, because a model can hold its MAE while its uncertainty estimates quietly decay — and dispatch is consuming those uncertainty estimates.
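A coverage check is simple enough to sketch directly. The function names and the 3-point tolerance are illustrative assumptions; a production monitor would compute this over a sliding window and per segment.

```python
def quantile_coverage(realized_secs, predicted_quantile_secs):
    """Fraction of trips whose realized time was at or below the predicted quantile."""
    covered = sum(1 for y, p in zip(realized_secs, predicted_quantile_secs) if y <= p)
    return covered / len(realized_secs)


def coverage_alert(coverage, target=0.90, tolerance=0.03):
    """Flag drift in either direction: too low is overconfident, too high is too wide."""
    return abs(coverage - target) > tolerance
```

If a batch of P90 predictions covers only 80% of realized arrivals, the alert fires even when MAE on the same batch looks unchanged.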
Per-region, per-time bias
Track residual bias per region and per time bucket. This catches the insidious failure where global MAE looks fine but the model systematically under-predicts in one city at rush hour — exactly the skew that quietly corrupts dispatch in that market while the aggregate dashboard stays green.
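A sketch of the per-bucket bias computation, assuming each completed trip is logged as a `(region, hour_of_day, predicted_secs, realized_secs)` tuple. That record layout and the function name are hypothetical.

```python
from collections import defaultdict


def residual_bias_by_bucket(trips):
    """Mean signed error (realized - predicted) per (region, hour) bucket.

    Positive bias means the model under-predicts there, i.e. trips run late.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for region, hour, predicted, realized in trips:
        acc = sums[(region, hour)]
        acc[0] += realized - predicted
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Toy log: errors cancel globally but not within each bucket.
trips = [
    ("sf", 8, 300.0, 360.0),
    ("sf", 8, 300.0, 360.0),
    ("nyc", 14, 300.0, 240.0),
    ("nyc", 14, 300.0, 240.0),
]
bias = residual_bias_by_bucket(trips)
```

In this toy log the global mean error is zero, yet the morning bucket in one region runs a full minute late, which is exactly the pattern an aggregate dashboard hides.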
Staff insight — ETA error propagates into matching cost
The reason lateness is penalized harder is not that MAE looks nicer. ETA error propagates into matching cost: an over-optimistic ETA causes the matcher to assign a driver who then arrives late, producing a bad assignment and a poor rider experience. The asymmetric loss is tied directly to this — we penalize lateness because lateness is what corrupts downstream dispatch decisions. The model’s loss function is, in effect, a dispatch-quality lever.
Staff insight — the routing engine as the contractual floor
The deepest argument for residual framing is organizational. The routing engine is the contractual floor: product, dispatch, and on-call can all reason about a system whose worst case is “the physical ETA we already trusted.” That bounded worst case is what makes it organizationally acceptable to put a learned model on the critical path at all. You are not asking the org to bet the dispatch system on a neural network — you are asking it to let a neural network optionally improve a number it already trusts.
Rollout, scaling, and the QPS reality
This step covers how the system scales horizontally, the cost levers that make a 1M+ QPS fleet affordable, and the rollout posture.
Horizontal scaling
At over 1M QPS the model-server fleet is large. Two properties keep per-request compute low enough to make that fleet affordable: O(1) embedding-table lookups (no per-request graph computation) and a compact linear transformer (small enough to run on CPU). The bulk of the parameters live in offline-learned embedding tables; any one prediction touches only a tiny fraction of them, so per-request compute stays small.
Cost levers
- Supersegment caching + horizon interpolation cut redundant computation across overlapping requests.
- Request micro-batching, with a collection window kept well under the model timeout, amortizes the forward pass across many requests without breaking the latency budget.
- Quantize / compile the model for CPU serving — per-request compute is small, so CPU is cheap and avoids GPU scheduling overhead at this fan-out.
Hot-path optimizations
- Precompute and cache embeddings so the request path only does lookups.
- Co-locate the online feature store with the model server to shave the feature-fetch portion of the budget — network hops are the dominant cost at single-digit-millisecond targets.
Regional sharding and rollout
Serve models close to the geography to cut network latency and to allow per-region model variants where traffic patterns differ sharply. Roll out new models region by region behind shadow and canary so a regression is contained to one market.
Backpressure
If the feature store or model server saturates, shed load to the fallback (base ETA) rather than queueing. On this path, degraded-but-fast beats slow-and-correct — a rider gets a slightly-less-precise ETA instantly instead of a perfect one too late to matter. The fallback doubles as the overload valve.
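Shedding to the fallback instead of queueing can be sketched with a non-blocking semaphore. `LoadShedder` and its `max_in_flight` limit are hypothetical; real services usually rely on their RPC framework's concurrency limits or an adaptive limiter.

```python
import threading


class LoadShedder:
    """Cap concurrent model calls; excess requests get the base ETA immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def eta(self, base_eta_secs, predict_residual):
        """Return (eta_secs, fallback_used) without ever waiting for a slot."""
        if not self._slots.acquire(blocking=False):
            # Saturated: shed to the fallback rather than queue.
            return base_eta_secs, True
        try:
            return base_eta_secs + predict_residual(), False
        finally:
            self._slots.release()
```

Because a rejected request never waits, latency under overload is bounded by the fallback path, and the fallback rate doubles as a direct saturation signal.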
Bottlenecks, tradeoffs, and what you'd cut
This step is honest about the alternatives considered and what gets deferred for v1.
Residual post-processing vs end-to-end learned routing
Residual is safer, far cheaper to serve, and ships incrementally on top of an existing routing engine. A fully learned GNN over the road graph (à la Google Maps / DeepMind, which has reported accuracy gains up to around 50% in some cities) is more powerful but heavier to serve and much harder to bound. That is a longer-horizon bet, not the v1. The residual gives most of the benefit with a fraction of the operational risk.
Point vs quantile output
Quantiles cost a bit more to train and serve and complicate the contract, but they are what dispatch needs to reason about worst-case pickup. Defaulting to a point estimate is a classic Senior simplification that quietly hurts matching quality without ever showing up as a model-accuracy regression.
Full transformer vs linear transformer
Full self-attention is more expressive but O(K²) and breaks the latency budget. The choice is dictated by the SLO — and stating that explicitly is itself a signal of Staff-level judgment: you let the production constraint pick the architecture, rather than picking the fanciest model and hoping it fits.
What you’d cut for v1
Per-region model variants, the GNN, and exotic features all wait. Ship the residual linear transformer with rock-solid fallback and calibration monitoring first, because the reliability of this path matters more than the last point of MAE.
Honest closer
The hard, interesting part of this problem is not the model — it is the serving discipline (freshness, budget, fallback, calibration) that lets a learned component sit on the highest-QPS, most safety-critical path in the company.
Summary
A checklist of the load-bearing decisions, including the Staff-vs-Senior separators.
- Residual framing as a safety property: the routing engine is the floor; the model can only improve or no-op, never break the ETA. The blast radius of any model bug is bounded by "no correction."
- Latency budget decomposed (feature fetch + ~3-4 ms model + serialization) with a model timeout set below the SLO so the fallback always fits within budget.
- Calibrated quantiles via asymmetric Huber loss, tied to the real cost of lateness and to dispatch / batch-matching decisions — not a bare point estimate.
- Closed loop on realized-vs-predicted error driving both retraining and online calibration / fallback-rate monitoring, with shadow then canary then auto-rollback deployment.
- Geospatial generalization via multi-resolution geohash + feature hashing + learned embeddings, so the model degrades gracefully into the routing ETA where data is sparse.
- The deterministic fallback and the base ETA passed in as a feature, so the floor is always present and free to reach.
The one-liner to leave them with: “I’m not building a model that predicts ETA — I’m building a serving system that lets a model safely correct an ETA we already trust.”
Rubric — Senior vs Staff