Design an On-Device / Edge ML System
An edge-ML design where the datacenter assumptions break: no elastic GPUs, a fixed power/thermal envelope, no server-side request log, and a model fleet you ship over the air to millions of heterogeneous devices. Companies like Meta (ExecuTorch), Apple, Google (Pixel), and Tesla (FSD) build and interview on exactly these constraints. Staff candidates are judged on the on-device/server split, OTA versioning with safe rollback, and how they evaluate a model they can't directly observe.
Scope — frame the problem & nail the budgets
In the datacenter you buy more GPUs; on a phone or a car you get one power budget and a battery — design for the envelope, not the peak.
The fastest way to lose this interview is to say “put it on the edge, it’s faster”. That is a vibe, not a design. The real reframe is that the edge is a fixed, non-elastic budget: there is no autoscaler, no spare GPU to page in under load, no second machine to retry on. You get one SoC, one battery, one thermal mass, and a fleet of millions of devices you do not control and cannot log. Everything below is derived from that single constraint.
Who asks this & what they probe
These are the systems Meta (ExecuTorch), Apple, Google (Pixel), and Tesla (FSD) build — they ship inference onto hardware they hand to a user — and they interview AI/ML roles on exactly this shape of problem. The same prompt probes different things depending on your lane.
For the switcher specifically: an A/B-partition model update is a blue/green deploy, atomic-slot install is dual-partition firmware OTA, and cohort rollout is a canary. Lead with that distribution-and-reliability design you already own, then earn staff by putting numbers on the model-size-vs-accuracy and battery tradeoffs.
Pick one concrete workload to anchor
The design is identical in shape across workloads, but the numbers and failure modes differ — so commit to one and reference the others.
- (a) Phone camera effect / segmentation — interactive, per-frame, raw camera is privacy-sensitive, must work with no signal.
- (b) On-device pre-ranking — trims retrieval's ~10³ candidates to ~10² before a server does the expensive final rank; cuts payload and survives flaky connectivity.
- (c) Multi-camera vehicle perception — 8 cameras, hard real-time, a safety control loop, a hard ~100W chip budget.
I’ll anchor on (a) the phone camera effect because it exercises every constraint — latency, battery, privacy, offline — and pull in (b) and (c) where they sharpen a point (cascades from recsys, thermal hard-limits from the vehicle).
Budgets — write down the actual envelope
This table is the spine of the whole design. Staff candidates fill it in with numbers; seniors leave it qualitative.
Functional requirements
- Run the model locally, producing a result within the latency budget every invocation.
- Distribute new model versions over the air to a heterogeneous fleet without an app-store release.
- Provide a server path for heavy / fresh / fallback computation — optional, never required.
- Evaluate and A/B test models whose inputs and outputs never reach the server.
Non-functional requirements
- Offline-first: defined behavior at zero connectivity.
- Never blocks the UI thread (or, in a vehicle, the control loop).
- Privacy: raw user/camera/keyboard data does not leave the device.
- Safety: a bad OTA model cannot brick or strand the device.
Scope cut, stated out loud: I’m designing the inference + distribution + eval plane. I assume the base model is already trained; on-device/federated training is a related but separate design (I’ll gesture at it in the closing step because this system produces its signals).
The question this design must answer: given those budgets, what runs on-device vs on the server, how do I ship and roll back models like firmware, and how do I evaluate a model I can never observe?
Requirements in architecture — the on-device vs server split & the cascade
The split is a budget decision, not a preference
Rule of thumb: put it on-device if it is latency-critical (interactive frame / control loop), privacy-sensitive (raw camera, keyboard), or must work offline. Keep it on the server if it is heavy, freshness-dependent (global catalog, trending content), or too large to fit the memory budget.
The cascade — small first stage local, heavy second stage remote
The split is rarely all-or-nothing; the winning pattern is a cascade where a cheap on-device stage does the time-critical work and an optional heavy server stage refines it.
- Pre-ranking cascade (recsys): on-device pre-ranking trims retrieval's ~10³ candidates to ~10² high-value ones; the server runs the expensive final rank. This cuts upload payload and keeps the feature usable on flaky connectivity (cf. EdgeRec-style cloud-to-edge reranking).
- Perception cascade (vehicle): the real-time multi-camera first-stage detector runs every frame on-device; heavier map-building and fleet-learning happen off-board, asynchronously, never in the control loop.
The fallback contract is a first-class interface
Make the contract explicit so the server path is provably optional:
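One way to write that contract down is as a small function, shown here as a minimal sketch rather than a production interface. The names `Result`, `infer_with_fallback`, and `accuracy_floor` are hypothetical; the only property the sketch is meant to show is that the server is tried only when the local result is below the floor, and that a missing or unreachable server never causes a failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    value: str
    confidence: float
    source: str  # "device" or "server"


def infer_with_fallback(
    local: Callable[[], Result],
    server: Optional[Callable[[], Result]],
    accuracy_floor: float,
) -> Result:
    """Local first; the server is an optional refinement, never a dependency."""
    result = local()
    if result.confidence >= accuracy_floor:
        return result  # good enough on-device: no network involved
    if server is not None:
        try:
            return server()
        except (ConnectionError, TimeoutError):
            pass  # offline or too slow: fall through to the local result
    # Degraded but defined behavior: return the low-confidence local result.
    return result
```

Because the last line always returns the local result, the offline case is a normal code path rather than an error path.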
Why not all-server: round-trip latency breaks the latency budget, connectivity dependence breaks the offline budget, and uploading raw frames breaks the privacy budget. Why not all-on-device: model size, freshness, and global signals either don’t fit the memory budget or can’t be learned locally. Of the three options, the cascade is the only one that respects every budget in Step 0.
Estimation — make the model fit with quantization & distillation
This is where MLE depth shows. The target is not “smallest model” — it’s the smallest model that stays above the accuracy floor defined by the cascade.
Quantization
Quantization is the first and highest-leverage lever: store and compute weights/activations in lower precision.
Concretely, a 3B-parameter model: ~12GB at FP32 → ~3GB at INT8 → ~1.5GB at INT4. INT8 is the default; escalate to INT4 only when you are genuinely memory-bound, and never naked — pair it with distillation/pruning to claw the accuracy back.
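The arithmetic behind those sizes is just parameters times bits per weight. The sketch below counts weight storage only, using decimal gigabytes; it ignores activations, KV caches, and per-group quantization metadata, which add overhead in practice. The function name `model_size_gb` is illustrative.

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight storage only, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(3e9, bits):.1f} GB")
# FP32: 12.0 GB
# INT8: 3.0 GB
# INT4: 1.5 GB
```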
Distillation & structured pruning
When INT4 alone drops below the floor, recover accuracy by training a smaller student to match a larger teacher.
- Distillation: train the small model against the logits of a larger teacher so it inherits the teacher's behavior, not just hard labels.
- Structured pruning + distillation, real example: Llama 3.2 1B/3B were built by structured pruning from the 3.1 8B model, then distillation using logits from the 8B and 70B teachers — pruning shrinks, distillation restores quality.
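To make "train against the logits" concrete, here is a minimal sketch of a common soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature. It is written in plain Python for readability, not as the recipe any particular model used; the temperature value is an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher: the target
    q = softmax(student_logits, temperature)  # student: being trained
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the sense in which the student inherits behavior rather than only hard labels.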
Mixed precision & calibration — the accuracy-vs-size frontier
You don’t have to quantize uniformly. Per-channel quantization and mixed precision push you onto a better frontier:
- Keep sensitive layers (e.g. attention/output projections) at FP16/INT8, quantize tolerant layers harder.
- Good PTQ calibration data (a representative sample to set activation ranges) materially reduces the hit.
- Apple-style 4-bit weight / 8-bit activation reaches roughly 85% size reduction vs FP32 while staying usable.
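The role of calibration data can be sketched with the simplest scheme, symmetric per-tensor quantization: the calibration sample fixes the scale, and anything outside the calibrated range is clipped. This is an illustrative toy with hypothetical function names, not the quantizer of any specific runtime; real toolchains typically add per-channel scales and smarter range estimators.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Symmetric scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the integer grid and clip to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]
```

If the calibration sample is unrepresentative, real activations land outside the range and get clipped, which is one way poor calibration shows up as an accuracy hit.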
Concrete runtime flow (ExecuTorch / XNNPACK-style)
So this isn’t hand-wavy, here is the actual lowering pipeline:
Why NPU over CPU/GPU: the battery/thermal budget basically forces NPU execution. A dedicated NPU (e.g. a Hexagon-class block) runs a compact vision model like MobileNet-V2 roughly an order of magnitude faster than the mobile CPU and at markedly lower energy per inference. At several times more TOPS/watt than the GPU, the NPU is the only placement that fits the mWh-per-inference budget; treat the exact factors as device-dependent and benchmark on target hardware.
The accuracy floor: define a minimum on-device confidence/quality bar; below it, the cascade escalates to the server. The compression objective is “smallest model that stays above the floor,” and that floor is the contract between Step 2 (compression) and Step 1 (fallback).
API & distribution design — OTA model distribution & versioning
The mental model: ship models like firmware. Signed artifacts, dual slots, atomic switch, pinned last-known-good.
Packaging
Decouple the model from the app binary. Ship the .pte/model artifact through a model-delivery channel (CDN + a manifest), versioned independently of the app, so you can update the model without an app-store review cycle. The app binary references “the active model in slot X”; the manifest decides which artifact that is.
Rollout
Delta updates: send only the diff from the device’s current version (From-Version → To-Version patch), not the full artifact — critical at mobile-data scale.
A/B (dual-slot) install: download into the inactive slot, verify signature + checksum + a smoke-test inference, then atomically switch the active slot. Keep the previous slot as the pinned last-known-good.
Staged rollout cohorts: 1% → 10% → 50% → 100%, each gate held until health metrics (crash rate, latency, on-device quality proxy) clear. Any cohort regression halts and auto-rolls-back.
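Cohort membership can be decided on the device itself with a stable hash, so the server never needs a per-device assignment list. The sketch below is one way to do that under stated assumptions: `device_id` and `rollout_id` are hypothetical identifiers, and the bucket count of 100 simply matches the percentage stages above.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # percent of the fleet at each gate


def device_bucket(device_id: str, rollout_id: str) -> int:
    """Stable bucket in [0, 100), derived only from local identifiers."""
    digest = hashlib.sha256(f"{rollout_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_cohort(device_id: str, rollout_id: str, stage_percent: int) -> bool:
    """A device in the 1% stage stays in every later, larger stage."""
    return device_bucket(device_id, rollout_id) < stage_percent
```

Salting the hash with the rollout identifier keeps the same devices from always being the first canaries.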
Rollback
A bad model can’t brick the device because rollback is built into the install, not bolted on: the previous slot is still intact and signed, so reverting is an atomic pointer flip back to last-known-good — no download, no app-store round-trip.
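The install-then-flip mechanics can be sketched as a tiny in-memory store. This is a simplified illustration: `DualSlotStore` is a hypothetical name, a SHA-256 digest stands in for a real signature check, and the smoke test is passed in as a callback.

```python
import hashlib
from typing import Callable


class DualSlotStore:
    """Two model slots; `active` is the only pointer inference reads."""

    def __init__(self, initial_model: bytes):
        self.slots = {"A": initial_model, "B": None}
        self.active = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, artifact: bytes, expected_sha256: str,
                smoke_test: Callable[[bytes], bool]) -> bool:
        # Verify before the artifact can ever become active.
        if hashlib.sha256(artifact).hexdigest() != expected_sha256:
            return False
        if not smoke_test(artifact):
            return False
        target = self.inactive()
        self.slots[target] = artifact
        self.active = target  # the switch is a single pointer assignment
        return True

    def rollback(self) -> bool:
        """Flip back to the previous slot; no download involved."""
        previous = self.inactive()
        if self.slots[previous] is None:
            return False
        self.active = previous
        return True
```

A failed install leaves both the active pointer and the last-known-good slot untouched, which is the property that makes the device impossible to strand.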
Per-tier targeting: the manifest maps device capability tier → the correct variant, so the INT4 variant goes to low-RAM / no-NPU devices and the INT8/FP16 variant goes to flagship NPUs. This sets up the next step.
Data model — device heterogeneity & the fleet
Heterogeneity is the defining edge constraint vs the datacenter, where machines are fungible. Here, one model becomes a small portfolio.
Capability tiers
Tier the fleet by NPU/accelerator class, RAM, OS/runtime version, and thermal headroom — not by device name. The same silicon on an old OS may not support a needed op or delegate, so the runtime is part of the tier.
Variant matrix
You maintain a handful of variants, not one model and not one per device — the device fetches the variant its tier supports.
Per-tier routing
- Capability detection at runtime: the device reports its tier (accelerator present, available RAM, runtime version). The delivery manifest resolves tier → variant, so you never push a model that won't fit or won't run.
- Per-tier rollout + fallback: a regression on the low-end INT4 variant rolls back without touching flagships; cohorts are scoped within a tier, so blast radius is a slice of one tier.
- CPU-only floor / graceful degradation: there is always a guaranteed-to-run path — a small CPU model or a server call — so a device with no usable NPU still gets the feature, just at lower fidelity. The feature never silently disappears for a tier.
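Tier-to-variant resolution can be sketched as a best-first scan over a manifest. Everything concrete here is hypothetical: the variant names, RAM thresholds, and runtime versions are placeholders chosen for illustration, not values from any real fleet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCaps:
    has_npu: bool
    ram_mb: int
    runtime_version: tuple  # e.g. (1, 2)


# Hypothetical manifest, ordered best-first; the last entry is the CPU floor.
MANIFEST = [
    {"variant": "int8-npu", "needs_npu": True, "min_ram_mb": 6000, "min_runtime": (1, 2)},
    {"variant": "int4-cpu", "needs_npu": False, "min_ram_mb": 3000, "min_runtime": (1, 0)},
    {"variant": "cpu-floor", "needs_npu": False, "min_ram_mb": 0, "min_runtime": (0, 0)},
]


def resolve_variant(caps: DeviceCaps) -> str:
    """Return the first variant the device can actually run."""
    for entry in MANIFEST:
        if entry["needs_npu"] and not caps.has_npu:
            continue
        if caps.ram_mb < entry["min_ram_mb"]:
            continue
        if caps.runtime_version < entry["min_runtime"]:
            continue
        return entry["variant"]
    return "server-fallback"  # only reachable if the manifest has no floor entry
```

Note that a device with an NPU but an old runtime resolves to the CPU variant, which is the "runtime is part of the tier" point in code form.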
High-level architecture — eval & monitoring without a request log
Name the core problem first: the server never sees the request. Inputs and outputs stay on-device, so there is no per-request server log to compute online metrics from or to train the next model on. The standard A/B-plus-logging playbook breaks, and you must reconstruct signal without raw data.
On-device A/B (champion–challenger)
Run champion (current) and challenger (new) models on-device in shadow mode, compare locally, and report only aggregate deltas. The device is the experiment unit, and assignment is local — there is no server arm because there is no server view of the request.
Guardrail metrics, computed on-device
Cheap signals you can measure locally and aggregate:
- Latency, frame drops, thermal throttling events.
- Fallback-to-server rate (a rising rate is itself a drift alarm).
- A cheap quality proxy: confidence distribution, and champion/challenger disagreement rate.
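A shadow comparison only needs counters on the device. The sketch below, with the hypothetical class name `ShadowStats`, accumulates disagreement and fallback counts locally and exposes nothing but rates; no input or output is stored.

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    total: int = 0
    disagreements: int = 0
    fallbacks: int = 0

    def record(self, champion_out, challenger_out, fell_back: bool) -> None:
        """Compare the two outputs locally and keep only counters."""
        self.total += 1
        self.disagreements += int(champion_out != challenger_out)
        self.fallbacks += int(fell_back)

    def summary(self) -> dict:
        """The only thing that would ever be reported off-device."""
        n = max(self.total, 1)
        return {
            "n": self.total,
            "disagreement_rate": self.disagreements / n,
            "fallback_rate": self.fallbacks / n,
        }
```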
Privacy-preserving aggregate eval
To recover population-level accuracy without uploading raw inputs, combine three techniques — the production pattern behind Gboard / on-device keyboard and Siri-style telemetry:
- Federated analytics — devices compute metrics locally, only summaries leave.
- Secure aggregation — the server only ever sees the sum across many devices, never one device's contribution.
- Differential privacy — calibrated noise so no individual is recoverable from the aggregate.
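The two less intuitive techniques can be shown in a toy form. In the sketch below, pairwise masks cancel in the sum so the server learns only the total, and Laplace noise is added to that total. This is a teaching simplification: a seeded `random.Random` stands in for pairwise key agreement, there is no dropout handling, and the sensitivity and epsilon values are the caller's assumptions.

```python
import random

MODULUS = 2 ** 32  # modular arithmetic so a masked report reveals nothing by size


def masked_reports(values, seed=0):
    """Each pair (i, j) shares a mask: i adds it, j subtracts it."""
    rng = random.Random(seed)  # stand-in for per-pair shared secrets
    reports = [v % MODULUS for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            mask = rng.randrange(MODULUS)
            reports[i] = (reports[i] + mask) % MODULUS
            reports[j] = (reports[j] - mask) % MODULUS
    return reports


def aggregate(reports):
    """Masks cancel pairwise, leaving only the true sum."""
    return sum(reports) % MODULUS


def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_sum(true_sum, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to the aggregate."""
    return true_sum + laplace_noise(sensitivity / epsilon, rng)
```

For per-device counts `[3, 5, 7, 1]`, `aggregate(masked_reports(...))` recovers 16 even though each individual report is masked.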
Drift detection without raw data
Monitor lightweight on-device input-distribution profiles (statistical sketches, whylogs-style) and the fallback/disagreement rate as proxy signals. You watch distributions over the fleet, not records.
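One simple form such a profile can take is a fixed-bin histogram of a scalar input statistic, compared against a baseline with the population stability index (PSI). This is a sketch of one common drift score, swapped in for illustration rather than taken from the profiling tools named above; the bin edges and any alert threshold are assumptions to tune per feature.

```python
import math


def histogram(values, edges):
    """Counts per bin; out-of-range values are clamped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return counts


def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two histograms on the same bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        p = max(b / b_total, eps)  # eps guards against empty bins
        q = max(c / c_total, eps)
        score += (q - p) * math.log(q / p)
    return score
```

Only the bin counts would need to leave the device, and they can go through the same aggregate-only reporting path as the other metrics.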
Mapping: what you’d normally log → how you recover it on the edge
The honest admission — I give up the per-request label and reconstruct only aggregates — is exactly what staff interviewers want to hear stated plainly.
Deep dive — reliability, safety & the degradation ladder
This is where staff is won. A datacenter design implicitly assumes elasticity. An edge design must treat the power/thermal/memory envelope as a fixed constraint and design explicit degradation modes for the moment it’s hit: the feature must degrade, never hard-fail.
The degradation ladder
Every adverse trigger has a defined, graceful action. Nothing here is a crash or a hang.
The ladder is ordered cheapest-recovery-first: shrink the model before you drop frames, drop frames before you reach for the network, reach for the network before you fall back to a heuristic.
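That ordering can be encoded directly, which also makes it testable. The sketch below assumes four boolean inputs that a real scheduler would derive from its own telemetry; the `Rung` names and the function signature are hypothetical.

```python
from enum import Enum


class Rung(Enum):
    FULL = 0              # normal operation
    SMALLER_MODEL = 1     # cheapest recovery
    LOWER_FRAME_RATE = 2
    SERVER = 3
    HEURISTIC = 4         # last resort, always available


def choose_rung(over_budget: bool, smaller_model_available: bool,
                frame_rate_reducible: bool, online: bool) -> Rung:
    """Pick the cheapest recovery that is currently possible."""
    if not over_budget:
        return Rung.FULL
    if smaller_model_available:
        return Rung.SMALLER_MODEL
    if frame_rate_reducible:
        return Rung.LOWER_FRAME_RATE
    if online:
        return Rung.SERVER
    return Rung.HEURISTIC
```

Every combination of inputs maps to some rung, so there is no state in which the feature has no defined behavior.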
Safety invariant #1 — a bad model can’t brick the device
A new model must never be able to take a device offline:
- A/B dual-slot install — the old, working model is always still present in the other slot.
- Signed artifact — verified before activation; a tampered or corrupt model never runs.
- Smoke-test before activation — one successful inference in the inactive slot is required before the atomic switch.
- Watchdog — if the new model crashes on boot or in early use, the device auto-reverts to last-known-good.
This is firmware-OTA discipline applied to models: the device can fail to adopt a model, but it can never be stranded by one.
Safety invariant #2 — bounded by the budget
A pathological model must never be able to stall the device:
- A hard watchdog timeout on inference so a slow model can't block the UI thread — or, in a vehicle, the control loop.
- If inference blows the latency/power budget, the call falls back rather than waiting. The budget, not the model, has the final say on deadline.
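A deadline-bounded call can be sketched with the standard library. Treat this as an illustration of the contract only: Python cannot forcibly stop the overrunning worker, so with a single worker a stuck inference would delay the next submission, and a real runtime would enforce the deadline at the scheduler or driver level. The function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=1)  # inference runs off the caller's thread


def infer_with_deadline(model_fn, frame, deadline_s, fallback_fn):
    """Return the model output if it arrives in time, else the fallback."""
    future = _pool.submit(model_fn, frame)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        future.cancel()  # no effect once running; the late result is discarded
        return fallback_fn(frame)
```

The caller waits at most `deadline_s` regardless of how slow the model is, which is the sense in which the budget, not the model, decides the deadline.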
Thermal as a control input, not an afterthought
Sustained inference must hold cores under their thermal throttle threshold (the vehicle NPU draws ~7.5W of the ~100W chip budget). The right design de-rates proactively — the scheduler shrinks the model or drops the frame rate before the SoC throttles itself unpredictably. You’d rather choose a controlled, smaller-model degradation than let silicon thermal-throttle mid-frame and miss a deadline you can’t predict. This is the direct payoff of writing the thermal budget down in Step 0.
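Proactive de-rating can be sketched as a small controller with hysteresis, so the system does not flap between levels near the threshold. The temperatures and the number of levels below are hypothetical placeholders; the one design assumption that matters is that `derate_at_c` sits below the point where the SoC would throttle on its own.

```python
def next_derate_level(temp_c: float, level: int,
                      derate_at_c: float = 70.0,
                      recover_at_c: float = 60.0,
                      max_level: int = 3) -> int:
    """Level 0 is the full model at full frame rate; higher levels are cheaper."""
    if temp_c >= derate_at_c:
        return min(level + 1, max_level)  # step down before silicon throttles
    if temp_c <= recover_at_c:
        return max(level - 1, 0)          # step back up only once clearly cool
    return level                          # inside the hysteresis band: hold
```

Calling this once per control tick gives a controlled, predictable step down instead of an unpredictable hardware throttle.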
Offline-first contract & blast-radius control
- Offline-first contract: the feature has a defined behavior at zero connectivity (local model + cached fallback). The server is an enhancement, never a hard dependency — restated from Step 1 because it's the safety backbone, not a nicety.
- Blast-radius control: per-tier cohorts + auto-rollback mean a regression hits a slice of one tier, not the whole fleet, and recovers without an app-store release — the edge equivalent of a fast revert.
The staff tell: you treated power/thermal/memory as a fixed envelope and designed explicit degradation modes for it, instead of implicitly assuming the elasticity a datacenter design would.
Rollout strategy — staged fleet orchestration & fast revert
Step 3 covered the mechanics of an OTA install (delta, signature, dual-slot, atomic switch). This step is the operational orchestration across a fleet you can’t observe per-request: how a new model walks from 1% to 100% safely, and how it backs out fast.
The canary ladder, gated on edge-only health signals
Because there is no server request log, every gate is held on the on-device proxies from Step 5, not server metrics:
Each gate is scoped within a capability tier (Step 4), so a low-end INT4 regression never gates the flagship rollout and vice versa.
Halt and auto-revert
- Automatic halt: if any guardrail (crash rate, latency, throttle rate, fallback rate) regresses past its threshold in the active cohort, promotion stops and the cohort auto-reverts by flipping back to the pinned last-known-good slot — no download, no app-store round-trip.
- No-network-required revert: because revert is a local slot pointer flip, it does not depend on reaching the server. A device that is offline when the recall is issued still reverts locally the next time its watchdog trips on the bad slot, and otherwise picks up the recall when it reconnects.
- Server kill-switch: the manifest can also pin the whole fleet to a known-good version, halting an in-flight rollout centrally without shipping anything new.
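The promote-or-halt decision at each gate can be sketched as a comparison of cohort aggregates against a baseline. The metric names and regression limits here are hypothetical, and the sketch assumes every guardrail is a "higher is worse" rate or latency.

```python
STAGES = [1, 10, 50, 100]  # percent of one tier's fleet


def gate_decision(baseline: dict, cohort: dict, max_regression: dict) -> str:
    """Halt if any guardrail regresses past its allowed margin."""
    for metric, limit in max_regression.items():
        if cohort[metric] - baseline[metric] > limit:
            return "halt_and_revert"
    return "promote"


def next_stage(current_percent: int) -> int:
    """Advance one rung on the ladder; 100% is terminal."""
    idx = STAGES.index(current_percent)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

A usage example under those assumptions:

```python
baseline = {"crash_rate": 0.001, "p95_latency_ms": 30.0, "fallback_rate": 0.05}
cohort = {"crash_rate": 0.001, "p95_latency_ms": 31.0, "fallback_rate": 0.09}
limits = {"crash_rate": 0.0005, "p95_latency_ms": 3.0, "fallback_rate": 0.02}
print(gate_decision(baseline, cohort, limits))  # halt_and_revert
```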
Staged exposure beyond cohorts
- Time-soak per gate: hold each cohort long enough to catch slow regressions (battery drain, thermal creep) that a snapshot metric misses.
- Tier-staggered start: roll the lowest-risk tier (flagships, best telemetry) first; promote to constrained tiers only once the variant is proven, since the low-end variant is the most likely to blow a budget.
The whole point: a rollout is reversible at every step, scoped to a slice of one tier, and gated on signals you can actually compute on the edge.
Bottlenecks, observability & evolution — closing the loop with federated improvement
The closing scope: where this system’s bottlenecks live, how it observes itself, and how it hands off to the next model. Flag the training boundary explicitly as outside the core scope; the panel will respect that you know where this system ends.
Where the bottlenecks actually are
The observability story is the one that surprises people: you debug a fleet you can’t see by watching fleet-level distributions (fallback rate, disagreement rate, input-profile sketches) shift, never individual sessions.
How the next model gets better without centralizing data
- Federated learning closes the loop: devices compute model updates locally, and only encrypted/aggregated gradients leave the device via secure aggregation — the next model improves without ever centralizing raw user data. This is the Gboard / Pixel production pattern.
- Privacy hardening: add DP noise to updates so an individual contribution can't be reconstructed, and use client attestation (TEE) so only genuine clients contribute; use partial participation, rate limits, and timeouts because the fleet is mostly offline at any moment.
- Honest scoping: on-device/federated training is its own design question — but the inference and eval plane built here is exactly what produces its signals.
The handoff — the loop is the system
1. Drift alarms + DP-aggregated metrics (Step 5) fire the retraining trigger.
2. New base model → quantize / distill to the variant matrix (Step 2).
3. Package and OTA with delta + signed dual-slot install (Step 3).
4. Tiered, cohorted rollout 1% → 100% with halt/auto-revert (Steps 4 & 7).
5. Evaluate privately via shadow + federated/DP aggregates (Step 5) — which feeds step 1 again.
The system is not the model; the system is this loop.
Summary
Thesis: an edge-ML design is won by treating the device’s power/thermal/memory budget as a hard, non-elastic constraint and designing the split, the OTA distribution, and the eval loop around the fact that you can’t log the request.
The senior → staff jump, in three moves:
1. Derive the on-device/server split from written-down budgets (latency, power, memory, thermal, offline, privacy) — not from “edge is faster.”
2. Ship OTA like firmware — signed artifacts, A/B dual slots, atomic switch, per-tier cohorts, pinned last-known-good — so a bad model can’t brick the fleet.
3. Evaluate privately — shadow champion–challenger + federated/DP aggregates — instead of pretending a server log exists.
Failure modes that tank it: assuming elastic compute; “just put it on the edge, it’s faster”; one model for all devices; no rollback story; hand-waving compression (“we quantize it”) with no numbers and no accuracy floor.
Defensible numbers to drop into the conversation:
- INT8 ≈ 4x smaller than FP32 / 2–5% accuracy hit; INT4 ≈ 8x smaller than FP32 / 5–15% accuracy hit.
- NPU = several times the TOPS/watt of the mobile GPU and roughly an order of magnitude faster than the CPU on a compact vision model (benchmark on target hardware).
- Vehicle ≈ 144 TOPS at ~100W; NPU ~7.5W (~7.5% of the chip budget); de-rate before the thermal throttle threshold.
- Rollout cohorts 1% / 10% / 50% / 100%; pre-rank 10³ → 10².
Landing line: “On the edge there is no autoscaler — so I design for the envelope, ship models like firmware with atomic rollback, and reconstruct my metrics with federated, differentially-private telemetry because I never get to see the request.”
Rubric — Senior vs Staff