
Design an On-Device / Edge ML System

An edge-ML design where the datacenter assumptions break: no elastic GPUs, a fixed power/thermal envelope, no server-side request log, and a model fleet you ship over the air to millions of heterogeneous devices. Companies like Meta (ExecuTorch), Apple, Google (Pixel), and Tesla (FSD) build and interview on exactly these constraints. Staff candidates are judged on the on-device/server split, OTA versioning with safe rollback, and how they evaluate a model they can't directly observe.

Level: Staff
Category: AI System Design
Interview time: 60 min


WHAT THIS QUESTION TESTS

· Splits on-device vs server from explicit budgets (latency, battery/power, privacy, offline), not vibes
· Designs OTA with delta updates, device-tier targeting, A/B partition + atomic rollback
· Picks a compression path (INT8 default, INT4 + distillation when tight) with defensible size/accuracy numbers
· Evaluates without a server-side request log: shadow mode, on-device guardrails, federated/DP aggregate metrics

★ STAFF-LEVEL SIGNALS

★ Treats the power/thermal budget as a hard constraint and designs degradation modes (drop to smaller model, skip frames, fall back to server) instead of pretending it's elastic
★ Makes the safety argument that a bad OTA model can never brick or strand the device: atomic A/B slots, watchdog, signed artifacts, pinned last-known-good
★ Reasons about device heterogeneity as a fleet problem: capability tiers, per-tier model variants, and rollout cohorts rather than one model for all
★ Closes the eval loop privately: quantifies what you give up (no per-request label) and reconstructs signal via federated analytics + DP without ever uploading raw inputs

Scope — frame the problem & nail the budgets

In the datacenter you buy more GPUs; on a phone or a car you get one power budget and a battery — design for the envelope, not the peak.

The fastest way to lose this interview is to say “put it on the edge, it’s faster”. That is a vibe, not a design. The real reframe is that the edge is a fixed, non-elastic budget: there is no autoscaler, no spare GPU to page in under load, no second machine to retry on. You get one SoC, one battery, one thermal mass, and a fleet of millions of devices you do not control and cannot log. Everything below is derived from that single constraint.

Who asks this & what they probe

These are the systems Meta (ExecuTorch), Apple, Google (Pixel), and Tesla (FSD) build — they ship inference onto hardware they hand to a user — and they interview AI/ML roles on exactly this shape of problem. The same prompt probes different things depending on your lane.

| Role | What they probe |
| --- | --- |
| SDE | On-device/server split, OTA versioning across device tiers, offline + graceful fallback, A/B and rollback without a server-side request log |
| MLE | Quantization/distillation to hit a memory+battery+thermal envelope, the accuracy-vs-size frontier, calibration, evaluating data that never leaves the device |
| Switcher (SDE → AI) | Map your staged-rollout / blue-green / versioning instincts onto model OTA, then add the new muscle: accuracy budgets and the fact that you cannot log the request |

For the switcher specifically: an A/B-partition model update is a blue/green deploy, atomic-slot install is dual-partition firmware OTA, and cohort rollout is a canary. Lead with that distribution-and-reliability design you already own, then earn staff by putting numbers on the model-size-vs-accuracy and battery tradeoffs.

Pick one concrete workload to anchor

The design is identical in shape across workloads, but the numbers and failure modes differ — so commit to one and reference the others.

(a) Phone camera effect / segmentation — interactive, per-frame, raw camera is privacy-sensitive, must work with no signal.
(b) On-device pre-ranking — trims retrieval's ~10³ candidates to ~10² before a server does the expensive final rank; cuts payload and survives flaky connectivity.
(c) Multi-camera vehicle perception — 8 cameras, hard real-time, a safety control loop, a hard ~100W chip budget.

I’ll anchor on (a) the phone camera effect because it exercises every constraint — latency, battery, privacy, offline — and pull in (b) and (c) where they sharpen a point (cascades from recsys, thermal hard-limits from the vehicle).

Budgets — write down the actual envelope

This table is the spine of the whole design. Staff candidates fill it in with numbers; seniors leave it qualitative.

| Budget | Phone camera effect (anchor) | Vehicle perception (contrast) |
| --- | --- | --- |
| Latency | Under 100ms/frame interactive; ideally sub-frame | Real-time per frame, hard deadline on a control loop |
| Power / battery | A few mWh per inference; sustained use can't drain battery or burn the hand | NPU block ~7.5W of a ~100W chip budget (~7.5%) |
| Compute placement | NPU (markedly more TOPS/watt than GPU) | Dedicated NPU, e.g. ~144 TOPS class |
| Memory | Model + activations must fit RAM alongside the app | Fixed on-board memory, no swap |
| Thermal | Sustained use must not push the SoC into thermal throttling mid-session | Cores held under the thermal throttle threshold under sustained inference |
| Offline | Must fully work with zero connectivity | Must work; cannot depend on a network |
| Privacy | Raw camera frames never leave the device | Raw sensor data stays on-vehicle |

Functional requirements

Run the model locally, producing a result within the latency budget every invocation.
Distribute new model versions over the air to a heterogeneous fleet without an app-store release.
Provide a server path for heavy / fresh / fallback computation — optional, never required.
Evaluate and A/B test models whose inputs and outputs never reach the server.

Non-functional requirements

Offline-first: defined behavior at zero connectivity.
Never blocks the UI thread (or, in a vehicle, the control loop).
Privacy: raw user/camera/keyboard data does not leave the device.
Safety: a bad OTA model cannot brick or strand the device.

Scope cut, stated out loud: I’m designing the inference + distribution + eval plane. I assume the base model is already trained; on-device/federated training is a related but separate design (I’ll gesture at it in the closing step because this system produces its signals).

The question this design must answer: given those budgets, what runs on-device vs on the server, how do I ship and roll back models like firmware, and how do I evaluate a model I can never observe?

Requirements in architecture — the on-device vs server split & the cascade

The split is a budget decision, not a preference

Rule of thumb: put it on-device if it is latency-critical (interactive frame / control loop), privacy-sensitive (raw camera, keyboard), or must work offline. Keep it on the server if it is heavy, freshness-dependent (global catalog, trending content), or too large to fit the memory budget.

| Property of the work | Place it | Why |
| --- | --- | --- |
| Latency-critical / per-frame | On-device | Round-trip blows the under-100ms budget |
| Privacy-sensitive raw input | On-device | Raw frames must not leave the device |
| Must work offline | On-device | Server is unreachable by assumption |
| Heavy compute / large model | Server | Won't fit the memory or thermal budget |
| Needs fresh / global signal | Server | Catalog and trends can't be learned locally |

The cascade — small first stage local, heavy second stage remote

The split is rarely all-or-nothing; the winning pattern is a cascade where a cheap on-device stage does the time-critical work and an optional heavy server stage refines it.

                on-device (every frame / request)
          ┌─────────────────────────────────────┐
input  →  │ infer_local() → {result,            │
          │   confidence, model_version}        │
          └───────────────┬─────────────────────┘
                          │
            confident & fresh? ── yes ──→ use local result (done)
                          │
          no (low conf / stale / throttled / offline)
                          ▼
          ┌────────────────────────────┐
          │ server second stage        │  optional
          │ (heavy model, global data) │  enhancement
          └────────────────────────────┘
                          │
                 offline / server down
                          ▼
              cached or heuristic result

Pre-ranking cascade (recsys): on-device pre-ranking trims retrieval's ~10³ candidates to ~10² high-value ones; the server runs the expensive final rank. This cuts upload payload and keeps the feature usable on flaky connectivity (cf. EdgeRec-style cloud-to-edge reranking).
Perception cascade (vehicle): the real-time multi-camera first-stage detector runs every frame on-device; heavier map-building and fleet-learning happen off-board, asynchronously, never in the control loop.

The fallback contract is a first-class interface

Make the contract explicit so the server path is provably optional:

infer_local(input) -> { result, confidence, model_version }

decide():
    if model_missing or model_stale(version):       -> server_or_cached()
    if confidence < ACCURACY_FLOOR:                 -> server_or_cached()
    if device_thermally_throttled or over_budget:   -> smaller_model | server
    else:                                           -> use local result

# server_or_cached() itself degrades to a cached/heuristic
# result when offline; the server is never a hard dependency.
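To make the contract concrete, here is a minimal Python sketch of the same decision order. The names `LocalResult`, `DeviceState`, `min_supported_version`, and the 0.80 floor are illustrative assumptions, not part of any real runtime; the point is only that every branch ends in a defined path and none of them requires the network.

```python
from dataclasses import dataclass
from typing import Optional

ACCURACY_FLOOR = 0.80  # assumed confidence floor; tune per workload


@dataclass
class LocalResult:
    result: str
    confidence: float
    model_version: int


@dataclass
class DeviceState:
    online: bool
    thermally_throttled: bool
    min_supported_version: int  # hypothetical staleness cutoff from the manifest


def server_or_cached(state: DeviceState) -> str:
    # The server is an enhancement: when offline, degrade to cached/heuristic.
    return "server" if state.online else "cached_or_heuristic"


def decide(local: Optional[LocalResult], state: DeviceState) -> str:
    """Return which path serves this request, in the contract's order."""
    if local is None or local.model_version < state.min_supported_version:
        return server_or_cached(state)   # model missing or stale
    if local.confidence < ACCURACY_FLOOR:
        return server_or_cached(state)   # below the accuracy floor
    if state.thermally_throttled:
        return "smaller_model"           # over budget: shrink before the network
    return "local"
```

For example, `decide(None, DeviceState(online=False, thermally_throttled=False, min_supported_version=41))` resolves to the cached/heuristic path rather than failing.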

Why not all-server: round-trip latency, connectivity dependence, and uploading raw frames violate the privacy and offline budgets at once. Why not all-on-device: model size, freshness, and global signals either don’t fit the memory budget or can’t be learned locally. The cascade is the only design that respects every budget in Step 0.

Estimation — make the model fit with quantization & distillation

This is where MLE depth shows. The target is not “smallest model” — it’s the smallest model that stays above the accuracy floor defined by the cascade.

Quantization

Quantization is the first and highest-leverage lever: store and compute weights/activations in lower precision.

| Precision | Size vs FP32 | Typical accuracy hit | When to use |
| --- | --- | --- | --- |
| FP32 | 1x (baseline) | n/a | Never ships to edge |
| INT8 | ~4x smaller | ~2–5% | Default; broadly supported on NPUs |
| INT4 | ~8x smaller | ~5–15% | Only when memory-bound |

Concretely, a 3B-parameter model: ~12GB at FP32 → ~3GB at INT8 → ~1.5GB at INT4. INT8 is the default; escalate to INT4 only when you are genuinely memory-bound, and never naked — pair it with distillation/pruning to claw the accuracy back.
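The arithmetic behind those sizes is just parameters × bits ÷ 8. A small sketch, assuming weight storage only (it ignores quantization scales, zero-points, and activation memory, so real artifacts come out slightly larger):

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10^9 bytes) for a given precision."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(3e9, bits):.1f} GB")
# FP32: 12.0 GB
# INT8: 3.0 GB
# INT4: 1.5 GB
```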

Distillation & structured pruning

When INT4 alone drops below the floor, recover accuracy by training a smaller student to match a larger teacher.

Distillation: train the small model against the logits of a larger teacher so it inherits the teacher's behavior, not just hard labels.
Structured pruning + distillation, real example: Llama 3.2 1B/3B were built by structured pruning from the 3.1 8B model, then distillation using logits from the 8B and 70B teachers — pruning shrinks, distillation restores quality.
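As a sketch of what "train against the teacher's logits" means, the loss below is the common temperature-softened KL formulation, written in plain Python for readability. The temperature of 2.0 and the T² scaling are conventional choices I am assuming here, not something the examples above specify; a real training loop would use a tensor library and usually mix this with a hard-label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student's softened prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets the student inherit the teacher's ranking over wrong classes as well as the right one.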

Mixed precision & calibration — the accuracy-vs-size frontier

You don’t have to quantize uniformly. Per-channel quantization and mixed precision push you onto a better frontier:

Keep sensitive layers (e.g. attention/output projections) at FP16/INT8, quantize tolerant layers harder.
Good PTQ calibration data (a representative sample to set activation ranges) materially reduces the hit.
Apple-style 4-bit weight / 8-bit activation reaches roughly 85% size reduction vs FP32 while staying usable.
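A toy illustration of why per-channel scales help, assuming symmetric INT8 quantization and made-up weights: when one output channel has a much smaller range than another, a single per-tensor scale wastes almost all of its levels on the large channel.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric quantize-then-dequantize using one scale for `values`."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two output channels with very different weight ranges (illustrative values).
weights = [[0.01, -0.02, 0.015], [1.5, -2.0, 0.8]]
flat = [w for row in weights for w in row]

per_tensor = quantize_symmetric(flat)                               # one scale overall
per_channel = [w for row in weights for w in quantize_symmetric(row)]  # one scale per row

err_tensor = mean_abs_error(flat, per_tensor)
err_channel = mean_abs_error(flat, per_channel)
print(err_channel < err_tensor)  # True: the small-range channel keeps its resolution
```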

Concrete runtime flow (ExecuTorch / XNNPACK-style)

So this isn’t hand-wavy, here is the actual lowering pipeline:

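# PyTorch graph -> quantized, delegated .pte artifact.
# Import paths follow recent ExecuTorch docs and vary by release; older
# releases expose prepare_pt2e/convert_pt2e from torch.ao.quantization.quantize_pt2e.
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer, get_symmetric_quantization_config)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# model, example_inputs and calib_data are assumed to be defined by the caller.
ep = torch.export.export(model, example_inputs)

# Post-Training Quantization (PT2E) with the XNNPACK quantizer:
# 8-bit symmetric weights, 8-bit asymmetric activations, per-channel weights.
q = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_per_channel=True))
m = prepare_pt2e(ep.module(), q)
for batch in calib_data:   # calibration: observers record activation ranges
    m(*batch)              # each batch is a tuple shaped like example_inputs
m = convert_pt2e(m)

# Lower / partition to a delegate: XNNPACK (CPU) here;
# swap for a CoreML / NPU delegate on capable devices.
prog = to_edge_transform_and_lower(
    torch.export.export(m, example_inputs),
    partitioner=[XnnpackPartitioner()])
with open("effect.pte", "wb") as f:
    f.write(prog.to_executorch().buffer)
# The device runtime loads effect.pte and runs it.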

Why NPU over CPU/GPU: the battery/thermal budget basically forces NPU execution. A dedicated NPU (e.g. a Hexagon-class block) runs a compact vision model like MobileNet-V2 roughly an order of magnitude faster than the mobile CPU and at markedly lower energy per inference. At several times more TOPS/watt than the GPU, the NPU is the only placement that fits the mWh-per-inference budget; treat the exact factors as device-dependent and benchmark on target hardware.

The accuracy floor: define a minimum on-device confidence/quality bar; below it, the cascade escalates to the server. The compression objective is “smallest model that stays above the floor,” and that floor is the contract between Step 2 (compression) and Step 1 (fallback).

API & distribution design — OTA model distribution & versioning

The mental model: ship models like firmware. Signed artifacts, dual slots, atomic switch, pinned last-known-good.

Packaging

Decouple the model from the app binary. Ship the .pte/model artifact through a model-delivery channel (CDN + a manifest), versioned independently of the app, so you can update the model without an app-store review cycle. The app binary references “the active model in slot X”; the manifest decides which artifact that is.

| Mechanic | Choice | Why |
| --- | --- | --- |
| Transport | CDN + signed manifest | Update models without app release |
| Payload | Delta (From→To patch), not full | Models are tens–hundreds of MB; users on mobile data |
| Integrity | Cryptographic signature + checksum | A tampered/corrupt model must never run |
| Install | A/B dual-slot, atomic switch | Instant rollback to last-known-good |
| Targeting | Manifest maps tier → variant | INT4 to low-end, INT8/FP16 to flagships |

Rollout

Delta updates: send only the diff from the device’s current version (From-Version → To-Version patch), not the full artifact — critical at mobile-data scale.

A/B (dual-slot) install: download into the inactive slot, verify signature + checksum + a smoke-test inference, then atomically switch the active slot. Keep the previous slot as the pinned last-known-good.
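The install sequence can be sketched in a few lines. This is an in-memory model under stated assumptions: `ModelSlots` and its methods are hypothetical names, the "slots" are byte strings rather than files, and signature verification is omitted (only the checksum and smoke test are shown). On a real device the switch would be a durable, atomic pointer update such as a rename.

```python
import hashlib
from typing import Callable, Dict, Optional


class ModelSlots:
    """Two slots, 'A' and 'B'; `active` is the only pointer the runtime reads."""

    def __init__(self, initial_blob: bytes):
        self.slots: Dict[str, Optional[bytes]] = {"A": initial_blob, "B": None}
        self.active = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, blob: bytes, expected_sha256: str,
                smoke_test: Callable[[bytes], bool]) -> bool:
        """Stage into the inactive slot; switch only if every check passes."""
        target = self.inactive()
        self.slots[target] = blob
        checksum_ok = hashlib.sha256(blob).hexdigest() == expected_sha256
        if not checksum_ok or not smoke_test(blob):
            self.slots[target] = None      # discard; the active slot is untouched
            return False
        self.active = target               # single pointer flip = the atomic switch
        return True

    def rollback(self) -> bool:
        """Flip back to the previous slot (last-known-good); no download needed."""
        previous = self.inactive()
        if self.slots[previous] is None:
            return False
        self.active = previous
        return True
```

Note that a failed install never touches the active slot, which is the property the safety argument later relies on.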

# Manifest entry (per device tier)
tier: "flagship-npu"
to_version: "effect-v42-int8"
from_version: "effect-v41-int8"
patch_url: "cdn://.../v41_to_v42.delta"
sha256: "…"
signature: "…"          # verified before activation
smoke_test: required    # 1 inference must succeed in slot
cohort: "10%"           # staged rollout gate

Staged rollout cohorts: 1% → 10% → 50% → 100%, each gate held until health metrics (crash rate, latency, on-device quality proxy) clear. Any cohort regression halts and auto-rolls-back.
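One common way to implement those gates without any per-device server state is deterministic hashing: each device maps itself to a stable bucket, and the manifest's cohort percentage is the only thing that changes. The sketch below assumes a stable `device_id` string exists and uses hypothetical function names; salting the hash with the model version means each rollout draws a different 1%.

```python
import hashlib


def rollout_bucket(device_id: str, model_version: str) -> float:
    """Map a device to a stable position in [0, 100) for this rollout."""
    digest = hashlib.sha256(f"{model_version}:{device_id}".encode()).digest()
    basis_points = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return basis_points / 100


def in_cohort(device_id: str, model_version: str, cohort_percent: float) -> bool:
    """True if this device is inside the current staged-rollout gate."""
    return rollout_bucket(device_id, model_version) < cohort_percent
```

Because the bucket is fixed per device and version, widening the gate from 1% to 10% only ever adds devices; nobody who already received the model is dropped from the cohort.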

Rollback

A bad model can’t brick the device because rollback is built into the install, not bolted on: the previous slot is still intact and signed, so reverting is an atomic pointer flip back to last-known-good — no download, no app-store round-trip.

Per-tier targeting: the manifest maps device capability tier → the correct variant, so the INT4 variant goes to low-RAM / no-NPU devices and the INT8/FP16 variant goes to flagship NPUs. This sets up the next step.

Data model — device heterogeneity & the fleet

Heterogeneity is the defining edge constraint vs the datacenter, where machines are fungible. Here, one model becomes a small portfolio.

Capability tiers

Tier the fleet by NPU/accelerator class, RAM, OS/runtime version, and thermal headroom — not by device name. The same silicon on an old OS may not support a needed op or delegate, so the runtime is part of the tier.

Variant matrix

| Tier | Accelerator / RAM / runtime | Model variant |
| --- | --- | --- |
| Flagship | Modern NPU, high RAM, current runtime | FP16/INT8 |
| Mid | Older NPU/GPU, moderate RAM | INT8 |
| Low-end | Weak/no NPU, low RAM | INT4 + distilled |
| Floor | CPU-only or old runtime | Small CPU model / server |

You maintain a handful of variants, not one model and not one per device — the device fetches the variant its tier supports.

Per-tier routing

Capability detection at runtime: the device reports its tier (accelerator present, available RAM, runtime version). The delivery manifest resolves tier → variant, so you never push a model that won't fit or won't run.
Per-tier rollout + fallback: a regression on the low-end INT4 variant rolls back without touching flagships; cohorts are scoped within a tier, so blast radius is a slice of one tier.
CPU-only floor / graceful degradation: there is always a guaranteed-to-run path — a small CPU model or a server call — so a device with no usable NPU still gets the feature, just at lower fidelity. The feature never silently disappears for a tier.
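A minimal sketch of tier resolution, assuming the device can report the fields below. The thresholds (NPU generation 3, 8 GB, 4 GB, 2 GB) and the `npu_generation` scale are invented for illustration; real cutoffs come from benchmarking each variant on target hardware.

```python
from dataclasses import dataclass


@dataclass
class DeviceCapabilities:
    has_npu: bool
    npu_generation: int       # 0 when there is no NPU (hypothetical scale)
    ram_gb: float
    runtime_supported: bool   # runtime/OS can load the needed ops and delegate


VARIANT_BY_TIER = {
    "flagship": "FP16/INT8",
    "mid": "INT8",
    "low_end": "INT4 + distilled",
    "floor": "small CPU model / server",
}


def resolve_tier(caps: DeviceCapabilities) -> str:
    """Map reported capabilities to a tier; every device lands somewhere."""
    if not caps.runtime_supported:
        return "floor"                     # old runtime: guaranteed-to-run path
    if caps.has_npu and caps.npu_generation >= 3 and caps.ram_gb >= 8:
        return "flagship"
    if caps.has_npu and caps.ram_gb >= 4:
        return "mid"
    if caps.ram_gb >= 2:
        return "low_end"
    return "floor"
```

The final `return "floor"` is the design point: there is no input for which the function fails to pick a variant.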

High-level architecture — eval & monitoring without a request log

Name the core problem first: the server never sees the request. Inputs and outputs stay on-device, so there is no per-request server log to compute online metrics from or to train the next model on. The standard A/B-plus-logging playbook breaks, and you must reconstruct signal without raw data.

On-device A/B (champion–challenger)

Run champion (current) and challenger (new) models on-device in shadow mode, compare locally, and report only aggregate deltas. The device is the experiment unit, and assignment is local — there is no server arm because there is no server view of the request.
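In code, shadow mode is small: the champion's output is what the user sees, the challenger runs silently, and only counters are kept. The sketch below uses hypothetical names and treats each model as a plain callable; it ignores the extra compute cost of running two models, which in practice means sampling a fraction of requests.

```python
from dataclasses import dataclass


@dataclass
class ShadowStats:
    total: int = 0
    disagreements: int = 0

    def disagreement_rate(self) -> float:
        return self.disagreements / self.total if self.total else 0.0


def run_shadow(inputs, champion, challenger, stats: ShadowStats):
    """Serve the champion's output; run the challenger silently and count disagreements."""
    served = []
    for x in inputs:
        shown = champion(x)
        hidden = challenger(x)             # never shown to the user
        stats.total += 1
        stats.disagreements += int(shown != hidden)
        served.append(shown)
    return served


stats = ShadowStats()
served = run_shadow([0.1, 0.6, 0.9, 0.65],
                    champion=lambda x: x > 0.5,
                    challenger=lambda x: x > 0.7,
                    stats=stats)
print(stats.disagreement_rate())  # 0.5: the models disagree on 0.6 and 0.65
```

Only `stats` would ever be reported, and only in aggregated form; the inputs and both outputs stay on the device.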

Guardrail metrics, computed on-device

Cheap signals you can measure locally and aggregate:

Latency, frame drops, thermal throttling events.
Fallback-to-server rate (a rising rate is itself a drift alarm).
A cheap quality proxy: confidence distribution, and champion/challenger disagreement rate.

Privacy-preserving aggregate eval

To recover population-level accuracy without uploading raw inputs, combine three techniques — the production pattern behind Gboard / on-device keyboard and Siri-style telemetry:

Federated analytics — devices compute metrics locally, only summaries leave.
Secure aggregation — the server only ever sees the sum across many devices, never one device's contribution.
Differential privacy — calibrated noise so no individual is recoverable from the aggregate.
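A toy version of the last two techniques, to show the mechanics rather than a production protocol. Assumptions: pairwise masks come from a seeded `random.Random` as a stand-in for real pairwise key agreement, no device drops out, and Laplace noise is added once to the final sum (central DP). All function names are hypothetical.

```python
import random

MODULUS = 2**32


def masked_reports(values, seed=0):
    """Each device pair (i, j) shares a mask: i adds it, j subtracts it."""
    rng = random.Random(seed)              # stand-in for pairwise key agreement
    reports = [v % MODULUS for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            mask = rng.randrange(MODULUS)
            reports[i] = (reports[i] + mask) % MODULUS
            reports[j] = (reports[j] - mask) % MODULUS
    return reports


def secure_sum(reports):
    """Masks cancel in the sum, so the server learns only the total."""
    return sum(reports) % MODULUS


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_release(true_sum, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon before publishing."""
    rng = rng or random.Random()
    return true_sum + laplace_noise(sensitivity / epsilon, rng)


fallback_counts = [3, 0, 7, 1]             # per-device counters, never uploaded raw
reports = masked_reports(fallback_counts)
print(secure_sum(reports))                 # 11
```

Each individual report looks like a random number to the server; only the sum is meaningful, and `dp_release` is what would actually be published.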

Drift detection without raw data

Monitor lightweight on-device input-distribution profiles (statistical sketches, whylogs-style) and the fallback/disagreement rate as proxy signals. You watch distributions over the fleet, not records.
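One simple way to turn such profiles into an alarm is to compare binned distributions with the population stability index (PSI). This is a sketch with assumed inputs: a scalar feature such as mean frame brightness, fixed bin edges shipped with the model, and a baseline histogram captured at release time. Alerting thresholds (0.2 is a common rule of thumb) are workload-specific.

```python
import math


def histogram(values, edges):
    """Fraction of values falling in each bin defined by sorted `edges`."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= e for e in edges)] += 1
    return [c / len(values) for c in counts]


def psi(baseline, current, eps=1e-6):
    """Population stability index between two binned distributions."""
    return sum((c - b) * math.log((c + eps) / (b + eps))
               for b, c in zip(baseline, current))


edges = [0.25, 0.5, 0.75]
baseline = histogram([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], edges)
shifted = histogram([0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.7, 0.95], edges)
print(psi(baseline, baseline))        # 0.0
print(psi(baseline, shifted) > 0.2)   # True: inputs moved toward the bright bins
```

Only the bin fractions would leave the device, ideally through the same aggregation path as the other metrics.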

Mapping: what you’d normally log → how you recover it on the edge

| Server-side you'd normally have | Edge reconstruction |
| --- | --- |
| Per-request click/engagement log | DP-aggregated on-device engagement counters |
| Server-side A/B test | On-device champion–challenger shadow test |
| Server feature monitoring | Federated input-profile aggregates |
| Labeled eval set | Periodic on-device eval vs a shipped golden set |

The honest admission — I give up the per-request label and reconstruct only aggregates — is exactly what staff interviewers want to hear stated plainly.

Deep dive — reliability, safety & the degradation ladder

WHERE STAFF IS WON

A datacenter design implicitly assumes elasticity. An edge design must treat the power/thermal/memory envelope as a fixed constraint and design explicit degradation modes for the moment it is hit: the feature must degrade, never hard-fail.

The degradation ladder

Every adverse trigger has a defined, graceful action. Nothing here is a crash or a hang.

| Trigger | Action |
| --- | --- |
| High latency / low battery | Switch to a smaller model variant |
| Thermal throttle | Reduce frame rate / skip every other frame |
| Low confidence | Escalate to the server second stage |
| Offline or server down | Cached or heuristic result |
| Model corrupt / missing | Revert to last-known-good slot |

The ladder is ordered cheapest-recovery-first: shrink the model before you drop frames, drop frames before you reach for the network, reach for the network before you fall back to a heuristic.

Safety invariant #1 — a bad model can’t brick the device

A new model must never be able to take a device offline:

A/B dual-slot install — the old, working model is always still present in the other slot.
Signed artifact — verified before activation; a tampered or corrupt model never runs.
Smoke-test before activation — one successful inference in the inactive slot is required before the atomic switch.
Watchdog — if the new model crashes on boot or in early use, the device auto-reverts to last-known-good.

This is firmware-OTA discipline applied to models: the device can fail to adopt a model, but it can never be stranded by one.

Safety invariant #2 — bounded by the budget

A pathological model must never be able to stall the device:

A hard watchdog timeout on inference so a slow model can't block the UI thread — or, in a vehicle, the control loop.
If inference blows the latency/power budget, the call falls back rather than waiting. The budget, not the model, has the final say on deadline.
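The deadline idea can be illustrated in Python with a worker thread and a timeout. This is only a sketch of the control flow: `infer_with_deadline` is a hypothetical helper, and Python cannot kill a running thread, so a real runtime would enforce the deadline in native code (or a separate process) where the slow inference can actually be abandoned.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=2)


def infer_with_deadline(model_fn, x, deadline_s, fallback):
    """Return model_fn(x) if it finishes within deadline_s, else `fallback`."""
    future = _executor.submit(model_fn, x)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        future.cancel()    # best effort; an already-running call keeps running
        return fallback    # the caller is unblocked at the deadline regardless


def slow_model(x):
    time.sleep(0.3)        # simulates a pathological model
    return "late"


print(infer_with_deadline(lambda x: x * 2, 21, 1.0, None))    # 42
print(infer_with_deadline(slow_model, 0, 0.05, "fallback"))   # fallback
```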

Thermal as a control input, not an afterthought

Sustained inference must hold cores under their thermal throttle threshold (the vehicle NPU draws ~7.5W of the ~100W chip budget). The right design de-rates proactively — the scheduler shrinks the model or drops the frame rate before the SoC throttles itself unpredictably. You’d rather choose a controlled, smaller-model degradation than let silicon thermal-throttle mid-frame and miss a deadline you can’t predict. This is the direct payoff of writing the thermal budget down in Step 0.
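Proactive de-rating is essentially a small controller with hysteresis. In the sketch below, the temperature thresholds and the three levels are placeholder assumptions; the only requirement is that `derate_at_c` sits below the point where the SoC would throttle on its own, and that the gap to `restore_at_c` prevents flapping between levels.

```python
def next_level(current_level: int, temp_c: float, *,
               derate_at_c: float = 40.0, restore_at_c: float = 36.0,
               max_level: int = 2) -> int:
    """Levels: 0 = full model, 1 = smaller model, 2 = smaller model + reduced frame rate."""
    if temp_c >= derate_at_c:
        return min(current_level + 1, max_level)   # step down before the SoC throttles
    if temp_c <= restore_at_c:
        return max(current_level - 1, 0)           # recover one step at a time
    return current_level                           # inside the hysteresis band: hold


level, trace = 0, []
for temp in [35, 41, 42, 38, 35, 35]:
    level = next_level(level, temp)
    trace.append(level)
print(trace)  # [0, 1, 2, 2, 1, 0]
```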

Offline-first contract & blast-radius control

Offline-first contract: the feature has a defined behavior at zero connectivity (local model + cached fallback). The server is an enhancement, never a hard dependency — restated from Step 1 because it's the safety backbone, not a nicety.
Blast-radius control: per-tier cohorts + auto-rollback mean a regression hits a slice of one tier, not the whole fleet, and recovers without an app-store release — the edge equivalent of a fast revert.

The staff tell: you treated power/thermal/memory as a fixed envelope and designed explicit degradation modes for it, instead of implicitly assuming the elasticity a datacenter design would.

Rollout strategy — staged fleet orchestration & fast revert

Step 3 covered the mechanics of an OTA install (delta, signature, dual-slot, atomic switch). This step is the operational orchestration across a fleet you can’t observe per-request: how a new model walks from 1% to 100% safely, and how it backs out fast.

The canary ladder, gated on edge-only health signals

Because there is no server request log, every gate is held on the on-device proxies from Step 5, not server metrics:

| Cohort | Gate before promoting | Watch for |
| --- | --- | --- |
| 1% | Crash-free install + smoke-test pass rate | Boot/load failures, slot-switch errors |
| 10% | Latency p95, thermal-throttle event rate | Frame drops, battery regressions |
| 50% | Fallback-to-server rate, champion/challenger disagreement | Quality-proxy drift |
| 100% | All guardrails stable across tiers | Tier-specific regressions |

Each gate is scoped within a capability tier (Step 4), so a low-end INT4 regression never gates the flagship rollout and vice versa.

Halt and auto-revert

Automatic halt: if any guardrail (crash rate, latency, throttle rate, fallback rate) regresses past its threshold in the active cohort, promotion stops and the cohort auto-reverts by flipping back to the pinned last-known-good slot — no download, no app-store round-trip.
No-network-required revert: because revert is a local slot pointer flip, it needs no download. Devices that are offline when a recall is issued are still protected by the local watchdog, which flips back to last-known-good if the new model crashes on load or in early use.
Server kill-switch: the manifest can also pin the whole fleet to a known-good version, halting an in-flight rollout centrally without shipping anything new.
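The promotion decision itself reduces to comparing aggregated cohort metrics against a baseline. A minimal sketch, assuming every metric is "higher is worse", that the values arrive already aggregated across the cohort, and that the limits shown are placeholders:

```python
def evaluate_gate(baseline: dict, cohort: dict, max_regression: dict):
    """Return ('promote', []) or ('halt_and_revert', [regressed metric names])."""
    regressed = [metric for metric, limit in max_regression.items()
                 if cohort[metric] - baseline[metric] > limit]
    return ("halt_and_revert", regressed) if regressed else ("promote", [])


baseline = {"crash_rate": 0.001, "latency_p95_ms": 80.0,
            "throttle_rate": 0.02, "fallback_rate": 0.05}
limits = {"crash_rate": 0.0005, "latency_p95_ms": 10.0,
          "throttle_rate": 0.01, "fallback_rate": 0.02}

healthy = dict(baseline, latency_p95_ms=85.0)
drifting = dict(baseline, fallback_rate=0.09)

print(evaluate_gate(baseline, healthy, limits))    # ('promote', [])
print(evaluate_gate(baseline, drifting, limits))   # ('halt_and_revert', ['fallback_rate'])
```

Running this per tier, rather than fleet-wide, is what keeps a low-end regression from blocking the flagship rollout.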

Staged exposure beyond cohorts

Time-soak per gate: hold each cohort long enough to catch slow regressions (battery drain, thermal creep) that a snapshot metric misses.
Tier-staggered start: roll the lowest-risk tier (flagships, best telemetry) first; promote to constrained tiers only once the variant is proven, since the low-end variant is the most likely to blow a budget.

The whole point: a rollout is reversible at every step, scoped to a slice of one tier, and gated on signals you can actually compute on the edge.

Bottlenecks, observability & evolution — closing the loop with federated improvement

The closing scope: where this system’s bottlenecks live, how it observes itself, and how it hands off to the next model. Flag the training boundary as outside the core scope but on your radar; the panel will respect that you know where this system ends.

Where the bottlenecks actually are

| Bottleneck | Why it bites on the edge | Mitigation |
| --- | --- | --- |
| Memory / activation peak | No swap; OOM is a hard crash | Variant matrix + INT4 floor; cap batch/resolution |
| Thermal sustain | Throttling silently degrades latency | Proactive de-rate ladder (Step 6) |
| OTA payload | Tens–hundreds of MB on mobile data | Delta patches; tier-targeted variants |
| Observability gap | No per-request log to debug from | DP-aggregated proxies, not raw records |
| Fleet skew | Long tail of old runtimes/silicon | Capability tiers + guaranteed CPU floor |

The observability story is the one that surprises people: you debug a fleet you can’t see by watching fleet-level distributions (fallback rate, disagreement rate, input-profile sketches) shift, never individual sessions.

How the next model gets better without centralizing data

Federated learning closes the loop: devices compute model updates locally, and only encrypted/aggregated gradients leave the device via secure aggregation — the next model improves without ever centralizing raw user data. This is the Gboard / Pixel production pattern.
Privacy hardening: add DP noise to updates plus client attestation (TEE) so an individual update can't be reverse-engineered; use partial participation, rate limits, and timeouts because the fleet is mostly offline at any moment.
Honest scoping: on-device/federated training is its own design question — but the inference and eval plane built here is exactly what produces its signals.

The handoff — the loop is the system

1. Drift alarms + DP-aggregated metrics (Step 5) fire the retraining trigger.

2. New base model → quantize / distill to the variant matrix (Step 2).

3. Package and OTA with delta + signed dual-slot install (Step 3).

4. Tiered, cohorted rollout 1% → 100% with halt/auto-revert (Steps 4 & 7).

5. Evaluate privately via shadow + federated/DP aggregates (Step 5) — which feeds step 1 again.

The system is not the model; the system is this loop.


Summary

Thesis: an edge-ML design is won by treating the device’s power/thermal/memory budget as a hard, non-elastic constraint and designing the split, the OTA distribution, and the eval loop around the fact that you can’t log the request.

The senior → staff jump, in three moves:

1. Derive the on-device/server split from written-down budgets (latency, power, memory, thermal, offline, privacy) — not from “edge is faster.”

2. Ship OTA like firmware — signed artifacts, A/B dual slots, atomic switch, per-tier cohorts, pinned last-known-good — so a bad model can’t brick the fleet.

3. Evaluate privately — shadow champion–challenger + federated/DP aggregates — instead of pretending a server log exists.

Failure modes that tank it: assuming elastic compute; “just put it on the edge, it’s faster”; one model for all devices; no rollback story; hand-waving compression (“we quantize it”) with no numbers and no accuracy floor.

Defensible-numbers checklist to drop:

INT8 ≈ 4x smaller / 2–5% hit; INT4 ≈ 8x smaller / 5–15% hit.
NPU = several times the TOPS/watt of the mobile GPU and roughly an order of magnitude faster than the CPU on a compact vision model (benchmark on target hardware).
Vehicle ≈ 144 TOPS at ~100W; NPU ~7.5W (~7.5% of the chip budget); de-rate before the thermal throttle threshold.
Rollout cohorts 1% / 10% / 50% / 100%; pre-rank 10³ → 10².

Landing line: “On the edge there is no autoscaler — so I design for the envelope, ship models like firmware with atomic rollback, and reconstruct my metrics with federated, differentially-private telemetry because I never get to see the request.”


Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing & budgets | States on-device vs server tradeoff qualitatively (latency, privacy). Picks one workload (camera effect or pre-ranking) and scopes it. | Writes down the actual envelope, e.g. NPU ~7.5W of a ~100W vehicle budget, or a phone target of under 100ms and under X mWh/inference, and derives the split from those numbers plus offline and privacy requirements. |
| On-device / server split | Puts the model on-device for latency/privacy; calls the server for heavy or fresh computation. | Designs a cascade: on-device pre-filter/pre-rank (e.g. 10³→10² candidates) or first-stage perception, server for the expensive second stage, with an explicit fallback contract when local inference is unavailable, stale, or throttled. |
| Model compression (MLE depth) | Says 'quantize to INT8' and maybe 'distill a smaller model.' | Chooses INT8 (≈4× smaller, ~2–5% accuracy hit) as default, escalates to INT4 (≈8×) + distillation/structured pruning only when memory-bound, names PTQ calibration and per-channel/mixed-precision, and states the accuracy floor that triggers server fallback. |
| OTA distribution & versioning | Pushes new model versions to devices via the app/update channel with staged rollout. | Delta/patch updates, signed artifacts, A/B (dual-slot) install with atomic switch + auto-rollback, per-device-tier targeting, rollout cohorts (1%/10%/50%/100%), and a pinned last-known-good so a bad model can't brick the device. |
| Device heterogeneity | Acknowledges some devices are slower; ships a smaller model to old phones. | Models the fleet as capability tiers (NPU class, RAM, OS, thermal headroom), maintains a small matrix of model variants per tier, and routes rollout + fallback per tier, including a CPU-only / no-NPU floor. |
| Eval & monitoring without a request log | Monitors crash rate and latency; runs a normal server A/B if possible. | Accepts there is no per-request server log: uses shadow/champion-challenger on-device, on-device guardrail metrics, and federated analytics + differential privacy / secure aggregation to recover aggregate accuracy and drift signals without uploading raw inputs. |
| Reliability & failure modes | Falls back to server if the model fails or the device is offline. | Enumerates the degradation ladder (thermal throttle → smaller model → frame/feature skip → server fallback → cached/heuristic), plus watchdog timeouts, version pinning, and an offline-first contract so the feature degrades, never hard-fails. |
