Design an LLM Inference Serving Engine
A Staff-level walkthrough of the LLM inference serving engine that OpenAI, Anthropic, and Google DeepMind run behind their chat APIs — a goodput-maximizing token factory bounded by HBM. It pins down continuous batching over a paged KV cache, prefill/decode-aware SLO scheduling, prefix-cache reuse with cache-affinity routing, FP8 weights+KV behind eval gates, and adaptive speculative decoding, optimizing TTFT/TPOT goodput per GPU-dollar rather than raw tokens/sec.
Scope & ambiguity
Let me pin down the boundary first. I’m building the GPU-resident serving engine behind /v1/chat/completions — the thing that takes a tokenized request and turns weights plus a KV cache into a stream of tokens under latency SLOs. I’m not building the model (pretraining, fine-tuning), the retrieval layer that fills the prompt, or the agent loop above it; those are upstream and I treat them as clients. The one tension that governs every decision: HBM is the scarce resource, not QPS or FLOPs, and the objective is goodput — requests served within their latency budget — per GPU-dollar, not raw tokens/sec. This is the engine OpenAI, Anthropic, and Google DeepMind each run behind their chat APIs, and it’s a system that’s ~70% distributed systems you already know wrapped around a few GPU-internal facts you have to get right.

This is an AI-infrastructure design problem framed the way frontier labs (OpenAI, Anthropic, Google DeepMind, NVIDIA) build and interview on it. It is not a leaked or insider question — it is the standard shape of “serve a large decoder model behind a streaming API under SLOs,” and the value is in reasoning correctly about GPU mechanics, not in trivia.
Who asks this & what they probe
Anchor scenario
To keep numbers concrete I’ll anchor on a realistic deployment:
- A dense 70B-class decoder (Llama-3.1-70B shape: 80 layers, GQA with 8 KV heads, head_dim 128) plus a few 7–8B fine-tunes, served tensor-parallel TP=2 or TP=4 on H100-80GB.
- Traffic is multi-tenant and streaming: interactive chat, RAG (long prompts), and a throughput-first batch lane, all on the same fleet.
Out of scope
Pretraining and fine-tuning the model; the retrieval/RAG pipeline that produces the prompt; agent orchestration and tool execution above the API; offline eval harnesses (they’re a dependency we gate on, not a thing we build here).
Phased scope
Phase 1 is a single-model, single-region engine with continuous batching over a paged KV cache and SSE streaming. Phase 2 adds prefix caching with cache-affinity routing, FP8 weights and KV, and multi-LoRA. Phase 3 adds speculative decoding, prefill/decode disaggregation, and multi-region. The headline metric from minute one is goodput: the fraction of requests that meet both their TTFT and TPOT SLOs — a slow stream that technically completes is a miss.
Requirements
Functional requirements
- SSE streaming of token deltas (stream=true) with incremental detokenization.
- Chat templating and system prompts; function/tool calling; JSON-schema structured output (constrained decoding); logprobs.
- Cancellation: client disconnect must free the sequence's KV blocks immediately, not at end-of-generation.
- Per-tenant token quotas metered on input + output tokens.
- Multi-tenant request mixing on shared replicas with isolation.
Non-functional SLOs by traffic class
The two latency metrics are TTFT (time-to-first-token, dominated by prefill) and TPOT (time-per-output-token, the inter-token gap during decode). They trade against each other through batch size, so they need separate budgets per class.
Cross-cutting
- Multi-tenant isolation: one tenant's long-context flood must not starve another's TTFT.
- Quality parity: any FP8 / speculative-decode config must match the FP16 baseline within a defined tolerance (greedy exact-match + benchmark deltas), gated before promotion.
- Cost: a target $/Mtok the well-utilized fleet must hit.
The objective: goodput
Goodput is the objective function. Raw tokens/sec is a trap — you can maximize it by jamming the batch full, which inflates TPOT and silently fails the interactive SLO. Goodput counts only tokens delivered inside both SLO budgets, so it directly captures the throughput-vs-latency tradeoff the scheduler exists to manage.
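To make the metric concrete, here is a minimal sketch of how goodput could be computed from per-request telemetry. The `RequestStats` shape and the SLO values (500 ms TTFT, 50 ms TPOT) are assumptions for illustration, not budgets from this design.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float        # time to first token, seconds
    tpot_s: float        # mean time per output token, seconds
    output_tokens: int

def goodput(requests, ttft_slo_s, tpot_slo_s):
    """Return (fraction of requests meeting both SLOs, tokens from those requests)."""
    good = [r for r in requests
            if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s]
    frac = len(good) / len(requests) if requests else 0.0
    return frac, sum(r.output_tokens for r in good)

# Hypothetical sample: four completed requests.
sample = [
    RequestStats(0.3, 0.03, 200),
    RequestStats(0.4, 0.09, 500),   # completes, but the stream is too slow (TPOT miss)
    RequestStats(1.2, 0.02, 100),   # TTFT miss
    RequestStats(0.2, 0.04, 300),
]
frac, good_tokens = goodput(sample, ttft_slo_s=0.5, tpot_slo_s=0.05)
raw_tokens = sum(r.output_tokens for r in sample)
print(frac, good_tokens, raw_tokens)
```

In this sample the raw token count is more than double the tokens that count toward goodput, which is exactly the gap a throughput-only dashboard hides.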
Back-of-envelope estimation
Weights and tensor parallelism
Rule: pick the smallest TP that fits with headroom for KV. TP adds an all-reduce per layer, so over-sharding burns interconnect bandwidth and hurts decode latency. FP8 weights at TP=2 leave ~35 GB/GPU for KV after weights and activations.
KV cache per token (the real constraint)
KV/token = 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes. GQA (8 KV heads, not 64 query heads) is what makes 70B servable — it shrinks KV 8x versus multi-head.
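Plugging the anchor model's shape into that formula gives the per-token footprint. This is a small sketch; the function name is mine, and the 64-head multi-head case is included only to show the 8x GQA saving.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # The factor of 2 covers the K and V tensors stored at every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16_gqa = kv_bytes_per_token(80, 8, 128, 2)    # 327,680 B = 320 KiB per token
fp8_gqa = kv_bytes_per_token(80, 8, 128, 1)     # 163,840 B = 160 KiB per token
fp16_mha = kv_bytes_per_token(80, 64, 128, 2)   # 2,621,440 B = 2.5 MiB per token

print(fp16_gqa, fp8_gqa, fp16_mha // fp16_gqa)
```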
With ~35 GB free at TP=2, that is ~100k cached tokens/GPU at FP16 KV, ~210k at FP8 KV — on the order of a hundred thousand tokens, i.e. tens of concurrent long streams. KV capacity, not compute, sets your concurrency ceiling.
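The capacity figures can be reproduced with a few lines. This sketch assumes the KV heads shard evenly across the TP group, so a replica pools the free HBM of all its GPUs and the per-GPU number is that pool divided back by TP. The 8,192-token stream length is a hypothetical stand-in for "long".

```python
KV_BYTES_FP16 = 2 * 80 * 8 * 128 * 2   # bytes per cached token, whole model
KV_BYTES_FP8 = 2 * 80 * 8 * 128 * 1

def cached_tokens_per_gpu(free_bytes_per_gpu, kv_bytes_per_token, tp):
    # Each token's KV is split across the TP group, so capacity is set by the
    # replica's pooled free HBM; divide by tp to get an amortized per-GPU figure.
    replica_tokens = (free_bytes_per_gpu * tp) // kv_bytes_per_token
    return replica_tokens // tp

def max_streams(cached_tokens, context_len):
    return cached_tokens // context_len

FREE = 35 * 10**9   # ~35 GB free per GPU, from the estimate above
fp16_tokens = cached_tokens_per_gpu(FREE, KV_BYTES_FP16, tp=2)
fp8_tokens = cached_tokens_per_gpu(FREE, KV_BYTES_FP8, tp=2)
print(fp16_tokens, fp8_tokens)
print(max_streams(fp16_tokens, 8192), max_streams(fp8_tokens, 8192))
```

Even at FP8 the pool holds only a few dozen 8k-token streams per GPU, which is why admission control has to reason in KV blocks rather than request counts.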
Roofline: why decode is bandwidth-bound
Decode generates one token per sequence per step, so arithmetic intensity is tiny — the GPU must stream the entire weight set from HBM to produce that token and is starved for data, not FLOPs. A single stream is therefore weight-bandwidth-bound: 35 GB of FP8 weights/GPU at TP=2 ÷ 3.35 TB/s is ~10 ms per step, which caps a lone stream below ~100 tok/s before any KV reads or all-reduce overhead. Continuous batching is the escape hatch: many sequences share one weight read per step, so aggregate decode lands at ~1.5–3k output tok/s/GPU for 70B TP=2 (~20–40 tok/s/stream). Prefill is the opposite — it processes the whole prompt in parallel, is compute-bound, and saturates the tensor cores.
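The bandwidth bound is simple division. The sketch below is an idealized ceiling: it counts only the weight read and ignores KV-cache reads, all-reduce time, and kernel overhead, so real numbers land lower. The function names are mine.

```python
HBM_BYTES_PER_S = 3.35e12        # H100 HBM bandwidth used in the text
WEIGHT_BYTES_PER_GPU = 35e9      # FP8 70B weights split across TP=2

def decode_step_seconds(weight_bytes, hbm_bytes_per_s):
    # Every decode step must stream the full local weight shard once.
    return weight_bytes / hbm_bytes_per_s

def lone_stream_ceiling(weight_bytes, hbm_bytes_per_s):
    # One sequence gets one token per step.
    return 1.0 / decode_step_seconds(weight_bytes, hbm_bytes_per_s)

def batched_ceiling(batch_size, weight_bytes, hbm_bytes_per_s):
    # Idealized: the batch shares a single weight read per step.
    return batch_size * lone_stream_ceiling(weight_bytes, hbm_bytes_per_s)

step = decode_step_seconds(WEIGHT_BYTES_PER_GPU, HBM_BYTES_PER_S)
print(f"{step * 1e3:.2f} ms/step")
print(f"{lone_stream_ceiling(WEIGHT_BYTES_PER_GPU, HBM_BYTES_PER_S):.1f} tok/s lone-stream ceiling")
```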
Throughput multipliers and cost
- Continuous batching + PagedAttention: ~2–4x throughput vs static batching, mostly by eliminating padding and fragmentation.
- Prefix caching: 60–90% reuse on shared system prompts and multi-turn history — recompute you skip entirely.
- A well-utilized 70B FP8 fleet lands at low single-digit $/Mtok.
Staff note: size the long tail, not the mean
KV pressure and tail latency are driven by the long tail of context and output length, not the average. A p50 of 1k tokens with a p99 of 100k means a handful of requests can consume the KV pool and stall everyone. Size capacity, admission control, and per-request max-token clamps to the tail.
API design
Public surface (OpenAI-compatible)
Internal gateway → router → engine RPC
API governance
- Cancellation: client disconnect (or explicit cancel) propagates to the engine and frees KV blocks immediately — the single biggest source of wasted HBM if missed.
- Idempotency keys + request IDs threaded through gateway, router, and engine for tracing and safe retries.
- Rate limiting is token-metered, not RPS: the quota contract counts input + output tokens per tenant per window. One 100k-token request costs far more than a hundred 100-token requests; RPS limits would price it wrong.
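A token-metered quota can be sketched as a fixed-window counter charged in tokens rather than requests. Everything here is an assumption for illustration: the `TokenQuota` class, reserving `max_output_tokens` up front and refunding the unused part, and the fixed (not sliding) window.

```python
import time
from collections import defaultdict

class TokenQuota:
    """Fixed-window quota counting input + output tokens per tenant."""

    def __init__(self, tokens_per_window, window_s, clock=time.monotonic):
        self.limit = tokens_per_window
        self.window_s = window_s
        self.clock = clock
        self._usage = defaultdict(lambda: (-1, 0))  # tenant -> (window index, tokens used)

    def _current(self, tenant):
        window = int(self.clock() // self.window_s)
        seen, used = self._usage[tenant]
        return window, (used if seen == window else 0)

    def try_reserve(self, tenant, input_tokens, max_output_tokens):
        # Charge the worst case up front so a long generation cannot overrun.
        window, used = self._current(tenant)
        cost = input_tokens + max_output_tokens
        if used + cost > self.limit:
            return False
        self._usage[tenant] = (window, used + cost)
        return True

    def refund(self, tenant, unused_output_tokens):
        # Called at end-of-stream or on cancellation.
        window, used = self._current(tenant)
        self._usage[tenant] = (window, max(0, used - unused_output_tokens))
```

With a 100k-token window, a single 99k-input request consumes the tenant's whole budget, while a hundred 100-token requests would use a tenth of it. An RPS limit would have scored those the other way around.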
Data model
The engine is deeply stateful — the KV cache is the per-sequence state. The discipline is separating what’s ephemeral (recomputable, evictable, spillable) from what’s durable.
Per-sequence state
Global paged KV pool
Prefix-cache index
Ephemeral vs durable
The key insight: treat the KV cache as virtual memory. Block tables are page tables, eviction is paging, host-RAM offload is swap. That framing makes the whole memory subsystem fall out of familiar OS mechanics.
High-level architecture
Data path
Control plane
- Model/adapter registry — versioned weights and LoRA adapters, source of truth.
- Autoscaler — scales on queue depth + KV utilization (not CPU/RPS), keeps a warm pool to hide multi-GB cold starts.
- Weight/adapter loader — snapshot loader; swaps LoRA over PCIe.
- Observability — goodput, TTFT/TPOT, MFU, KV occupancy, cost/Mtok.
Lifecycle trace: warm-prefix multi-turn chat
1. Turn 3 of a chat arrives; gateway tokenizes and computes prefix_hash over system prompt + prior turns.
2. Router sends it to the replica that already holds that prefix’s KV blocks (cache-affinity routing).
3. Scheduler admits it; the shared prefix is a cache hit — those blocks are reused via ref-counted copy-on-write, so prefill only computes the new user turn, not the whole history.
4. Decode steps join the continuous batch; each token is sampled, detokenized, and streamed as an SSE delta.
5. On completion (or disconnect), the sequence’s non-shared blocks return to the free list; shared prefix blocks stay cached for the next turn.
The win: skipping prefill recompute on the shared prefix is often the difference between a 1.5s and a 300ms TTFT on long multi-turn chats.
Deep dives
This is where Staff is won. I’ll go deep on four mechanisms and touch TP-vs-PP, FP8, and multi-LoRA.
Deep dive A: Continuous batching + PagedAttention
Static batching waits for the whole batch to finish the slowest sequence before starting new work — terrible utilization when output lengths vary by 100x. Continuous batching is iteration-level scheduling: at every decode step the scheduler can evict finished sequences and admit waiting ones into the running batch. The analogy is work-stealing over a shared run queue — the GPU never idles waiting on a straggler.
The enabler is PagedAttention. Naive KV caches reserve a contiguous max-length buffer per sequence, wasting 60–80% to internal fragmentation and reservation for tokens never generated. PagedAttention stores KV in fixed-size non-contiguous blocks addressed through a per-sequence block table — exactly OS paging. Benefits:
- Near-100% KV utilization: no external fragmentation, waste is limited to each sequence's last partially filled block, and a page is allocated only when needed.
- Copy-on-write sharing: identical prefixes (system prompts, multi-turn history, beam-search branches) share physical blocks via ref counts; fork is cheap.
- Makes preemption clean — swap a sequence's pages to host RAM and restore later.
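The block-table idea fits in a few dozen lines. This is a toy sketch, not any engine's real allocator: `BlockPool` and `Sequence` are hypothetical names, block IDs stand in for GPU memory, and the actual KV copy on a copy-on-write is left as a comment.

```python
class BlockPool:
    """Fixed pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = [0] * num_blocks

    def alloc(self):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def share(self, block):
        self.refs[block] += 1

    def release(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.free.append(block)

class Sequence:
    def __init__(self, pool, block_size=16):
        self.pool = pool
        self.block_size = block_size
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.pool.alloc())   # allocate only on demand
        elif self.pool.refs[self.block_table[-1]] > 1:
            # Tail block is shared with a fork: copy on write.
            old = self.block_table[-1]
            self.block_table[-1] = self.pool.alloc()
            # A real engine would copy the old block's KV contents here.
            self.pool.release(old)
        self.num_tokens += 1

    def fork(self):
        # Cheap fork: the child shares every physical block with the parent.
        child = Sequence(self.pool, self.block_size)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.pool.share(block)
        return child

    def free(self):
        # Cancellation and completion both end up here.
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0
```

Freeing a sequence only returns blocks whose reference count drops to zero, so a cancelled request never pulls a shared prefix out from under its siblings.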
Scheduler loop, per step:
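A minimal sketch of one such iteration, with hypothetical field names and a strict-FIFO admission policy that I'm assuming for simplicity (a production scheduler would layer priorities, deadlines, and preemption on top):

```python
from collections import deque

def schedule_step(running, waiting, free_blocks, max_batch, blocks_needed):
    """One iteration: retire finished sequences, then admit waiting ones.

    running: list of dicts with 'done' and 'blocks' keys
    waiting: deque of request dicts, oldest first
    blocks_needed: callable estimating KV blocks for a waiting request
    Returns the new running list and the remaining free block count.
    """
    still_running = []
    for seq in running:
        if seq["done"]:
            free_blocks += seq["blocks"]      # finished: return its KV blocks
        else:
            still_running.append(seq)

    while waiting and len(still_running) < max_batch:
        need = blocks_needed(waiting[0])
        if need > free_blocks:
            break                              # head of queue does not fit yet
        seq = waiting.popleft()
        seq["blocks"] = need
        seq["done"] = False
        free_blocks -= need
        still_running.append(seq)

    return still_running, free_blocks
```

The loop runs between every forward pass, which is what lets a short request join and leave the batch while a long one is still decoding.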
Trap: chasing tokens/sec by raising max_batch_size inflates per-step latency and blows the TPOT SLO — bigger batch = longer step = slower stream. Tune batch to the goodput knee, not the throughput max.
Deep dive B: Prefill vs decode — chunked prefill, then disaggregation
Prefill is compute-bound (whole prompt in parallel, saturates tensor cores); decode is bandwidth-bound (one token/seq, streams weights). Mixing them naively causes head-of-line blocking: a 100k-token prefill monopolizes a step and every decoding stream stalls — a TPOT spike for everyone.
Two fixes, in order of sophistication:
- Chunked prefill: split a long prefill into fixed token-budget chunks and piggyback ongoing decode tokens into each step. The 100k prompt is processed over many steps while decodes keep flowing, bounding TPOT. Single-cluster, low complexity.
- Prefill/decode (PD) disaggregation: run separate pools — compute-heavy prefill nodes and bandwidth-heavy decode nodes — and ship the KV cache from prefill to decode over the interconnect. Lets you scale and hardware-match each phase independently (prefill wants FLOPs, decode wants HBM bandwidth) and eliminates interference entirely. The cost is a KV transfer and a more complex topology.
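Chunked prefill comes down to a per-step token budget. In this sketch decodes are served first (one token each) and whatever budget remains goes to the oldest pending prefill. The 512-token budget and the list-based queue shape are assumptions.

```python
def plan_step(decode_seqs, prefill_queue, token_budget):
    """Plan one forward pass under a fixed token budget.

    decode_seqs: list of sequence ids currently decoding (one token each)
    prefill_queue: list of [seq_id, remaining_prompt_tokens], mutated in place
    Returns (decode ids scheduled, [(prefill id, chunk length), ...]).
    """
    decode_ids = decode_seqs[:token_budget]      # decodes first, to protect TPOT
    budget = token_budget - len(decode_ids)
    chunks = []
    for entry in prefill_queue:
        if budget <= 0:
            break
        take = min(entry[1], budget)
        chunks.append((entry[0], take))
        entry[1] -= take
        budget -= take
    # Fully prefilled prompts leave the queue; a real engine moves them to decode.
    prefill_queue[:] = [e for e in prefill_queue if e[1] > 0]
    return decode_ids, chunks

decodes = list(range(48))
queue = [["long-prompt", 100_000]]
first_decodes, first_chunks = plan_step(decodes, queue, token_budget=512)
print(first_chunks, queue)
```

With 48 active decodes and a 512-token budget, the 100k prompt advances 464 tokens per step and finishes in a couple of hundred steps, while every decoding stream still gets a token on each one.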
Deep dive C: Prefix caching + cache-affinity routing
System prompts and multi-turn histories are reused across requests; recomputing their KV every time is pure waste. A radix-trie / hash index maps prefix → cached KV blocks, with LRU eviction. On a hit, prefill skips the cached span entirely — 60–90% reuse on shared prefixes.
But the cache only pays off if requests with the same prefix land on the replica that holds it. The router must be cache-affinity-aware: route by prefix_hash to the replica with the matching KV, balanced against that replica’s load and predicted latency.
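One way to express that balance is a score per replica: the fraction of the request's prefix blocks already cached there, minus a load penalty. The scoring form, the `load_weight` knob, and the replica dictionary shape are all assumptions for illustration.

```python
def pick_replica(prefix_blocks, replicas, load_weight=0.5):
    """Choose a replica by cached-prefix coverage, penalized by load.

    prefix_blocks: ordered list of block hashes for the request's prompt
    replicas: dict name -> {"cached": set of block hashes, "load": 0.0..1.0}
    """
    def matched(cached):
        # Only a contiguous run from the start of the prompt is reusable.
        count = 0
        for block_hash in prefix_blocks:
            if block_hash not in cached:
                break
            count += 1
        return count

    def score(name):
        replica = replicas[name]
        hit = matched(replica["cached"]) / max(len(prefix_blocks), 1)
        return hit - load_weight * replica["load"]

    # sorted() makes tie-breaking deterministic.
    return max(sorted(replicas), key=score)

replicas = {
    "a": {"cached": {"h1", "h2", "h3"}, "load": 0.6},
    "b": {"cached": set(), "load": 0.1},
    "c": {"cached": {"h1"}, "load": 0.2},
}
print(pick_replica(["h1", "h2", "h3", "h4"], replicas))
```

Replica "a" wins here despite being the busiest, because it already holds three of the four prefix blocks. Raising `load_weight` shifts the choice toward emptier replicas once the hot one is saturated.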
Trap: plain round-robin or least-connections routing nukes the prefix-cache hit rate — the most common silent regression in these designs. You spent the HBM caching prefixes and then scattered requests so nothing hits.
Isolation: the prefix index is salted per tenant (hash(prefix, tenant_id)) so tenant A can never get a cache hit on tenant B’s data — a correctness and privacy requirement, not just hygiene.
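A sketch of tenant-salted prefix keys: each full block gets a hash chained from the previous block's hash, and the chain is seeded with the tenant ID, so identical tokens under different tenants never collide in the index. The block size of 16 and SHA-256 are my choices here, not requirements of the design.

```python
import hashlib

def prefix_block_hashes(tenant_id, token_ids, block_size=16):
    """Return one chained hash per full block of the prompt.

    Each hash covers the tenant and every token up to the end of that block,
    so equal hashes imply an equal prefix for the same tenant.
    """
    hashes = []
    parent = hashlib.sha256(b"tenant:" + tenant_id.encode("utf-8")).digest()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256(parent)
        h.update(b"".join(t.to_bytes(4, "little") for t in block))
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes

shared = list(range(32))                      # e.g. a system prompt
turn_a = prefix_block_hashes("tenant-a", shared + [900] * 16)
turn_b = prefix_block_hashes("tenant-a", shared + [901] * 16)
other = prefix_block_hashes("tenant-b", shared + [900] * 16)
print(turn_a[:2] == turn_b[:2], turn_a[2] == turn_b[2], set(turn_a) & set(other))
```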
Deep dive D: Speculative decoding — acceptance math + adaptive gating
Decode is bandwidth-bound, so each step’s weight read leaves compute idle, and that idle compute is the “free” capacity we’re wasting on one token. Speculative decoding spends it: a cheap draft (a small model, or EAGLE-2 / Medusa heads) proposes k tokens, and the target model verifies all k in a single parallel forward pass (parallel verify is compute, which we have to spare). Accepted tokens are kept; the first rejection truncates the draft, and the target supplies the token at that position. It’s branch prediction: predict ahead, verify, squash on mispredict — and crucially, under the standard rejection-sampling acceptance rule, the output distribution is provably identical to the target’s.
The economics hinge on acceptance rate α. Expected tokens per target forward ≈ (1 − α^(k+1)) / (1 − α). At ~70–80% acceptance this yields ~2–3x decode speedup (up to ~3.6x reported on H200-class hardware for friendly workloads).
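The formula is easy to tabulate. The sketch assumes each draft token is accepted independently with probability α. The speedup estimate additionally assumes a draft step costs a fixed fraction `draft_cost` of a target step, a hypothetical knob that ignores the extra verify FLOPs.

```python
def expected_tokens_per_forward(alpha, k):
    """Expected tokens emitted per target forward pass with k draft tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)       # limit of the geometric sum
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    # Each round pays one target pass plus k draft passes.
    return expected_tokens_per_forward(alpha, k) / (1.0 + k * draft_cost)

for alpha in (0.6, 0.7, 0.8):
    print(alpha,
          round(expected_tokens_per_forward(alpha, 4), 2),
          round(estimated_speedup(alpha, 4, draft_cost=0.05), 2))
```

With k=4 and a draft costing 5% of a target step, 70–80% acceptance lands in the 2–3x range quoted above. Dropping to 60% acceptance falls below 2x, which is why acceptance telemetry matters.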
Trap — it backfires at high batch. Rejected draft tokens still cost FLOPs in the verify pass. When the batch is already large, the GPU is no longer bandwidth-starved — those wasted verify-FLOPs become real cost and speculation can slow you down. So gate it adaptively: speculate aggressively at low batch / low load, scale k down or disable as batch size and acceptance-rate telemetry cross a threshold.
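The gate itself can be a small pure function evaluated each scheduling step. Every threshold below (`batch_cutoff=32`, `min_accept=0.5`, `max_k=4`) is a placeholder I made up. Real values would come from measuring where speculation stops paying on the actual hardware and workload.

```python
def choose_draft_len(batch_size, acceptance_rate,
                     max_k=4, batch_cutoff=32, min_accept=0.5):
    """Return how many draft tokens to propose this step (0 disables speculation)."""
    if batch_size >= batch_cutoff or acceptance_rate < min_accept:
        return 0
    # Scale k down linearly as the batch approaches the cutoff.
    headroom = 1.0 - batch_size / batch_cutoff
    return max(1, round(max_k * headroom))

for batch in (1, 16, 24, 32):
    print(batch, choose_draft_len(batch, acceptance_rate=0.75))
```

Feeding the function a rolling acceptance rate rather than a lifetime average lets it back off quickly when traffic shifts to prompts the draft model predicts poorly.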
Touch: TP vs PP, FP8, multi-LoRA
- TP vs PP: tensor parallelism shards each layer (all-reduce per layer, latency-sensitive, needs NVLink) — best within a node for latency. Pipeline parallelism splits layers across stages (adds bubble, throughput-oriented) — for crossing nodes when the model won't fit. For a 70B on H100, TP=2/4 within a node is the answer; reach for PP only at larger models.
- FP8 weights + KV: ~2x VRAM savings and ~1.5–2x throughput at typically under 1% quality loss — but always behind a per-model eval gate. Trap: quantizing without an eval silently regresses reasoning/long-context tasks even when perplexity barely moves.
- Multi-LoRA: serve many fine-tunes on one base model by swapping rank-decomposition adapters; a rank-64 LoRA loads in ~28–45ms over PCIe, batched together via segmented kernels (e.g. SGMV) so different adapters coexist in one batch.
Multi-team rollout
Model / version rollout with quality gates
Every new model version, quantization recipe, or speculative-decode config goes through canary + shadow traffic before promotion. Promotion gate: greedy exact-match against the FP16 baseline plus benchmark deltas (reasoning, long-context, tool-use) within tolerance. Latency wins never ship without the eval gate — that’s the whole point of pairing each optimization with a measurement.
Cold start and warm pool
The bottleneck is loading multi-GB weights — tens of seconds to pull 70–140 GB into HBM. Mitigations:
- A warm pool of pre-loaded replicas absorbs traffic spikes without paying cold start.
- A snapshot loader (memory-map / fast checkpoint format) cuts the load time.
- LoRA adapters swap in ~28–45ms, so adapter churn is cheap relative to base-model loads.
Dashboards
Overload playbook
When admission can’t meet deadlines:
- Admission shedding — reject or queue lowest-priority (batch) first.
- Max-token clamps — cap runaway generations to protect the KV pool.
- Priority preemption — swap a low-priority sequence's KV to host RAM, run the interactive request, restore later.
- Drain without dropping — when retiring a replica, stop admitting but let in-flight streams complete so no client sees a mid-stream cut.
Bottlenecks & evolution
The binding constraint: HBM
Everything traces back to HBM capacity and bandwidth. Mitigations, escalating:
- GQA (already assumed) and FP8 KV shrink the cache.
- Tiered / host-RAM KV offload (LMCache-style) for long context — spill cold blocks to CPU memory, page back on demand. KV-as-virtual-memory pays off again.
- KV compression and sequence/context parallelism for very long contexts.
How the design evolves
- Prefill stalls → chunked prefill → full PD disaggregation as scale grows.
- Long context → context/sequence parallelism + KV compression.
- Agentic and MoE traffic shift the bottleneck: MoE turns it into an expert-routing and load-balancing problem; agent loops create KV reuse across tool calls worth caching aggressively.
- Hardware moves the wall: H200/B200 bring more bandwidth and FP4, changing every roofline-driven choice — re-derive, don't assume.
The closing insight
There is no single optimal config. The right TP degree, batch size, speculation depth, and quantization are all functions of the live workload mix — prompt/output length distribution, prefix-reuse rate, tenant SLO mix. The engine must continuously re-tune to goodput as that mix shifts; a config tuned for chat will be wrong for a RAG flood an hour later.
Summary
1. The engine is a goodput-maximizing token factory bounded by HBM. Not QPS, not FLOPs — HBM capacity (which sets concurrency via the KV cache) and HBM bandwidth (which sets decode throughput) are the walls. Optimize the fraction of requests meeting both SLOs per GPU-dollar, not raw tokens/sec.
2. Prefill is compute-bound; decode is bandwidth-bound. This one fact explains continuous batching (amortize the weight read across many streams), chunked prefill / PD disaggregation (stop the two phases interfering), and speculative decoding (spend the compute that bandwidth-bound decode leaves idle on parallel verify). Internalizing it is the switcher’s main jump.
3. The non-negotiables: continuous batching over a paged KV cache; prefill/decode-aware SLO scheduling; prefix-cache reuse with cache-affinity routing (not round-robin); the smallest TP that fits; FP8 weights + KV behind an eval gate; and optional adaptive speculative decoding that backs off at high batch.
4. Treat the KV cache as virtual memory. Block tables are page tables, eviction is paging, host offload is swap, prefix sharing is copy-on-write. Treat the scheduler as a real-time deadline/queueing system. These two framings convert most of the problem into distributed-systems work an SDE already owns.
5. TTFT and TPOT are two SLOs that trade through batch size. Bigger batches raise throughput and per-step latency together, so chasing throughput silently fails TPOT. The scheduler’s job is to sit at the goodput knee, not the throughput max.
6. The split: the SDE owns the streaming control plane (SSE gateway with KV-freeing cancellation, KV-aware router, token-metered limits, warm-pool autoscaling, deadline scheduler). The MLE owns tokens/sec/dollar and quality (batching scheduler, paged KV, kernels, quantization recipe, TP plan, speculation) and pairs every latency win with an eval gate. The switcher maps known instincts — scheduling, caching, routing, autoscaling — onto four new ideas: prefill-vs-decode, KV-as-constraint, TTFT-vs-TPOT, and goodput. And the optimal config is workload-dependent — it must be continuously re-tuned.
Rubric — Senior vs Staff