AI System Design · Staff · LLM Inference · GPU Serving

Design an LLM Inference Serving Engine

A Staff-level walkthrough of the LLM inference serving engine that OpenAI, Anthropic, and Google DeepMind run behind their chat APIs — a goodput-maximizing token factory bounded by HBM. It pins down continuous batching over a paged KV cache, prefill/decode-aware SLO scheduling, prefix-cache reuse with cache-affinity routing, FP8 weights+KV behind eval gates, and adaptive speculative decoding, optimizing TTFT/TPOT goodput per GPU-dollar rather than raw tokens/sec.

Level: Staff
Category: AI Infrastructure · LLM Serving
Interview time: 60 min
WHAT THIS QUESTION TESTS
·Continuous (in-flight) batching over a PagedAttention KV cache, not static batching
·Prefill/decode-aware SLO scheduling: chunked prefill to bound TPOT, then PD disaggregation
·Prefix-cache reuse with cache-affinity routing for multi-turn and shared system prompts
·TTFT/TPOT as separate streaming SLOs; goodput, not raw tokens/sec, as the headline metric
★ STAFF-LEVEL SIGNALS
·Sizes the KV-cache budget from HBM capacity and derives max concurrency from it, rather than guessing
·Gates FP8 weights+KV and speculative decoding behind per-model eval checks, with speculation adaptive to batch occupancy
·Treats the KV cache as virtual memory: paged blocks, copy-on-write, host-memory spill under preemption
·Meters rate limits in tokens (not RPS) and runs an overload playbook: admission shedding, max-token clamps, priority preemption
0

Scope & ambiguity

Let me pin down the boundary first. I’m building the GPU-resident serving engine behind /v1/chat/completions — the thing that takes a tokenized request and turns weights plus a KV cache into a stream of tokens under latency SLOs. I’m not building the model (pretraining, fine-tuning), the retrieval layer that fills the prompt, or the agent loop above it; those are upstream and I treat them as clients. The one tension that governs every decision: HBM is the scarce resource, not QPS or FLOPs, and the objective is goodput — requests served within their latency budget — per GPU-dollar, not raw tokens/sec. This is the engine OpenAI, Anthropic, and Google DeepMind each run behind their chat APIs, and it’s a system that’s ~70% distributed-systems you already know wrapped around a few GPU-internal facts you have to get right.

This is an AI-infrastructure design problem framed the way frontier labs (OpenAI, Anthropic, Google DeepMind, NVIDIA) build and interview on it. It is not a leaked or insider question — it is the standard shape of “serve a large decoder model behind a streaming API under SLOs,” and the value is in reasoning correctly about GPU mechanics, not in trivia.

Who asks this & what they probe

Role | Focus | What they probe
SDE | Streaming control plane | HTTP+SSE gateway, cancellation that frees KV on disconnect, KV-aware router, token-metered rate limits, autoscaling with warm pool, scheduler as a deadline/queue system. Gap: why decode is bandwidth-bound and batch trades throughput vs TPOT.
MLE | Tokens/sec/dollar and quality | Batching scheduler, paged KV cache, attention kernels, quantization recipe, TP plan, speculative decoding. Derives throughput from HBM roofline, sizes KV from layers×heads×head_dim, pairs every quant/spec choice with an eval gate. Optimizes MFU/goodput.
Switcher (SDE to AI) | Mapping known systems to new vocab | Whether prefill-vs-decode (compute- vs bandwidth-bound) and KV-as-the-constraint are internalized. Maps continuous batching to iteration-level scheduling, paged KV to a page table, prefix caching to locality routing. New terms: TTFT vs TPOT, goodput.

Anchor scenario

To keep numbers concrete I’ll anchor on a realistic deployment:

  • A dense 70B-class decoder (Llama-3.1-70B shape: 80 layers, GQA with 8 KV heads, head_dim 128) plus a few 7–8B fine-tunes, served tensor-parallel TP=2 or TP=4 on H100-80GB.
  • Traffic is multi-tenant and streaming: interactive chat, RAG (long prompts), and a throughput-first batch lane, all on the same fleet.

Out of scope

Pretraining and fine-tuning the model; the retrieval/RAG pipeline that produces the prompt; agent orchestration and tool execution above the API; offline eval harnesses (they’re a dependency we gate on, not a thing we build here).

Phased scope

Phase 1 is a single-model, single-region engine with continuous batching over a paged KV cache and SSE streaming. Phase 2 adds prefix caching with cache-affinity routing, FP8 weights and KV, and multi-LoRA. Phase 3 adds speculative decoding, prefill/decode disaggregation, and multi-region. The headline metric from minute one is goodput: the fraction of requests that meet both their TTFT and TPOT SLOs — a slow stream that technically completes is a miss.

1

Requirements

Functional requirements

  • SSE streaming of token deltas (stream=true) with incremental detokenization.
  • Chat templating and system prompts; function/tool calling; JSON-schema structured output (constrained decoding); logprobs.
  • Cancellation: client disconnect must free the sequence's KV blocks immediately, not at end-of-generation.
  • Per-tenant token quotas metered on input + output tokens.
  • Multi-tenant request mixing on shared replicas with isolation.

Non-functional SLOs by traffic class

The two latency metrics are TTFT (time-to-first-token, dominated by prefill) and TPOT (time-per-output-token, the inter-token gap during decode). They trade against each other through batch size, so they need separate budgets per class.

Class | TTFT p99 | TPOT p99 | Notes
Interactive chat | 500 ms–1 s | 15–50 ms | 20–60 tok/s/stream; the SLO-critical lane
RAG | 1–3 s | 30–50 ms | Long prompts, TTFT-sensitive (prefill heavy)
Batch | seconds OK | lax | Throughput-first; backfills idle GPU

Cross-cutting

  • Multi-tenant isolation: one tenant's long-context flood must not starve another's TTFT.
  • Quality parity: any FP8 / speculative-decode config must match the FP16 baseline within a defined tolerance (greedy exact-match + benchmark deltas), gated before promotion.
  • Cost: a target $/Mtok the well-utilized fleet must hit.

The objective: goodput

Goodput is the objective function. Raw tokens/sec is a trap: you can maximize it by jamming the batch full, which inflates TPOT and silently fails the interactive SLO. Goodput counts only requests (and so only tokens) delivered inside both SLO budgets, so it directly captures the throughput-vs-latency tradeoff the scheduler exists to manage.
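
To make the definition concrete, here is a minimal sketch of goodput computed as the fraction of requests that met both budgets. The `RequestRecord` fields and the sample numbers are illustrative assumptions, not a schema from any particular engine.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    ttft_ms: float      # observed time to first token
    tpot_ms: float      # observed inter-token gap for this stream
    ttft_slo_ms: float  # budget for this request's traffic class
    tpot_slo_ms: float

def goodput(records: list[RequestRecord]) -> float:
    """Fraction of requests that met BOTH their TTFT and TPOT budgets."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if r.ttft_ms <= r.ttft_slo_ms and r.tpot_ms <= r.tpot_slo_ms
    )
    return ok / len(records)

records = [
    RequestRecord(400, 30, 1000, 50),   # meets both budgets
    RequestRecord(400, 80, 1000, 50),   # completes, but streams too slowly: a miss
    RequestRecord(1500, 30, 1000, 50),  # fast stream, late first token: a miss
    RequestRecord(900, 45, 1000, 50),   # meets both budgets
]
print(goodput(records))  # 0.5
```

All four requests complete, so a raw-throughput dashboard would look healthy, yet only half of them count toward goodput.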

2

Back-of-envelope estimation

Weights and tensor parallelism

Precision | 70B weights | Fits H100-80GB?
FP16 | ~140 GB | No; needs TP=2 minimum
FP8 | ~70 GB | TP=2 comfortable; TP=4 spacious

Rule: pick the smallest TP that fits with headroom for KV. TP adds all-reduces in every layer, so over-sharding burns interconnect bandwidth and hurts decode latency. FP8 weights at TP=2 take ~35 GB/GPU, which leaves ~35 GB/GPU for KV once activations and runtime overhead are set aside.

KV cache per token (the real constraint)

KV/token = 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes. GQA (8 KV heads, not 64 query heads) is what makes 70B servable — it shrinks KV 8x versus multi-head.

Config | KV per token | Per 8k-token request
70B GQA, FP16 KV | ~0.31 MB | ~2.5 GB
70B GQA, FP8 KV | ~0.16 MB | ~1.25 GB

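With ~35 GB free at TP=2, that is ~100k cached tokens/GPU at FP16 KV and ~210k at FP8 KV: on the order of a hundred thousand tokens, i.e. roughly 13 to 26 concurrent 8k-token streams. (This conservatively charges the full per-token KV to one GPU; TP also shards the KV heads, so the TP=2 group as a whole holds about twice that.) KV capacity, not compute, sets your concurrency ceiling.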
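
The sizing above can be reproduced with a few lines of arithmetic. This is a sketch under the section's own assumptions (80 layers, 8 KV heads, head_dim 128, ~35 GB free per GPU); the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = one K and one V

def max_cached_tokens(free_hbm_bytes: float, per_token_bytes: int) -> int:
    """How many tokens of KV fit in the HBM left over after weights."""
    return int(free_hbm_bytes // per_token_bytes)

FREE_HBM = 35e9        # ~35 GB per GPU, the figure used above
REQUEST_TOKENS = 8192  # one "8k-token request"

for name, dtype_bytes in [("FP16 KV", 2), ("FP8 KV", 1)]:
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=dtype_bytes)
    tokens = max_cached_tokens(FREE_HBM, per_token)
    print(f"{name}: {per_token / 2**20:.2f} MiB/token, "
          f"{tokens:,} tokens, {tokens // REQUEST_TOKENS} concurrent 8k requests")
```

Swapping in a different layer count, KV-head count, or free-HBM figure shows how quickly the concurrency ceiling moves, which is the point of deriving it rather than guessing.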

Roofline: why decode is bandwidth-bound

H100 spec | Value
BF16 compute | ~989 TFLOP/s
HBM3 bandwidth / capacity | ~3.35 TB/s / 80 GB

Decode generates one token per sequence per step, so arithmetic intensity is tiny: the GPU must stream the entire weight set from HBM to produce that token and is starved for data, not FLOPs. A single stream is therefore weight-bandwidth-bound: 35 GB of FP8 weights/GPU at TP=2 ÷ 3.35 TB/s is ~10 ms per step, a ceiling of roughly 95 tok/s before KV reads, all-reduce, and kernel overheads, so a lone stream lands at tens of tok/s in practice. Continuous batching is the escape hatch: many sequences share one weight read per step, so aggregate decode lands at ~1.5–3k output tok/s/GPU for 70B TP=2 (~20–40 tok/s/stream). Prefill is the opposite: it processes the whole prompt in parallel, is compute-bound, and saturates the tensor cores.
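
A rough way to sanity-check these numbers is a bandwidth-only floor on decode step time: assume each step reads all weights plus the batch's KV exactly once and ignore everything else. The batch size and context length below are hypothetical, and the model deliberately leaves out all-reduce, kernel launch, and sampling overheads, so it gives upper bounds on throughput rather than predictions.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes_read: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every byte of weights and KV is read once."""
    return (weight_bytes + kv_bytes_read) / hbm_bytes_per_s * 1e3

HBM_BW = 3.35e12        # H100 HBM3 bandwidth, bytes/s
WEIGHTS = 35e9          # FP8 70B at TP=2, bytes per GPU
KV_PER_TOKEN = 163_840  # FP8 KV bytes/token from the formula above

# One lone stream with a short context: the weight read dominates.
lone_ms = decode_step_floor_ms(WEIGHTS, 0, HBM_BW)
print(f"lone stream: {lone_ms:.1f} ms/step, <= {1000 / lone_ms:.0f} tok/s")

# Hypothetical batch: 32 streams with 4k tokens of context each.
batch, context = 32, 4096
batched_ms = decode_step_floor_ms(WEIGHTS, batch * context * KV_PER_TOKEN, HBM_BW)
print(f"batch of {batch}: {batched_ms:.1f} ms/step, "
      f"<= {batch * 1000 / batched_ms:.0f} tok/s aggregate")
```

The batched step is only modestly slower than the lone one because the weight read is shared, which is why aggregate throughput climbs with batch size until KV reads and compute start to dominate.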

Throughput multipliers and cost

  • Continuous batching + PagedAttention: ~2–4x throughput vs static batching, mostly by eliminating padding and fragmentation.
  • Prefix caching: 60–90% reuse on shared system prompts and multi-turn history — recompute you skip entirely.
  • A well-utilized 70B FP8 fleet lands at low single-digit $/Mtok.

Staff note: size the long tail, not the mean

KV pressure and tail latency are driven by the long tail of context and output length, not the average. A p50 of 1k tokens with a p99 of 100k means a handful of requests can consume the KV pool and stall everyone. Size capacity, admission control, and per-request max-token clamps to the tail.

3

API design

Public surface (OpenAI-compatible)

POST /v1/chat/completions
{
  "model": "chat-70b",
  "messages": [{"role":"system",...},{"role":"user",...}],
  "stream": true,
  "max_tokens": 1024,
  "temperature": 0.7, "top_p": 0.95, "seed": 42,
  "stop": ["</done>"],
  "logprobs": true, "top_logprobs": 5,
  "tools": [...], "tool_choice": "auto",
  "response_format": {"type":"json_schema","json_schema":{...}}
}

# stream=true -> text/event-stream (SSE)
data: {"choices":[{"delta":{"content":"Hel"}}]}
data: {"choices":[{"delta":{"content":"lo"}}]}
data: {"choices":[{"finish_reason":"stop"}]}
data: [DONE]
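
On the client side, consuming that stream amounts to reading `data:` lines until the `[DONE]` sentinel. The sketch below parses the exact event shapes shown above from an in-memory list of lines; a real client would read them off an HTTP response, and `iter_deltas` is an illustrative helper name, not part of any SDK.

```python
import json
from typing import Iterable, Iterator

def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: {"choices":[{"finish_reason":"stop"}]}',
    'data: [DONE]',
]
print("".join(iter_deltas(stream)))  # Hello
```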

Internal gateway → router → engine RPC

EngineRequest {
  request_id: string        // trace + dedup key
  tenant_id: string
  token_ids: [int]          // pre-tokenized at the gateway
  sampling: {temperature, top_p, stop_ids, max_tokens,
             logprobs, seed, json_schema_fsm?}
  adapter_id: string?       // LoRA selection
  priority: enum {interactive, rag, batch}
  deadline_ms: int          // derived from class SLO
  prefix_hash: int64        // for cache-affinity routing
}
stream EngineEvent { token_id, logprob?, finish_reason? }

API governance

  • Cancellation: client disconnect (or explicit cancel) propagates to the engine and frees KV blocks immediately — the single biggest source of wasted HBM if missed.
  • Idempotency keys + request IDs threaded through gateway, router, and engine for tracing and safe retries.
  • Rate limiting is token-metered, not RPS: the quota contract counts input + output tokens per tenant per window. One 100k-token request costs far more than a hundred 100-token requests; RPS limits would price it wrong.
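
A token-metered limit can be sketched as an ordinary token bucket whose unit is model tokens rather than requests. The version below reserves the worst case (input plus `max_tokens`) at admission and refunds what was never generated; the class name, the reserve-then-refund policy, and the injectable clock are assumptions made for illustration.

```python
import time

class TokenQuota:
    """Per-tenant bucket metered in model tokens, not requests."""

    def __init__(self, tokens_per_window: int, window_s: float, now=time.monotonic):
        self.capacity = tokens_per_window
        self.refill_per_s = tokens_per_window / window_s
        self.level = float(tokens_per_window)
        self.now = now
        self.last = now()

    def _refill(self) -> None:
        t = self.now()
        self.level = min(self.capacity, self.level + (t - self.last) * self.refill_per_s)
        self.last = t

    def try_admit(self, input_tokens: int, max_tokens: int) -> bool:
        """Reserve the worst case (input + max_tokens) up front."""
        self._refill()
        cost = input_tokens + max_tokens
        if cost > self.level:
            return False
        self.level -= cost
        return True

    def refund(self, unused_output_tokens: int) -> None:
        """Return tokens that were reserved but never generated."""
        self.level = min(self.capacity, self.level + unused_output_tokens)

clock = [0.0]  # fake clock so the example is deterministic
quota = TokenQuota(tokens_per_window=100_000, window_s=60, now=lambda: clock[0])
print(quota.try_admit(90_000, 1_000))  # one huge request is admitted
print(quota.try_admit(10_000, 1_000))  # the next one is refused
```

Under an RPS limit both requests would have cost the same; here the 90k-token prompt consumes most of the window by itself.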

4

Data model

The engine is deeply stateful — the KV cache is the per-sequence state. The discipline is separating what’s ephemeral (recomputable, evictable, spillable) from what’s durable.

Per-sequence state

Sequence {
  token_ids: [int]            // prompt + generated so far
  sampling_params, deadline, adapter_id, priority
  block_table: [block_id]     // KV pages, like a page table's PTEs
  num_computed_tokens: int    // prefill progress (for chunking)
  position: int
  status: {waiting, running, swapped, finished}
}

Global paged KV pool

KVBlockPool (per GPU) {
  blocks: contiguous HBM, fixed-size pages (e.g. 16 tokens)
  free_list: [block_id]
  ref_count[block_id]: int    // copy-on-write for shared prefixes
  allocator: alloc()/free()   // O(1), no compaction needed
}

Prefix-cache index

PrefixCache {
  radix_trie / hash: prefix_token_hash -> [block_id]
  lru: eviction order
  tenant_salt: hash(prefix, tenant_id)   // isolation: no cross-tenant cache hits
}

Ephemeral vs durable

State | Class | Policy
KV blocks | Ephemeral | Evict (LRU), recompute, or swap to host RAM
Prefix cache | Ephemeral | LRU, ref-counted, tenant-salted
Model weights | Durable | Loaded once, pinned in HBM
LoRA adapters | Durable | Registry; hot-swapped on demand
Model/adapter registry | Durable | Versioned, source of truth

The key insight: treat the KV cache as virtual memory. Block tables are page tables, eviction is paging, host-RAM offload is swap. That framing makes the whole memory subsystem fall out of familiar OS mechanics.
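
The page-table framing can be sketched as a small ref-counted allocator. This is a CPU-only toy that tracks block IDs and never touches GPU memory; the `share`, `writable`, `fork`, and `release` names are hypothetical and do not mirror any specific engine's API.

```python
class KVBlockPool:
    """Fixed-size KV pages with a free list and per-block ref counts."""

    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def alloc(self) -> int:
        if not self.free_list:
            raise MemoryError("KV pool exhausted: preempt or shed load")
        block = self.free_list.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """A second sequence starts referencing an existing block."""
        self.ref_count[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_list.append(block)

    def writable(self, block: int) -> int:
        """Copy-on-write: return a block this sequence may safely append to."""
        if self.ref_count[block] == 1:
            return block
        new_block = self.alloc()
        # A real engine would copy the block's KV contents on the GPU here.
        self.ref_count[block] -= 1
        return new_block

def fork(pool: KVBlockPool, block_table: list[int]) -> list[int]:
    """Share every page of a parent sequence (prefix hit, parallel sampling)."""
    return [pool.share(b) for b in block_table]

def release(pool: KVBlockPool, block_table: list[int]) -> None:
    """Completion or client disconnect: drop this sequence's references."""
    for b in block_table:
        pool.free(b)

pool = KVBlockPool(num_blocks=4)
parent = [pool.alloc(), pool.alloc()]
child = fork(pool, parent)                # no new pages consumed
child[-1] = pool.writable(child[-1])      # first write triggers the copy
release(pool, child)
release(pool, parent)
print(len(pool.free_list))                # all 4 pages are back
```

Calling `release` on disconnect is the cancellation path from the API section: shared pages survive as long as any other sequence or the prefix cache still holds a reference.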

5

High-level architecture

Data path

client
  │  HTTP + SSE
Gateway ── auth, token rate-limit, tokenize, chat-template
  │
Router ── KV-cache-aware + predicted-latency selection
  │  (route by prefix_hash to the replica that has it)
Engine replica (one per TP shard group)
  ├─ Scheduler ── admission, prefill/decode mix, chunked
  │               prefill, preemption, deadline ordering
  ├─ Executor ── forward pass over TP/PP shards
  ├─ Sampler ── temperature/top-p, logprobs, JSON-schema FSM
  └─ Detokenizer ── incremental, streamed back as SSE deltas

Control plane

  • Model/adapter registry — versioned weights and LoRA adapters, source of truth.
  • Autoscaler — scales on queue depth + KV utilization (not CPU/RPS), keeps a warm pool to hide multi-GB cold starts.
  • Weight/adapter loader — snapshot loader; swaps LoRA over PCIe.
  • Observability — goodput, TTFT/TPOT, MFU, KV occupancy, cost/Mtok.

Lifecycle trace: warm-prefix multi-turn chat

1. Turn 3 of a chat arrives; gateway tokenizes and computes prefix_hash over system prompt + prior turns.

2. Router sends it to the replica that already holds that prefix’s KV blocks (cache-affinity routing).

3. Scheduler admits it; the shared prefix is a cache hit — those blocks are reused via ref-counted copy-on-write, so prefill only computes the new user turn, not the whole history.

4. Decode steps join the continuous batch; each token is sampled, detokenized, and streamed as an SSE delta.

5. On completion (or disconnect), the sequence’s non-shared blocks return to the free list; shared prefix blocks stay cached for the next turn.

The win: skipping prefill recompute on the shared prefix is often the difference between a 1.5s and a 300ms TTFT on long multi-turn chats.

6

Deep dives

WHERE STAFF IS WON

I’ll go deep on four mechanisms and touch on TP-vs-PP, FP8, and multi-LoRA.

Deep dive A: Continuous batching + PagedAttention

Static batching waits for the whole batch to finish the slowest sequence before starting new work — terrible utilization when output lengths vary by 100x. Continuous batching is iteration-level scheduling: at every decode step the scheduler can evict finished sequences and admit waiting ones into the running batch. The analogy is work-stealing over a shared run queue — the GPU never idles waiting on a straggler.

The enabler is PagedAttention. Naive KV caches reserve a contiguous max-length buffer per sequence, wasting 60–80% to internal fragmentation and reservation for tokens never generated. PagedAttention stores KV in fixed-size non-contiguous blocks addressed through a per-sequence block table — exactly OS paging. Benefits:

  • Near-100% KV utilization: no fragmentation, allocate a page only when needed.
  • Copy-on-write sharing: identical prefixes (system prompts, multi-turn history, beam-search branches) share physical blocks via ref counts; fork is cheap.
  • Makes preemption clean — swap a sequence's pages to host RAM and restore later.

Scheduler loop, per step:

loop forever:
    free_finished()                              # reclaim KV pages
    # per-step token budget, with one slot reserved per running decode
    budget = max_batch_tokens - len(running)
    # admit prefills (chunked) under the remaining budget, deadline-ordered
    for req in waiting.by_deadline():
        if budget <= 0: break
        if can_allocate_kv(req):
            budget -= schedule_chunk(req, budget)    # returns tokens scheduled
    # piggyback decode for all running seqs
    for seq in running: schedule_decode(seq)
    if kv_pressure_high(): preempt_lowest_priority() # swap/recompute
    out = executor.forward(batch)                # one fused forward pass
    sampler.sample(out); detok.stream(out)

Trap: chasing tokens/sec by raising max_batch_size inflates per-step latency and blows the TPOT SLO — bigger batch = longer step = slower stream. Tune batch to the goodput knee, not the throughput max.

Deep dive B: Prefill vs decode — chunked prefill, then disaggregation

Prefill is compute-bound (whole prompt in parallel, saturates tensor cores); decode is bandwidth-bound (one token/seq, streams weights). Mixing them naively causes head-of-line blocking: a 100k-token prefill monopolizes a step and every decoding stream stalls — a TPOT spike for everyone.

Two fixes, in order of sophistication:

  • Chunked prefill: split a long prefill into fixed token-budget chunks and piggyback ongoing decode tokens into each step. The 100k prompt is processed over many steps while decodes keep flowing, bounding TPOT. Single-cluster, low complexity.
  • Prefill/decode (PD) disaggregation: run separate pools — compute-heavy prefill nodes and bandwidth-heavy decode nodes — and ship the KV cache from prefill to decode over the interconnect. Lets you scale and hardware-match each phase independently (prefill wants FLOPs, decode wants HBM bandwidth) and eliminates interference entirely. The cost is a KV transfer and a more complex topology.

Aspect | Chunked prefill | PD disaggregation
Interference | Bounded | Eliminated
KV transfer | None | Prefill→decode hop
Scaling | Coupled | Independent pools
Complexity | Low | High
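
The chunked-prefill half of that comparison can be sketched as a per-step token planner: decodes claim one token each, and whatever budget is left goes to prefill chunks. The function names and the 2048-token budget are assumptions for illustration, not a specific engine's scheduler.

```python
def plan_step(num_decoding: int, prefill_remaining: list[int], max_batch_tokens: int):
    """Return (decode_tokens, prefill_chunks) for one forward pass.

    prefill_remaining holds the uncomputed prompt tokens of each waiting
    request, already sorted by deadline.
    """
    decode_tokens = min(num_decoding, max_batch_tokens)  # one token per stream
    budget = max_batch_tokens - decode_tokens
    chunks = []
    for remaining in prefill_remaining:
        if budget == 0:
            break
        take = min(remaining, budget)  # never exceed the step's token budget
        chunks.append(take)
        budget -= take
    return decode_tokens, chunks

def steps_to_prefill(prompt_tokens: int, num_decoding: int, max_batch_tokens: int) -> int:
    """Steps a long prompt needs when it shares every step with the decodes."""
    per_step = max_batch_tokens - num_decoding
    if per_step <= 0:
        raise ValueError("no budget left for prefill")
    return -(-prompt_tokens // per_step)  # ceiling division

print(plan_step(64, [100_000, 500], 2048))   # (64, [1984])
print(steps_to_prefill(100_000, 64, 2048))   # 51
```

The 100k-token prompt is spread across 51 steps, and in every one of them the 64 decoding streams still emit a token, which is what bounds TPOT.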

Deep dive C: Prefix caching + cache-affinity routing

System prompts and multi-turn histories are reused across requests; recomputing their KV every time is pure waste. A radix-trie / hash index maps prefix → cached KV blocks, with LRU eviction. On a hit, prefill skips the cached span entirely — 60–90% reuse on shared prefixes.

But the cache only pays off if requests with the same prefix land on the replica that holds it. The router must be cache-affinity-aware: route by prefix_hash to the replica with the matching KV, balanced against that replica’s load and predicted latency.

Trap: plain round-robin or least-connections routing nukes the prefix-cache hit rate — the most common silent regression in these designs. You spent the HBM caching prefixes and then scattered requests so nothing hits.
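
One way to balance affinity against load is to score each replica by the work a request would add there: its queued tokens plus the prompt tokens it would actually have to prefill. The sketch below uses that single cost as a stand-in for predicted latency; the `Replica` fields and the scoring rule are assumptions, not a production router.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queued_tokens: int                        # backlog already scheduled
    cached_prefixes: set[int] = field(default_factory=set)

def pick_replica(replicas: list[Replica], prefix_hash: int,
                 prompt_tokens: int, cached_tokens: int) -> Replica:
    """Choose the replica with the least backlog plus uncached prefill work."""
    def cost(r: Replica) -> int:
        hit = cached_tokens if prefix_hash in r.cached_prefixes else 0
        return r.queued_tokens + (prompt_tokens - hit)
    return min(replicas, key=cost)

warm = Replica("warm", queued_tokens=2_000, cached_prefixes={0xABC})
cold = Replica("cold", queued_tokens=500)

# 8k-token prompt whose first 7k tokens are a cached prefix on "warm".
print(pick_replica([warm, cold], 0xABC, 8_000, 7_000).name)  # warm

# If the warm replica is badly backed up, load wins over affinity.
warm.queued_tokens = 20_000
print(pick_replica([warm, cold], 0xABC, 8_000, 7_000).name)  # cold
```

Round-robin is the special case where the cache term is ignored, which is why it quietly destroys the hit rate.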

Isolation: the prefix index is salted per tenant (hash(prefix, tenant_id)) so tenant A can never get a cache hit on tenant B’s data — a correctness and privacy requirement, not just hygiene.
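
A sketch of how tenant salting can work at block granularity: each full block's hash chains from its parent's hash, and the chain is seeded with the tenant ID, so identical tokens under different tenants never produce the same key. SHA-256, the 16-token block size, and the helper names are assumptions for illustration.

```python
import hashlib

BLOCK_TOKENS = 16  # page size used in the KVBlockPool sketch above

def block_hashes(tenant_id: str, token_ids: list[int],
                 block_tokens: int = BLOCK_TOKENS) -> list[str]:
    """Chained hash per full block; only full blocks are cacheable."""
    parent = hashlib.sha256(tenant_id.encode()).digest()  # tenant salt seeds the chain
    full = len(token_ids) - len(token_ids) % block_tokens
    hashes = []
    for start in range(0, full, block_tokens):
        block = token_ids[start:start + block_tokens]
        h = hashlib.sha256(parent + b"|" + ",".join(map(str, block)).encode())
        parent = h.digest()  # the next block's key depends on the whole prefix
        hashes.append(h.hexdigest())
    return hashes

def shared_blocks(a: list[str], b: list[str]) -> int:
    """Number of leading blocks two prompts can share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

prompt_a = list(range(48))
prompt_b = list(range(32)) + [999] * 16   # same first 32 tokens, then diverges

same_tenant = shared_blocks(block_hashes("t1", prompt_a), block_hashes("t1", prompt_b))
cross_tenant = shared_blocks(block_hashes("t1", prompt_a), block_hashes("t2", prompt_a))
print(same_tenant, cross_tenant)  # 2 0
```

Chaining the hashes means a block only matches when everything before it matches too, which is the property a radix-trie index provides structurally.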

Deep dive D: Speculative decoding — acceptance math + adaptive gating

Decode is bandwidth-bound, so each step pays for a full weight read while the tensor cores sit mostly idle, and all of that buys just one token. Speculative decoding spends the idle compute: a cheap draft (a small model, or EAGLE-2 / Medusa heads) proposes k tokens, and the target model verifies all k in a single parallel forward pass. Accepted tokens are kept; the first rejection truncates the rest. It’s branch prediction: predict ahead, verify, squash on mispredict. Crucially, with the standard rejection-sampling acceptance rule the output distribution is provably identical to the target’s.

The economics hinge on acceptance rate α. Expected tokens per target forward ≈ (1 − α^(k+1)) / (1 − α). At ~70–80% acceptance this yields ~2–3x decode speedup (up to ~3.6x reported on H200-class hardware for friendly workloads).
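
To see how the formula behaves, the sketch below evaluates it for a few acceptance rates and adds a simple speedup estimate. The speedup model assumes each draft token costs a fixed fraction `draft_cost` of one target step and that verification costs the same as a normal decode step, which only holds while decode is bandwidth-bound; both the parameter and its 0.05 value are illustrative assumptions.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens.

    alpha is the per-token acceptance probability, treated as independent
    across positions.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit of target-step time, relative to plain decoding."""
    return expected_tokens_per_verify(alpha, k) / (1 + k * draft_cost)

for alpha in (0.5, 0.7, 0.8):
    tokens = expected_tokens_per_verify(alpha, k=4)
    speedup = estimated_speedup(alpha, k=4, draft_cost=0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/verify, ~{speedup:.2f}x")
```

With k=4 the estimate stays under 2x at 50% acceptance and approaches 3x at 80%, consistent with the range quoted above.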

Trap — it backfires at high batch. Rejected draft tokens still cost FLOPs in the verify pass. When the batch is already large, the GPU is no longer bandwidth-starved — those wasted verify-FLOPs become real cost and speculation can slow you down. So gate it adaptively: speculate aggressively at low batch / low load, scale k down or disable as batch size and acceptance-rate telemetry cross a threshold.
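
The adaptive gate can be as simple as a function from live batch size and measured acceptance rate to a draft length. The thresholds below (`batch_cutoff`, `min_acceptance`, `max_k`) are hypothetical placeholders that would have to be tuned per model and hardware.

```python
import math

def choose_k(batch_size: int, acceptance: float, *, max_k: int = 5,
             batch_cutoff: int = 32, min_acceptance: float = 0.5) -> int:
    """Pick the draft length for the next step; 0 disables speculation."""
    if batch_size >= batch_cutoff or acceptance < min_acceptance:
        return 0  # GPU is no longer bandwidth-starved, or drafts are mostly rejected
    headroom = 1 - batch_size / batch_cutoff  # 1.0 when idle, near 0 at the cutoff
    return max(1, math.ceil(max_k * headroom))

print(choose_k(batch_size=1, acceptance=0.8))   # 5: speculate aggressively
print(choose_k(batch_size=16, acceptance=0.8))  # 3: back off as the batch fills
print(choose_k(batch_size=32, acceptance=0.9))  # 0: disabled at high batch
print(choose_k(batch_size=4, acceptance=0.3))   # 0: acceptance too low to pay off
```

A linear ramp is only one possible policy; the point is that k is recomputed from telemetry every step instead of being a static config value.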

Touch: TP vs PP, FP8, multi-LoRA

  • TP vs PP: tensor parallelism shards each layer (all-reduce per layer, latency-sensitive, needs NVLink) — best within a node for latency. Pipeline parallelism splits layers across stages (adds bubble, throughput-oriented) — for crossing nodes when the model won't fit. For a 70B on H100, TP=2/4 within a node is the answer; reach for PP only at larger models.
  • FP8 weights + KV: ~2x VRAM savings and ~1.5–2x throughput at typically under 1% quality loss — but always behind a per-model eval gate. Trap: quantizing without an eval silently regresses reasoning/long-context tasks even when perplexity barely moves.
  • Multi-LoRA: serve many fine-tunes on one base model by swapping rank-decomposition adapters; a rank-64 LoRA loads in ~28–45ms over PCIe, batched together via segmented kernels (e.g. SGMV) so different adapters coexist in one batch.

7

Multi-team rollout

Model / version rollout with quality gates

Every new model version, quantization recipe, or speculative-decode config goes through canary + shadow traffic before promotion. Promotion gate: greedy exact-match against the FP16 baseline plus benchmark deltas (reasoning, long-context, tool-use) within tolerance. Latency wins never ship without the eval gate; that’s the whole point of pairing each optimization with a measurement.

Cold start and warm pool

The bottleneck is loading multi-GB weights: it takes tens of seconds to pull 70–140 GB into HBM. Mitigations:

  • A warm pool of pre-loaded replicas absorbs traffic spikes without paying cold start.
  • A snapshot loader (memory-map / fast checkpoint format) cuts the load time.
  • LoRA adapters swap in ~28–45ms, so adapter churn is cheap relative to base-model loads.

Dashboards

Metric | Healthy target
Goodput | Primary SLO metric, per class
TTFT / TPOT p99 | Per class, per tenant
MFU | ~40–50% is strong for prefill; decode is bandwidth-bound and runs far lower, so watch HBM bandwidth utilization there
KV occupancy | High but with headroom for tail
Cost/Mtok | Per tenant, trending to target

Overload playbook

When admission can’t meet deadlines:

  • Admission shedding: reject or queue the lowest-priority (batch) traffic first.
  • Max-token clamps: cap runaway generations to protect the KV pool.
  • Priority preemption: swap a low-priority sequence's KV to host RAM, run the interactive request, restore later.
  • Drain without dropping: when retiring a replica, stop admitting but let in-flight streams complete so no client sees a mid-stream cut.
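
The first two plays can be sketched as a single admission check keyed on KV-pool utilization. The per-class thresholds and the clamp value are hypothetical numbers chosen for the example; preemption and draining live in the scheduler and are not modeled here.

```python
from enum import IntEnum

class Priority(IntEnum):
    BATCH = 0
    RAG = 1
    INTERACTIVE = 2

# Hypothetical thresholds on KV-pool utilization (0.0 to 1.0); tune per fleet.
SHED_AT = {Priority.BATCH: 0.80, Priority.RAG: 0.90, Priority.INTERACTIVE: 0.97}
CLAMP_AT = 0.75
CLAMP_MAX_TOKENS = 512

def admit(priority: Priority, kv_utilization: float, max_tokens: int) -> tuple[bool, int]:
    """Return (admitted, effective_max_tokens) for one incoming request."""
    if kv_utilization >= SHED_AT[priority]:
        return False, 0  # shed: the lowest priority class trips first
    if kv_utilization >= CLAMP_AT:
        max_tokens = min(max_tokens, CLAMP_MAX_TOKENS)  # protect the KV pool
    return True, max_tokens

print(admit(Priority.BATCH, 0.85, 4096))        # (False, 0)
print(admit(Priority.INTERACTIVE, 0.85, 4096))  # (True, 512)
print(admit(Priority.RAG, 0.50, 4096))          # (True, 4096)
```

At 85% occupancy the batch lane is already being shed while interactive traffic is still admitted with a shorter output cap, which is the ordering the playbook asks for.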

8

Bottlenecks & evolution

The binding constraint: HBM

Everything traces back to HBM capacity and bandwidth. Mitigations, escalating:

  • GQA (already assumed) and FP8 KV shrink the cache.
  • Tiered / host-RAM KV offload (LMCache-style) for long context — spill cold blocks to CPU memory, page back on demand. KV-as-virtual-memory pays off again.
  • KV compression and sequence/context parallelism for very long contexts.

How the design evolves

  • Prefill stalls → chunked prefill → full PD disaggregation as scale grows.
  • Long context → context/sequence parallelism + KV compression.
  • Agentic and MoE traffic shift the bottleneck: MoE turns it into an expert-routing and load-balancing problem; agent loops create KV reuse across tool calls worth caching aggressively.
  • Hardware moves the wall: H200/B200 bring more bandwidth and FP4, changing every roofline-driven choice — re-derive, don't assume.

The closing insight

There is no single optimal config. The right TP degree, batch size, speculation depth, and quantization are all functions of the live workload mix — prompt/output length distribution, prefix-reuse rate, tenant SLO mix. The engine must continuously re-tune to goodput as that mix shifts; a config tuned for chat will be wrong for a RAG flood an hour later.

Summary

1. The engine is a goodput-maximizing token factory bounded by HBM. Not QPS, not FLOPs: HBM capacity (which sets concurrency via the KV cache) and HBM bandwidth (which sets decode throughput) are the walls. Optimize the fraction of requests meeting both SLOs per GPU-dollar, not raw tokens/sec.

2. Prefill is compute-bound; decode is bandwidth-bound. This one fact explains continuous batching (amortize the weight read across many streams), chunked prefill / PD disaggregation (stop the two phases interfering), and speculative decoding (spend the compute that bandwidth-bound decode leaves idle on parallel verify). Internalizing it is the switcher’s main jump.

3. The non-negotiables: continuous batching over a paged KV cache; prefill/decode-aware SLO scheduling; prefix-cache reuse with cache-affinity routing (not round-robin); the smallest TP that fits; FP8 weights + KV behind an eval gate; and optional adaptive speculative decoding that backs off at high batch.

4. Treat the KV cache as virtual memory. Block tables are page tables, eviction is paging, host offload is swap, prefix sharing is copy-on-write. Treat the scheduler as a real-time deadline/queueing system. These two framings convert most of the problem into distributed-systems work an SDE already owns.

5. TTFT and TPOT are two SLOs that trade through batch size. Bigger batches raise throughput and per-step latency together, so chasing throughput silently fails TPOT. The scheduler’s job is to sit at the goodput knee, not the throughput max.

6. The split: the SDE owns the streaming control plane (SSE gateway with KV-freeing cancellation, KV-aware router, token-metered limits, warm-pool autoscaling, deadline scheduler). The MLE owns tokens/sec/dollar and quality (batching scheduler, paged KV, kernels, quantization recipe, TP plan, speculation) and pairs every latency win with an eval gate. The switcher maps known instincts (scheduling, caching, routing, autoscaling) onto four new ideas: prefill-vs-decode, KV-as-constraint, TTFT-vs-TPOT, and goodput. And the optimal config is workload-dependent, so it must be continuously re-tuned.

Rubric — Senior vs Staff

Dimension | Senior signal | Staff signal
Batching & scheduling | Batches requests and names dynamic batching. | Designs iteration-level continuous batching, caps batch size to hold TPOT, and runs deadline-aware admission control for goodput.
KV cache management | Knows the KV cache holds attention state and grows with context. | Treats it as a page table: PagedAttention blocks, copy-on-write for parallel sampling, LRU eviction, host-memory spill under preemption.
Prefill vs decode | Mentions a long prompt is slower than generation. | Separates compute-bound prefill from bandwidth-bound decode; uses chunked prefill then PD disaggregation with RDMA/NVLink KV hand-off.
SLO objective | Tracks p99 latency. | Splits TTFT vs TPOT, optimizes goodput (fraction meeting both), and sheds/down-prioritizes requests that cannot meet SLO.
Prefix caching & routing | Caches responses. | Hashes prefixes to KV blocks (RadixAttention), routes by cache affinity + predicted latency, and salts caches per tenant for isolation.
Quantization & parallelism | Mentions FP16 and multi-GPU. | Defaults FP8 weights+KV behind eval gates, INT4 for memory-starved tiers, smallest TP that fits intra-node, replicas for QPS.
Speculative decoding | Knows it speeds up generation. | Uses EAGLE/Medusa vs draft model, reasons acceptance-rate math, and gates it adaptively because it hurts at high batch occupancy.