AI System Design · Staff · GPU Scheduling · Multi-Tenancy

Design a Fractional / Multi-Tenant GPU Sharing Platform

A GPU control plane is what companies like NVIDIA (Run:ai), CoreWeave, Lambda, and Together AI build and interview on: take a fleet of expensive accelerators that typically sit at ~15-20% real utilization, and drive it up by spatially partitioning GPUs and packing many tenants onto them — without one job's OOM or runaway kernel starving its neighbor. The Staff bar is reasoning about the gang-scheduling-vs-fragmentation tension, when MIG's hardware isolation is worth its throughput tax versus cheap-but-leaky MPS, and how to preempt a 500-GPU training run for a latency-critical inference burst without losing days of work.

Level: Staff
Category: AI System Design
Interview time: 60 min

WHAT THIS QUESTION TESTS

·Picks the right partitioning primitive per workload: MIG for hard multi-tenant isolation, MPS for trusted co-located inference, time-slicing only for dev/bursty — and knows the throughput tax of each

·Designs gang / all-or-nothing scheduling for distributed jobs with minMember/PodGroup semantics and explicit deadlock + fragmentation avoidance

·Handles preemption of training for latency-critical inference: priority classes, checkpoint-then-evict, bounded blast radius, and quota borrowing that's reclaimable

·Makes placement topology-aware: keeps a TP group inside one NVLink island (~900 GB/s per GPU) rather than spanning nodes over InfiniBand (~50 GB/s per GPU with one 400G NDR NIC per GPU)

★ STAFF-LEVEL SIGNALS

★Quantifies the utilization gap (perceived ~60% vs real ~14-20%) and ties every design choice back to driving allocatable utilization up without hurting p99 or MFU

★Names the noisy-neighbor failure modes precisely — MPS lacks fault isolation so one client's fault/OOM can crash co-tenants; time-slicing isolates neither memory nor faults — and isolates accordingly

★Treats MIG reconfiguration as a first-class cost: profiles are set at the GPU level, draining + re-partitioning is disruptive, so geometry churn is a scheduling decision not a free knob

★Defines the right SLI: allocatable-vs-used and SM_ACTIVE/MFU, not the misleading nvidia-smi 'GPU util %' that reads 100% for a single tiny kernel

Frame — the utilization crisis & who's asking

“A $30k/yr GPU at 15% utilization is ~$25k/yr of waste — the whole job is to safely raise that number without breaking inference p99.”

GPU sharing is a control-plane problem dressed up as a hardware problem. The fleet is the most expensive line item in the company, and most of it is idle most of the time. The job is to drive real utilization up by spatially partitioning GPUs and packing many tenants onto them — while guaranteeing that one tenant’s OOM, runaway kernel, or 3 a.m. training crash never starves the inference replica sitting next to it.

The numbers that motivate the whole design. Most GPU clusters run at roughly 14–20% average real utilization while the teams who own them perceive something closer to 60%. Industry surveys (e.g. the State of AI Infra reports) find that ~83% of organizations admit they underutilize their AI hardware. The gap between perceived and real is the prize: closing it is worth millions per year on a large fleet.
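As a back-of-the-envelope check on these numbers, idle cost scales linearly with the utilization gap. The sketch below reuses the $30k/yr per-GPU figure from the quote above; the 4,000-GPU fleet size and the 60% target are illustrative assumptions, not measurements.

```python
def idle_cost_per_year(cost_per_gpu: float, utilization: float, gpus: int = 1) -> float:
    """Annual spend on GPU time that does no useful work."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return cost_per_gpu * (1.0 - utilization) * gpus


# Per-GPU waste at 15% real utilization (figure from the text).
per_gpu = idle_cost_per_year(30_000, 0.15)

# Hypothetical fleet: 4,000 GPUs moving from 15% to 60% real utilization.
recovered = idle_cost_per_year(30_000, 0.15, 4_000) - idle_cost_per_year(30_000, 0.60, 4_000)

print(f"idle cost per GPU: ${per_gpu:,.0f}/yr")
print(f"recovered on the fleet: ${recovered:,.0f}/yr")
```

The per-GPU result lines up with the ~$25k/yr in the quote; the fleet figure is only as good as the assumed price and target.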

Who asks this & what they probe

| Role | What they own here | What the interviewer probes |
| --- | --- | --- |
| SDE | The control plane: quota accountant, gang scheduler, bin-packer, preemptor | Correctness under contention: no partial gangs, no quota leaks, no deadlock, bounded preemption blast radius |
| MLE | How job shapes map to hardware | A 7B replica wants a 1g.10gb slice; TP-8 wants one NVLink island; the FLOPS cost of fractioning; SLO classes |
| Switcher (SDE → AI) | What transfers from Borg/K8s vs the new GPU physics | Bin-packing/quotas transfer; MIG geometry, gang semantics, NVLink topology, checkpoint-preempt are new |

Lead with what transfers, then go deep on the GPU physics. A switcher should say: “I know container scheduling, hierarchical quota, and bin-packing from Borg. The new surface is that a GPU is not a bag of CPU millicores — it has MIG slice geometry (7 compute + 8 memory units, not arbitrary fractions), gang/all-or-nothing semantics, NVLink topology that matters an order of magnitude for placement, and preempt-with-checkpoint instead of just killing a pod.”

SLO classes drive every policy

Define three classes up front, because they determine priority and preemption:

1. Latency-critical inference — has a p99 SLO, never preemptible, may cause preemption.

2. Best-effort / interactive — notebooks, dev, evals; tolerates queuing.

3. Preemptible batch training — long-running, checkpointable, first to be preempted.

The central tension

Four forces pull against each other, and naming them up front is the Staff signal:

Utilization wants tight packing and aggressive GPU sharing.
Isolation wants tenants kept apart (no noisy neighbor).
Gang feasibility wants large contiguous, topology-aligned blocks free at once.
SLO wants the ability to preempt fast for an inference burst.

Every design knob below moves these four. There is no globally correct setting — only the right setting for a given tenant-trust and workload mix.

Scope. This is the control plane + scheduler. It is not the serving engine (batching, KV-cache, speculative decoding — that’s a model-serving design) and not a generic CPU/memory container scheduler (that’s a Borg/task-scheduler design). We assume containers, networking, and storage exist; we own which accelerator each job lands on and when.

Requirements & scale envelope

Functional requirements

Submit jobs of arbitrary GPU shape: a fraction of a GPU, 1, 8, or 256 GPUs.
Per-team quota with borrowing of idle capacity above quota.
Gang admission: all-or-nothing for distributed jobs.
Preempt-for-inference: reclaim training capacity for latency-critical bursts.
Topology constraints: keep a tensor-parallel group on one NVLink island.
Observability: surface real utilization and fragmentation per tenant.

Non-functional requirements

Scheduling decision latency: inference admission p99 under ~1–2s; batch may queue for minutes.
Fairness across tenants; no single tenant monopolizes borrowed idle capacity.
No partial gangs; bounded preemption blast radius.
HA control plane: a scheduler outage must not kill running jobs.

Scale envelope

| Dimension | Value (worked example) |
| --- | --- |
| Cluster size | ~4,000 GPUs = 500 × 8-GPU H100 SXM nodes |
| Tenants | Hundreds of teams/projects |
| Job mix | Thousands of small inference replicas + tens of large training gangs |
| Largest gang | 256–512 GPUs spanning many nodes over InfiniBand |
| Smallest job | 1 MIG slice (a fraction of one GPU) |
| Inference admit latency | p99 under ~1–2s |

Job-shape examples

| Workload | Shape | Hardware ask |
| --- | --- | --- |
| 7B inference replica | Fits in one slice | 1g.10gb or 2g.20gb MIG on H100 |
| 70B inference | Tensor-parallel | TP-4/8 on one NVLink node |
| Pre-training gang | 64–512 GPUs | Many nodes, rail-aligned IB |
| Dev / eval notebook | Bursty, low duty cycle | Time-sliced share |

Fractioning is not free. Splitting a GPU into 7 MIG slices yields 7 isolated instances, but each gets only a fraction of the SMs and a fraction of HBM bandwidth. For bandwidth-bound or large-kernel workloads, aggregate throughput across the 7 slices is less than one undivided GPU. Fractioning buys isolation and packing density at the price of a throughput tax, so quantify the tax per workload rather than assuming sharing is always a win.

Explicitly out of scope: the inference batching / KV-cache engine, model-weights storage, and the data pipeline.

Estimation — partitioning primitives: MIG vs MPS vs time-slicing

This is the first irreversible choice, so we estimate the cost of each primitive before picking. There are exactly three ways to put more than one job on a single physical GPU.

| Primitive | Mechanism | Granularity | Memory isolation | Fault isolation | Throughput tax | Noisy neighbor |
| --- | --- | --- | --- | --- | --- | --- |
| MIG | HW-partitioned SMs + memory | Fixed slice profiles | Strong (hardware) | Strong (hardware) | Per-slice fraction of SM/BW | Near-zero |
| MPS | Spatial, shared context | Per-process % | Address-space sep. only | None | Low (no ctx-switch) | Real: shared fault |
| Time-slice | Round-robin temporal | Whole GPU in turns | None | None | Context-switch overhead | Severe |

MIG (Multi-Instance GPU) hardware-partitions one GPU into independent instances, each with dedicated SMs, L2 slices, and memory channels. On an A100-40GB the budget is 7 compute slices + 8 memory slices, exposed as profiles 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb. H100-80GB is analogous (1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb). Because the partition is in silicon, an OOM or a runaway kernel in one instance cannot touch another: this is the strongest isolation, with near-zero noisy-neighbor effect.

MPS (Multi-Process Service) lets multiple processes submit kernels to one GPU concurrently with no context-switch overhead — great for packing many small kernels that each underfill the GPU. On Volta+ each client gets its own GPU address space (some memory protection), but there is no fault isolation: one client’s fatal fault or runaway can crash co-tenants sharing the MPS daemon. Use it only for trusted, same-team inference.

Time-slicing round-robins the whole GPU between processes in time. It isolates neither memory nor faults and is pure oversubscription. It’s fine for dev/CI where bursty, low-duty-cycle jobs rarely collide — and dangerous for anything else.

When to use which

MIG → hard multi-tenant isolation: different teams sharing one GPU, or any untrusted co-tenancy.
MPS → trusted, co-located inference that needs many concurrent small kernels (same team, same trust domain).
Time-slicing → dev / CI / interactive only.

Staff nuance — MIG geometry is a scheduling cost, not a free knob. A MIG layout is set at the GPU level, and changing it requires draining every instance on that GPU and re-partitioning — disruptive to anything running. So “re-geometry this GPU from 7×1g to 1×7g” is a scheduled operation with a cost, not something the bin-packer does per job. And not all geometries are valid: you can’t freely mix arbitrary slice sizes; the scheduler must pick from the supported profile layouts for that GPU.
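A minimal sketch of the budget half of that validity check, using the A100-40GB profiles listed earlier. The per-profile (compute, memory) slice counts are an assumption derived from the profile names (memory slices are 5 GB each), and the check is necessary but not sufficient: real hardware also restricts where each profile may be placed, which this sketch ignores.

```python
# (compute slices, memory slices) per A100-40GB profile; assumed from profile names.
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_BUDGET, MEMORY_BUDGET = 7, 8


def fits_budget(layout: list[str]) -> bool:
    """True if the layout stays within the 7-compute / 8-memory slice budget."""
    compute = sum(A100_40GB_PROFILES[p][0] for p in layout)
    memory = sum(A100_40GB_PROFILES[p][1] for p in layout)
    return compute <= COMPUTE_BUDGET and memory <= MEMORY_BUDGET


print(fits_budget(["1g.5gb"] * 7))                     # seven small slices
print(fits_budget(["3g.20gb", "3g.20gb", "1g.5gb"]))   # compute fits, memory does not
```

The second layout shows why "7 slices" is not a bag of interchangeable fractions: it uses only 7 compute slices but would need 9 memory slices.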

API design — scheduler architecture & quota/borrowing

The request contract

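A job declares its shape and class; the control plane decides where and when. With Dynamic Resource Allocation (DRA), GA in Kubernetes 1.34, a job can request structured resources instead of an opaque integer GPU count. The GpuJob manifest below is an illustrative custom resource for this design, not an upstream Kubernetes API: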

    apiVersion: scheduling/v1
    kind: GpuJob
    metadata: { name: llama70b-tp8, namespace: team-search }
    spec:
      sloClass: latency-critical-inference   # never preemptible
      priorityClass: inference-high
      podGroup:
        minMember: 8                         # gang: all-or-nothing
      resourceClaim:
        gpus: 8
        constraints:
          sameNVLinkClique: true             # TP must stay on one island
          minMemoryPerGpuGB: 80
        partition: full                      # not a MIG slice
      preemptionPolicy: { preemptible: false }

A small inference replica instead asks for a slice. The replica itself is never preemptible, but any capacity it holds over its quota (borrowed from idle fleet) is reclaimable from it:

    resourceClaim:
      gpus: 1
      constraints: { migProfile: "1g.10gb" }
    preemptionPolicy: { preemptible: false }   # job not preemptible;
                                               # borrowed capacity is reclaimable

Build vs buy

The mainstream 2026 answer is Kubernetes-native: a gang scheduler (Volcano, Kueue, or NVIDIA’s KAI Scheduler) plus DRA for structured GPU claims. A hyperscaler with extreme scale may run a Borg-like custom scheduler instead. We’ll describe the K8s+DRA shape; the concepts port to custom.

Components

    Job API / CRD
    Admission + Quota controller   <-- hierarchical quota, borrowing
    Scheduler (gang plug-in + topology-aware bin-packer + preemptor)
    DRA driver / device plugin     <-- structured claims -> devices
    Node agent                     <-- configures MIG, runs MPS daemon, binds slices to pods

Hierarchical quota & borrowing

Quota is a tree: org → team → project. Each node has a guaranteed quota (always honored) plus an over-quota borrowing weight (share of idle fleet capacity it may temporarily use).

| Concept | Rule |
| --- | --- |
| Guaranteed quota | Always schedulable for the owner; never preempted away |
| Borrowing | Use idle capacity above quota when the fleet is slack |
| Reclaim | Borrowed capacity is preempted first when the owner returns |
| Fair-share | Time-based fairshare prevents one tenant hoarding idle capacity |

The key invariant: borrowed capacity is reclaimable. When the rightful owner submits, the scheduler preempts the borrower — so borrowing is always safe to grant and never a liability.
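The reclaim rule can be sketched as a small accounting function. This is a simplified illustration, not any scheduler's actual algorithm: `TeamUsage`, `reclaim_plan`, and the "largest borrower gives back first" ordering are assumptions made for the example, and it treats the fleet as fully allocated so that every GPU the owner needs must come from a borrower.

```python
from dataclasses import dataclass


@dataclass
class TeamUsage:
    name: str
    guaranteed: int  # GPUs always honored for this team
    in_use: int      # GPUs the team currently holds

    @property
    def borrowed(self) -> int:
        """GPUs held above the guarantee; these are the reclaimable ones."""
        return max(0, self.in_use - self.guaranteed)


def reclaim_plan(teams: list[TeamUsage], owner: str, needed: int) -> dict[str, int]:
    """GPUs to take back from each borrower so `owner` can use its guarantee."""
    plan: dict[str, int] = {}
    remaining = needed
    # Assumed policy: the largest borrowers give capacity back first.
    for team in sorted(teams, key=lambda t: t.borrowed, reverse=True):
        if remaining == 0:
            break
        if team.name == owner or team.borrowed == 0:
            continue
        take = min(team.borrowed, remaining)  # never cuts into a guarantee
        plan[team.name] = take
        remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough borrowed capacity to reclaim")
    return plan


teams = [
    TeamUsage("search", guaranteed=8, in_use=2),
    TeamUsage("ads", guaranteed=4, in_use=10),
    TeamUsage("research", guaranteed=4, in_use=6),
]
plan = reclaim_plan(teams, owner="search", needed=6)
print(plan)
```

Because `take` is capped at each team's borrowed amount, no team ever drops below its guaranteed quota, which is what makes granting the borrow safe in the first place.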

Bin-packing objective

Minimize GPU fragmentation (don't strand leftover slices that no valid MIG profile or pending job can use) while honoring topology. These two goals conflict: the tightest pack may scatter a gang across nodes, and the most topology-clean placement may strand capacity. State the tradeoff explicitly; it's a tunable objective, not a fixed rule.

DRA is the enabler here: a job can say “4 H100 in the same NVLink clique, ≥80GB HBM each” as a structured claim rather than “4 GPUs,” so the scheduler reasons about topology natively.

Data model — gang scheduling for distributed jobs

Why all-or-nothing

A tensor-parallel or data-parallel training step needs every rank running simultaneously — rank 3 can’t start the all-reduce until ranks 0–N are alive. If the scheduler admits 200 of 256 pods, those 200 GPUs sit idle and reserved, burning money while their pods spin waiting for the missing 56. Gang scheduling prevents exactly this: nothing starts until the whole group can.

PodGroup / minMember

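Volcano and KAI model this with a PodGroup carrying minMember; Kueue reaches similar all-or-nothing admission through its own Workload objects rather than a PodGroup. The Volcano form: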

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: PodGroup
    metadata: { name: pretrain-256 }
    spec:
      minMember: 256          # none start until 256 slots reservable
      minResources:           # reserve to avoid driver/worker deadlock
        nvidia.com/gpu: 256
      queue: team-foundation

minMember: 256 means the scheduler reserves 256 slots atomically; if it can’t, zero pods bind and the gang waits as a unit.

Failure modes to design against

Driver/worker deadlock. A classic Spark-style trap: a coordinator pod grabs a GPU and waits for workers it can never schedule because it is holding the last slot. Fix: reserve minResources for the whole group so partial acquisition can't happen, plus a timeout to release a gang stuck reserving.
Fragmentation starvation. Many small jobs scatter across nodes and leave no contiguous, topology-aligned block for a 256-GPU gang — so large gangs starve indefinitely even though total free GPUs exceed 256. Mitigate with reservation + backfill and periodic defragmentation.
Backfill. While accumulating slots for a big gang, run short, evictable best-effort jobs in the gaps so the reserved-but-not-yet-full capacity still does useful work — then evict them the instant the gang can land.

minMember alone is necessary but not sufficient: the 256 must land on the right NVLink/IB topology, which is the next step.
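The all-or-nothing rule itself is small enough to sketch. The function below is an illustration only (`try_admit_gang` and its arguments are hypothetical names, and it ignores topology, quota, and MIG entirely): it either returns a placement for every member or returns None and reserves nothing.

```python
from typing import Optional


def try_admit_gang(
    free_gpus: dict[str, int], min_member: int, gpus_per_pod: int
) -> Optional[dict[str, int]]:
    """Return pods-per-node if the whole gang fits, else None (nothing reserved)."""
    placement: dict[str, int] = {}
    remaining = min_member
    # Use the nodes with the most free GPUs first to keep the gang on few nodes.
    for node, free in sorted(free_gpus.items(), key=lambda kv: kv[1], reverse=True):
        pods_here = min(free // gpus_per_pod, remaining)
        if pods_here > 0:
            placement[node] = pods_here
            remaining -= pods_here
        if remaining == 0:
            return placement
    return None  # partial fit: bind zero pods, the gang waits as a unit


free = {"n1": 8, "n2": 8, "n3": 4}
print(try_admit_gang(free, min_member=2, gpus_per_pod=8))  # fits on n1 + n2
print(try_admit_gang(free, min_member=3, gpus_per_pod=8))  # 20 GPUs free, but no third 8-GPU node
```

The second call is the fragmentation-starvation case in miniature: total free capacity is not the same thing as a schedulable gang.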

High-level architecture — topology-aware placement & NVLink islands

The bandwidth hierarchy

Placement matters because the interconnect is wildly non-uniform:

| Link | Bandwidth (per GPU) | Scope |
| --- | --- | --- |
| H100 NVLink-4 via NVSwitch | ~900 GB/s bidirectional | Within an 8-GPU node (non-blocking) |
| InfiniBand NDR 400G | ~50 GB/s per 400G NIC (one NIC/GPU) | Across nodes |

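On these headline figures that is an ~18x gap per GPU; compared like for like (900 GB/s is a bidirectional total, while 50 GB/s is one direction of a 400 Gb/s link) it is closer to 9x. Either way the gap is roughly an order of magnitude, so where a job's GPUs sit changes its throughput dramatically.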

Placement rules

Tensor-parallel (TP) groups must stay inside one NVLink island (one node). TP exchanges activations every layer — it is the most communication-heavy parallelism. Spanning a TP group across nodes drops it from ~900 GB/s onto ~50 GB/s per-GPU IB and tanks step throughput.
Data-parallel (DP) replicas can span nodes. DP syncs gradients once per step (less frequent), so the cross-node penalty is tolerable.
For a 70B TP-8 job: pin all 8 ranks to one node's NVSwitch domain. For a DP-over-TP job: each TP-8 group on its own node, DP across nodes over IB.
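To put a rough number on the placement rules above, here is a bandwidth-only estimate of a ring all-reduce, in which each rank sends about 2(N-1)/N of the payload. It is a sketch under stated assumptions: 450 GB/s is taken as the per-direction half of the ~900 GB/s NVLink figure, 50 GB/s as one direction of a 400G NIC, and latency, overlap with compute, and the collective library's actual algorithm choice are all ignored.

```python
def ring_allreduce_seconds(payload_gb: float, ranks: int, gb_per_s: float) -> float:
    """Bandwidth-only estimate: each rank sends 2*(N-1)/N of the payload."""
    if ranks < 2:
        return 0.0
    return 2 * (ranks - 1) / ranks * payload_gb / gb_per_s


# 1 GB all-reduce across a TP-8 group (payload size is an illustrative assumption).
nvlink = ring_allreduce_seconds(1.0, 8, 450.0)  # all 8 ranks on one NVLink island
ib = ring_allreduce_seconds(1.0, 8, 50.0)       # same group forced across nodes

print(f"NVLink: {nvlink * 1e3:.2f} ms, IB: {ib * 1e3:.2f} ms, slowdown: {ib / nvlink:.1f}x")
```

TP pays this cost on every layer, while DP pays it once per step, which is why the same slowdown is fatal for one and tolerable for the other.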

How the scheduler knows the topology

DRA + GPU Feature Discovery labels nodes with nvidia.com/gpu.clique (NVLink Domain + Clique ID). The scheduler uses node affinity / structured claims to keep a gang within one clique. For DP gangs that must cross nodes, use rail-optimized placement: keep each cross-node all-reduce on the same IB rail / leaf switch so traffic avoids oversubscribed spine hops.

The tradeoff to name

Strict topology constraints reduce the set of valid placements, which can lower utilization and raise queue time — sometimes a gang waits for the right island while the wrong islands sit idle. Packing density vs communication efficiency is a real dial, and the right setting depends on how comm-bound the workload is.

MIG ≠ NVLink. MIG instances do not participate in NVLink P2P the way full GPUs do. So MIG is for independent small jobs, never for a TP group that needs NVLink — a TP group needs whole GPUs on one island.

Deep dive — preemption, isolation & noisy-neighbor defense

This is where Staff is won. Two hard problems live here: preempting a giant training run for an inference burst without losing days of work, and choosing the isolation tier that contains the right blast radius per tenant.

Preempt training for inference — checkpoint-then-evict

When a latency-critical inference burst arrives and the fleet is full of preemptible training, the naive move — kill a training pod — throws away hours or days of progress. The correct flow:

    inference burst (priority=high) needs N GPUs
      1. select victims   -> prefer borrowers, then lowest-priority,
                             smallest-sufficient training gang
      2. signal training  -> "checkpoint now" (cooperative)
      3. wait for checkpoint (bounded grace period)
      4. evict gang, free GPUs
      5. schedule inference
      ...
      6. later: training resumes FROM CHECKPOINT when capacity returns

The difference between checkpoint-then-evict and kill is the difference between losing minutes (the time since the last checkpoint) and losing days. A grace period bounds how long we wait for the checkpoint before forcing eviction.
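Steps 2 to 4 of the flow can be sketched as a bounded wait. The callables are hypothetical hooks standing in for whatever the platform uses to signal the job, observe checkpoint completion, and evict pods; polling is an assumption made to keep the example self-contained.

```python
import time
from typing import Callable


def checkpoint_then_evict(
    request_checkpoint: Callable[[], None],
    checkpoint_done: Callable[[], bool],
    evict: Callable[[], None],
    grace_seconds: float,
    poll_seconds: float = 0.01,
) -> bool:
    """Evict the gang; return True if a checkpoint completed before eviction."""
    request_checkpoint()  # step 2: cooperative "checkpoint now" signal
    deadline = time.monotonic() + grace_seconds
    saved = checkpoint_done()
    while not saved and time.monotonic() < deadline:  # step 3: bounded wait
        time.sleep(poll_seconds)
        saved = checkpoint_done()
    evict()  # step 4: evict either way once saved or out of grace
    return saved


events: list[str] = []
polls = {"count": 0}


def fake_done() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3  # checkpoint "finishes" on the third poll


ok = checkpoint_then_evict(
    lambda: events.append("signal"), fake_done, lambda: events.append("evict"), grace_seconds=1.0
)
print(ok, events)
```

A False return means the grace period expired first, so the job will resume from its previous periodic checkpoint rather than a fresh one.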

Bound the blast radius

The cardinal sin is evicting a whole 256-GPU training gang to free one MIG slice for one inference pod. Rules:

Smallest-sufficient victim. Free exactly the capacity needed, from the cheapest-to-reclaim source.
Reclaim borrowers first. Capacity borrowed over-quota is the first to go — that's the contract that made borrowing safe.
Gang-granularity preemption only when necessary. Because a gang is all-or-nothing, preempting any member kills the whole gang — so only preempt a gang when no smaller victim suffices, and pick the smallest / lowest-priority gang.
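The three rules above amount to a sort order over candidate victims. The sketch below is one possible encoding, not a reference policy: `Candidate` and `pick_victim` are hypothetical names, a lower `priority` number means "evict sooner", and for simplicity it picks a single victim rather than combining several small ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    gpus: int       # GPUs freed if this whole job or gang is evicted
    priority: int   # lower number = evicted sooner (assumed convention)
    borrowed: bool  # running on over-quota capacity


def pick_victim(candidates: list[Candidate], needed: int) -> Optional[Candidate]:
    """Borrowers first, then lowest priority, then the smallest sufficient job."""
    sufficient = [c for c in candidates if c.gpus >= needed]
    if not sufficient:
        return None
    return min(sufficient, key=lambda c: (not c.borrowed, c.priority, c.gpus))


candidates = [
    Candidate("pretrain-256", gpus=256, priority=1, borrowed=False),
    Candidate("finetune-8", gpus=8, priority=1, borrowed=False),
    Candidate("eval-sweep", gpus=1, priority=2, borrowed=True),
]
print(pick_victim(candidates, needed=1).name)  # the borrower goes first
print(pick_victim(candidates, needed=4).name)  # smallest gang that frees enough
```

Neither call touches the 256-GPU gang, which is the blast-radius bound in practice.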

Preemption is a policy threshold, not a reflex

Preemption cost (lost training progress, restart/warmup time) must be weighed against the inference SLO gain. If the inference burst can be served from idle or borrowed capacity, don’t preempt training at all. “Always preempt for higher priority” is the Senior answer; “preempt when the SLO gain exceeds the progress-loss cost, and prefer non-disruptive sources first” is the Staff answer.

Noisy-neighbor failure modes per primitive

The isolation tier you chose when picking a partitioning primitive determines what a misbehaving tenant can do to its neighbors:

| Primitive | OOM blast radius | Runaway kernel | Use when |
| --- | --- | --- | --- |
| Time-slicing | Crashes neighbors (shared memory) | Starves neighbors | Single trusted tenant only |
| MPS | Address-space separated, but a fatal fault can crash co-clients | Can hog SMs | Trusted, same-team inference |
| MIG | Contained to the slice (HW) | Contained to the slice | Untrusted / multi-tenant |

Time-slicing isolates neither memory nor faults: one client's OOM can crash everyone sharing the GPU. It is safe only for a single tenant.
MPS gives memory address-space separation but no fault isolation: one client's fatal fault can take down co-clients on the same MPS daemon. Enforce per-client memory limits and accept the shared-fault risk only within one trust domain.
MIG contains an OOM or runaway kernel to its hardware slice — the strongest, and the right choice whenever tenants don't trust each other.

Choose the isolation tier by tenant trust. Same team, cooperative inference → MPS is fine and cheaper. Different teams or untrusted code → MIG, and pay its throughput tax to buy hardware isolation.

Rollout — observability: measuring real utilization

You cannot raise utilization you can’t measure, and the default metric lies — so observability is the first thing to roll out, before any packing aggressiveness.

Why nvidia-smi lies

nvidia-smi “GPU util %” reports the fraction of the sample window during which at least one kernel was executing, so a single tiny kernel that keeps one SM busy reads as ~100%. Optimizing against it tells you a GPU handed out to a near-idle notebook is “100% utilized.” It is the single most misleading number in the stack; do not put it on the dashboard that drives decisions.
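A toy model makes the gap concrete. Both functions are deliberate simplifications rather than how the driver or DCGM compute their counters, and the 132-SM count assumes an H100 SXM part.

```python
def smi_style_util(busy_fraction_of_window: float) -> float:
    """Share of the sample window in which at least one kernel was running."""
    return busy_fraction_of_window


def approx_sm_active(busy_fraction_of_window: float, sms_used: int, total_sms: int) -> float:
    """Rough SM_ACTIVE: time share scaled by the share of SMs that had work."""
    return busy_fraction_of_window * sms_used / total_sms


# One tiny kernel running back-to-back on a single SM of an (assumed) 132-SM GPU.
util = smi_style_util(1.0)
active = approx_sm_active(1.0, sms_used=1, total_sms=132)

print(f"nvidia-smi style util: {util:.0%}, approx SM_ACTIVE: {active:.2%}")
```

The same workload reads as fully busy on one metric and under 1% on the other, which is the whole argument for keeping the first one off decision dashboards.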

The right SLIs

| Metric | What it tells you | Source |
| --- | --- | --- |
| SM_ACTIVE | Fraction of cycles with ≥1 warp resident | DCGM |
| SM_OCCUPANCY | How full the SMs are when active | DCGM |
| Tensor-core active | Are the tensor cores actually doing matmuls | DCGM |
| MFU | Model FLOPS Utilization vs peak (training) | Job-level |
| Allocatable vs used GPU-hours | Are GPUs handed out and actually computing | Scheduler |
| Fragmentation | Free-but-unschedulable slices | Scheduler |

Stack: DCGM-exporter → Prometheus → Grafana, with per-tenant GPU-hour accounting for chargeback.

Two utilization numbers, always reported together

Allocation utilization — are GPUs handed out to tenants?
Hardware utilization — are the handed-out GPUs actually computing (SM_ACTIVE/MFU)?

The gap between them is idle-but-reserved waste — capacity a tenant holds but doesn’t use. That gap, not the headline number, is where the recoverable utilization lives.
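One way to report the two numbers on a common denominator is sketched below. Expressing hardware utilization fleet-wide (allocation share times mean SM_ACTIVE on allocated GPUs) is an assumption of this example, and the 2,400 allocated GPUs and 0.25 SM_ACTIVE are illustrative inputs, not measurements.

```python
def utilization_report(total_gpus: int, allocated_gpus: int, mean_sm_active: float) -> dict[str, float]:
    """Allocation vs hardware utilization, both as a share of the whole fleet."""
    allocation = allocated_gpus / total_gpus
    hardware = allocation * mean_sm_active  # share of the fleet doing real work
    return {
        "allocation": allocation,
        "hardware": hardware,
        "idle_but_reserved": allocation - hardware,
    }


# 4,000-GPU fleet from the worked example; the other two inputs are hypothetical.
report = utilization_report(total_gpus=4_000, allocated_gpus=2_400, mean_sm_active=0.25)
for name, value in report.items():
    print(f"{name}: {value:.0%}")
```

With these inputs the fleet looks 60% used by allocation and 15% used by hardware, which mirrors the perceived-versus-real gap from the framing section.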

Fragmentation metric

Track free-but-unschedulable GPU/slice count: capacity that physically exists but no pending job fits, due to MIG geometry or topology constraints. This is hidden waste — the fleet looks full but isn’t.
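A deliberately simplified version of that metric, counting whole GPUs only: a node's free GPUs are stranded if no pending job's per-node ask fits there. The function name and inputs are hypothetical, and a real implementation would also have to account for MIG geometry and NVLink clique constraints.

```python
def free_but_unschedulable(free_per_node: dict[str, int], pending_per_node_asks: list[int]) -> int:
    """Free GPUs sitting on nodes where no pending job's per-node ask fits."""
    if not pending_per_node_asks:
        return 0  # nothing is waiting, so free capacity is idle rather than stranded
    smallest_ask = min(pending_per_node_asks)
    return sum(free for free in free_per_node.values() if 0 < free < smallest_ask)


free = {"n1": 3, "n2": 2, "n3": 8, "n4": 0}
stranded = free_but_unschedulable(free, pending_per_node_asks=[8, 4])
print(f"{stranded} of {sum(free.values())} free GPUs are unschedulable")
```

Tracking this count over time, rather than total free GPUs, is what shows whether defragmentation is actually paying off.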

Alerting

Idle hoarding — a tenant holding allocations at low SM_ACTIVE for a sustained window → notify and/or reclaim.
Sustained fragmentation → trigger defragmentation or a (rate-limited) MIG re-geometry.

Bottlenecks — failure modes, scaling & trade-off recap

Failure modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Scheduler outage | New scheduling stalls | Running pods keep computing; leader-elected scheduler, persisted quota/PodGroup state |
| Gang deadlock | Reservation never completes | Timeout + release; detect cycles where reservations can't progress |
| MIG re-geometry storm | GPUs drained repeatedly | Rate-limit geometry changes; treat layout as semi-static per node pool |
| Preemption storm | Oscillating preempt/resume | Hysteresis + minimum-run-time guarantee before a job is preemptible |
| Scheduler scale | Slow decisions at thousands of pending pods | Caching, batch admission, per-pool sharding |

HA notes

The control plane is off the data path: a scheduler outage must not kill running jobs. Running pods keep computing on their bound GPUs; only new scheduling pauses. Use a leader-elected scheduler with persisted quota and PodGroup state so a failover resumes cleanly.

Scaling the scheduler

Matching thousands of pending pods against topology constraints is combinatorially expensive. Use a scheduling framework with a cached cluster snapshot, batch admission (admit a gang as a unit, not pod-by-pod), and per-pool sharding so independent node pools schedule in parallel.

Final trade-off matrix

| Knob | Pushes toward | At the cost of |
| --- | --- | --- |
| MIG over MPS | Isolation, no noisy neighbor | Throughput tax, re-geometry cost |
| MPS over MIG | Throughput, packing density | Shared fault risk (trusted only) |
| Strict topology | Communication efficiency (MFU) | Fewer placements, higher queue time |
| Loose topology | Utilization, lower queue time | Slower steps for comm-bound jobs |
| Aggressive preemption | Inference p99 | Lost training progress |
| Lazy preemption | Training throughput | Risk to inference SLO |

Every knob moves the same four forces — utilization ↔ isolation ↔ gang feasibility ↔ SLO — and the right setting depends on tenant trust and workload mix, not on a universal default.

Summary

Four things separate Staff from Senior on this problem:

1. Picks the partitioning primitive by trust/isolation/throughput regime (MIG for hard multi-tenant isolation, MPS for trusted same-team inference, time-slicing only for dev) and accounts for the fractioning throughput tax and the MIG re-geometry cost. Not “just use MIG.”

2. Gets gang + quota + preemption correct under contention: PodGroup/minMember with no partial gangs, reclaimable over-quota borrowing, checkpoint-then-evict instead of kill, and bounded blast radius (smallest-sufficient victim, borrowers first). Not “higher priority wins.”

3. Makes placement topology-aware with real numbers (~900 GB/s per-GPU NVLink vs ~50 GB/s per-GPU cross-node IB with one 400G NDR NIC/GPU, an order-of-magnitude gap) and knows a TP group must stay on one NVLink island while DP can span nodes, and that MIG slices don’t do NVLink P2P.

4. Measures the right thing (SM_ACTIVE / MFU + allocatable-vs-used + fragmentation, never nvidia-smi “util %”) and ties every decision back to raising real utilization from ~15–20% toward 60%+ without breaking inference p99.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing & SLO classes | Separates training from inference; sets a utilization target | Defines explicit SLO classes (latency-critical inference, best-effort / interactive, preemptible training), ties them to priority/preemption policy, and frames success as raising real allocatable utilization (~15-20% → 60%+) without breaking inference p99 |
| Partitioning primitive choice | Knows MIG, MPS, and time-slicing exist and roughly what they do | Maps each primitive to a trust/isolation/throughput regime (MIG for hard multi-tenant isolation with hardware-partitioned SMs+memory and ~no noisy neighbor, MPS for trusted same-team inference with spatial sharing and no fault isolation, time-slicing for dev only) and accounts for the fractioning throughput tax and MIG reconfig cost |
| MIG geometry & bin-packing | Treats a GPU as N interchangeable slices | Reasons about valid MIG geometries (A100: 7 compute + 8 memory slices; profiles 1g.5gb…7g.40gb; not all combinations valid), packs jobs to minimize fragmentation, and treats re-partitioning as a disruptive, scheduled operation |
| Gang scheduling & quota/borrowing | Mentions all-or-nothing for distributed jobs | Specifies PodGroup/minMember semantics, reserves to avoid driver/worker deadlock, designs hierarchical quota with reclaimable over-quota borrowing, and prevents fragmentation-induced gang starvation |
| Preemption & checkpointing | Says higher priority preempts lower | Designs checkpoint-then-evict for training, bounds preemption blast radius (don't evict a whole 256-GPU gang for one inference pod), makes borrowed capacity reclaimable, and reasons about preemption cost vs SLO gain |
| Topology-aware placement | Tries to co-locate a job's GPUs on one node | Constrains tensor-parallel groups to a single NVLink island (~900 GB/s per-GPU NVLink intra-node vs ~50 GB/s per-GPU InfiniBand across nodes), uses DRA NVLink-clique labels / GFD, and trades packing density against communication cost explicitly |
| Isolation, observability & failure modes | Adds dashboards for GPU utilization | Names noisy-neighbor failure modes per primitive, measures the right SLI (SM_ACTIVE/MFU + allocatable-vs-used, not nvidia-smi 'util %'), and designs against OOM/runaway-kernel blast radius with the chosen isolation tier |
