Design a Fractional / Multi-Tenant GPU Sharing Platform

A GPU control plane is what companies like NVIDIA (Run:ai), CoreWeave, Lambda, and Together AI build and interview on: take a fleet of expensive accelerators that typically sit at ~15-20% real utilization, and drive it up by spatially partitioning GPUs and packing many tenants onto them — without one job's OOM or runaway kernel starving its neighbor. The Staff bar is reasoning about the gang-scheduling-vs-fragmentation tension, when MIG's hardware isolation is worth its throughput tax versus cheap-but-leaky MPS, and how to preempt a 500-GPU training run for a latency-critical inference burst without losing days of work.

Level: Staff
Category: AI System Design
Interview time: 60 min
WHAT THIS QUESTION TESTS
·Picks the right partitioning primitive per workload: MIG for hard multi-tenant isolation, MPS for trusted co-located inference, time-slicing only for dev/bursty — and knows the throughput tax of each
·Designs gang / all-or-nothing scheduling for distributed jobs with minMember/PodGroup semantics and explicit deadlock + fragmentation avoidance
·Handles preemption of training for latency-critical inference: priority classes, checkpoint-then-evict, bounded blast radius, and quota borrowing that's reclaimable
·Makes placement topology-aware: keeps a TP group inside one NVLink island (~900 GB/s per GPU) rather than spanning nodes over InfiniBand (~50 GB/s per GPU with one 400G NDR NIC per GPU)
★ STAFF-LEVEL SIGNALS
·Quantifies the utilization gap (perceived ~60% vs real ~14-20%) and ties every design choice back to driving allocatable utilization up without hurting p99 or MFU
·Names the noisy-neighbor failure modes precisely (MPS lacks fault isolation, so one client's fault/OOM can crash co-tenants; time-slicing isolates neither memory nor faults) and isolates accordingly
·Treats MIG reconfiguration as a first-class cost: profiles are set at the GPU level, draining + re-partitioning is disruptive, so geometry churn is a scheduling decision, not a free knob
·Defines the right SLI: allocatable-vs-used and SM_ACTIVE/MFU, not the misleading nvidia-smi 'GPU util %' that reads 100% for a single tiny kernel
0

Frame — the utilization crisis & who's asking

“A $30k/yr GPU at 15% utilization is ~$25k/yr of waste — the whole job is to safely raise that number without breaking inference p99.”

GPU sharing is a control-plane problem dressed up as a hardware problem. The fleet is the most expensive line item in the company, and most of it is idle most of the time. The job is to drive real utilization up by spatially partitioning GPUs and packing many tenants onto them — while guaranteeing that one tenant’s OOM, runaway kernel, or 3 a.m. training crash never starves the inference replica sitting next to it.

The numbers that motivate the whole design. Most GPU clusters run at roughly 14–20% average real utilization while the teams who own them perceive something closer to 60%. Industry surveys (e.g. the State of AI Infra reports) find that ~83% of organizations admit they underutilize their AI hardware. The gap between perceived and real is the prize: closing it is worth millions per year on a large fleet.
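
The waste arithmetic in the framing quote can be reproduced in a few lines. This is a minimal sketch: the $30k/yr price and 15% utilization come from the quote above, while applying the same figures uniformly to the 4,000-GPU worked example is an assumption made only to show the order of magnitude.

```python
def idle_cost_per_year(annual_cost_usd: float, real_utilization: float) -> float:
    """Dollars per year paid for GPU time that does no useful work."""
    if not 0.0 <= real_utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return annual_cost_usd * (1.0 - real_utilization)


# Figures from the framing quote: a $30k/yr GPU at 15% real utilization.
per_gpu = idle_cost_per_year(30_000, 0.15)

# Assumption: every GPU in a 4,000-GPU fleet has the same price and utilization.
fleet = 4_000 * per_gpu

print(f"per GPU: ${per_gpu:,.0f}/yr, fleet: ${fleet:,.0f}/yr")
```

Raising `real_utilization` is the only lever in this formula, which is why the rest of the design is judged against that number.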

Who asks this & what they probe

| Role | What they own here | What the interviewer probes |
| --- | --- | --- |
| SDE | The control plane: quota accountant, gang scheduler, bin-packer, preemptor | Correctness under contention: no partial gangs, no quota leaks, no deadlock, bounded preemption blast radius |
| MLE | How job shapes map to hardware | A 7B replica wants one MIG slice (1g.10gb if quantized; FP16 weights alone are ~14 GB, so 2g.20gb); TP-8 wants one NVLink island; the FLOPS cost of fractioning; SLO classes |
| Switcher (SDE → AI) | What transfers from Borg/K8s vs the new GPU physics | Bin-packing/quotas transfer; MIG geometry, gang semantics, NVLink topology, checkpoint-preempt are new |

Lead with what transfers, then go deep on the GPU physics. A switcher should say: “I know container scheduling, hierarchical quota, and bin-packing from Borg. The new surface is that a GPU is not a bag of CPU millicores — it has MIG slice geometry (7 compute + 8 memory units, not arbitrary fractions), gang/all-or-nothing semantics, NVLink topology that matters an order of magnitude for placement, and preempt-with-checkpoint instead of just killing a pod.”

SLO classes drive every policy

Define three classes up front, because they determine priority and preemption:

1. Latency-critical inference — has a p99 SLO, never preemptible, may cause preemption.

2. Best-effort / interactive — notebooks, dev, evals; tolerates queuing.

3. Preemptible batch training — long-running, checkpointable, first to be preempted.

The central tension

Four forces pull against each other, and naming them up front is the Staff signal:

  • Utilization wants tight packing and aggressive GPU sharing.
  • Isolation wants tenants kept apart (no noisy neighbor).
  • Gang feasibility wants large contiguous, topology-aligned blocks free at once.
  • SLO wants the ability to preempt fast for an inference burst.

Every design knob below moves these four. There is no globally correct setting — only the right setting for a given tenant-trust and workload mix.

Scope. This is the control plane + scheduler. It is not the serving engine (batching, KV-cache, speculative decoding — that’s a model-serving design) and not a generic CPU/memory container scheduler (that’s a Borg/task-scheduler design). We assume containers, networking, and storage exist; we own which accelerator each job lands on and when.

1

Requirements & scale envelope

Functional requirements

  • Submit jobs of arbitrary GPU shape: a fraction of a GPU, 1, 8, or 256 GPUs.
  • Per-team quota with borrowing of idle capacity above quota.
  • Gang admission: all-or-nothing for distributed jobs.
  • Preempt-for-inference: reclaim training capacity for latency-critical bursts.
  • Topology constraints: keep a tensor-parallel group on one NVLink island.
  • Observability: surface real utilization and fragmentation per tenant.

Non-functional requirements

  • Scheduling decision latency: inference admission p99 under ~1–2s; batch may queue for minutes.
  • Fairness across tenants; no single tenant monopolizes borrowed idle capacity.
  • No partial gangs; bounded preemption blast radius.
  • HA control plane: a scheduler outage must not kill running jobs.

Scale envelope

| Dimension | Value (worked example) |
| --- | --- |
| Cluster size | ~4,000 GPUs = 500 × 8-GPU H100 SXM nodes |
| Tenants | Hundreds of teams/projects |
| Job mix | Thousands of small inference replicas + tens of large training gangs |
| Largest gang | 256–512 GPUs spanning many nodes over InfiniBand |
| Smallest job | 1 MIG slice (a fraction of one GPU) |
| Inference admit latency | p99 under ~1–2s |

Job-shape examples

| Workload | Shape | Hardware ask |
| --- | --- | --- |
| 7B inference replica | Fits in one slice | 1g.10gb or 2g.20gb MIG on H100 |
| 70B inference | Tensor-parallel | TP-4/8 on one NVLink node |
| Pre-training gang | 64–512 GPUs | Many nodes, rail-aligned IB |
| Dev / eval notebook | Bursty, low duty cycle | Time-sliced share |
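
The first two rows can be sanity-checked with weight-footprint arithmetic. The sketch below is illustrative only: the 20% headroom for KV cache and activations is an assumed placeholder, the helper names are hypothetical, and the slice sizes are the nominal H100 profile sizes (usable memory per slice is somewhat lower in practice).

```python
from __future__ import annotations

# Nominal H100-80GB MIG slice sizes in GB.
H100_SLICES = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}

HEADROOM = 1.2  # assumed 20% extra for KV cache and activations


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bytes_per_param


def smallest_slice(need_gb: float, slices: dict[str, int]) -> str | None:
    """Smallest slice whose nominal size covers need_gb, or None if none does."""
    fitting = [(gb, name) for name, gb in slices.items() if gb >= need_gb]
    return min(fitting)[1] if fitting else None


for label, params, width in [("7B INT8", 7, 1), ("7B FP16", 7, 2), ("70B FP16", 70, 2)]:
    need = weights_gb(params, width) * HEADROOM
    print(f"{label}: ~{need:.1f} GB -> {smallest_slice(need, H100_SLICES)}")
```

Under these assumptions a quantized 7B model fits the smallest slice, an FP16 7B model needs the next size up, and a 70B FP16 model fits no single slice or GPU, which is why it appears in the table as a tensor-parallel job.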

Fractioning is not free. Splitting a GPU into 7 MIG slices yields ~7 isolated instances, but each gets only a fraction of the SMs and a fraction of HBM bandwidth. For bandwidth-bound or large-kernel workloads, aggregate throughput across the 7 slices is less than one undivided GPU. Fractioning buys isolation and packing density, and it pays a throughput tax — quantify the tax per workload rather than assuming sharing is always a win.

Explicitly out of scope: the inference batching / KV-cache engine, model-weights storage, and the data pipeline.

2

Estimation — partitioning primitives: MIG vs MPS vs time-slicing

This is the first irreversible choice, so we estimate the cost of each primitive before picking. There are exactly three ways to put more than one job on a single physical GPU.

| Primitive | Mechanism | Granularity | Memory isolation | Fault isolation | Throughput tax | Noisy neighbor |
| --- | --- | --- | --- | --- | --- | --- |
| MIG | HW-partitioned SMs + memory | Fixed slice profiles | Strong (hardware) | Strong (hardware) | Per-slice fraction of SM/BW | Near-zero |
| MPS | Spatial, shared context | Per-process % | Address-space sep. only | None | Low (no ctx-switch) | Real (shared fault) |
| Time-slice | Round-robin temporal | Whole GPU in turns | None | None | Context-switch overhead | Severe |

MIG (Multi-Instance GPU) hardware-partitions one GPU into independent instances, each with dedicated SMs, L2 slices, and memory channels. On an A100-40GB the budget is 7 compute slices + 8 memory slices, exposed as profiles 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb. H100-80GB is analogous with doubled memory per profile (1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb). Because the partition is in silicon, an OOM or a runaway kernel in one instance cannot touch another: this is the strongest isolation, with near-zero noisy neighbor.

MPS (Multi-Process Service) lets multiple processes submit kernels to one GPU concurrently with no context-switch overhead — great for packing many small kernels that each underfill the GPU. On Volta+ each client gets its own GPU address space (some memory protection), but there is no fault isolation: one client’s fatal fault or runaway can crash co-tenants sharing the MPS daemon. Use it only for trusted, same-team inference.

Time-slicing round-robins the whole GPU between processes in time. It isolates neither memory nor faults and is pure oversubscription. It’s fine for dev/CI where bursty, low-duty-cycle jobs rarely collide — and dangerous for anything else.

When to use which

  • MIG → hard multi-tenant isolation: different teams sharing one GPU, or any untrusted co-tenancy.
  • MPS → trusted, co-located inference that needs many concurrent small kernels (same team, same trust domain).
  • Time-slicing → dev / CI / interactive only.

Staff nuance — MIG geometry is a scheduling cost, not a free knob. A MIG layout is set at the GPU level, and changing it requires draining every instance on that GPU and re-partitioning — disruptive to anything running. So “re-geometry this GPU from 7×1g to 1×7g” is a scheduled operation with a cost, not something the bin-packer does per job. And not all geometries are valid: you can’t freely mix arbitrary slice sizes; the scheduler must pick from the supported profile layouts for that GPU.
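
A scheduler can reject obviously impossible layouts with a slice-budget check before it ever considers a re-geometry. The sketch below uses the A100-40GB profiles named earlier with their compute/memory slice costs. It is a necessary condition only: the real driver also restricts where each profile may be placed on the GPU, so a layout that passes this check can still be invalid.

```python
from __future__ import annotations

# profile -> (compute slices, memory slices) on an A100-40GB
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_BUDGET, MEMORY_BUDGET = 7, 8


def fits_budget(layout: list[str]) -> bool:
    """True if the layout stays within 7 compute and 8 memory slices.

    Necessary but not sufficient: placement rules are not modeled here.
    """
    compute = sum(A100_40GB_PROFILES[p][0] for p in layout)
    memory = sum(A100_40GB_PROFILES[p][1] for p in layout)
    return compute <= COMPUTE_BUDGET and memory <= MEMORY_BUDGET


print(fits_budget(["1g.5gb"] * 7))                    # seven small slices
print(fits_budget(["4g.20gb", "3g.20gb"]))            # uses the full budget
print(fits_budget(["3g.20gb", "3g.20gb", "1g.5gb"]))  # compute fits, memory does not
```

The last layout is the instructive one: it needs only 7 compute slices but 9 memory slices, so a compute slice is stranded whenever two 3g.20gb instances share a GPU.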

3

API design — scheduler architecture & quota/borrowing

The request contract

A job declares its shape and class; the control plane decides where and when. With Dynamic Resource Allocation (DRA), GA in Kubernetes 1.34, a job can request structured resources instead of an opaque integer GPU count:

apiVersion: scheduling/v1
kind: GpuJob
metadata: { name: llama70b-tp8, namespace: team-search }
spec:
  sloClass: latency-critical-inference   # never preemptible
  priorityClass: inference-high
  podGroup:
    minMember: 8                         # gang: all-or-nothing
  resourceClaim:
    gpus: 8
    constraints:
      sameNVLinkClique: true             # TP must stay on one island
      minMemoryPerGpuGB: 80
      partition: full                    # not a MIG slice
  preemptionPolicy: { preemptible: false }

A small inference replica instead asks for a slice. The replica is not preemptible while it runs inside its team’s guaranteed quota; a replica placed on capacity borrowed above quota stays reclaimable when the owner returns:

resourceClaim:
  gpus: 1
  constraints: { migProfile: "1g.10gb" }
preemptionPolicy: { preemptible: false }   # applies to in-quota capacity;
                                           # borrowed capacity is reclaimable

Build vs buy

The mainstream 2026 answer is Kubernetes-native: a gang scheduler (Volcano, Kueue, or NVIDIA’s KAI Scheduler) plus DRA for structured GPU claims. A hyperscaler with extreme scale may run a Borg-like custom scheduler instead. We’ll describe the K8s+DRA shape; the concepts port to custom.

Components

Job API / CRD
      |
Admission + Quota controller   <-- hierarchical quota, borrowing
      |
Scheduler ( gang plug-in
          + topology-aware bin-packer
          + preemptor )
      |
DRA driver / device plugin     <-- structured claims -> devices
      |
Node agent                     <-- configures MIG, runs MPS
                                   daemon, binds slices to pods

Hierarchical quota & borrowing

Quota is a tree: org → team → project. Each node in the tree has a guaranteed quota (always honored) plus an over-quota borrowing weight (its share of idle fleet capacity that it may temporarily use).

| Concept | Rule |
| --- | --- |
| Guaranteed quota | Always schedulable for the owner; never preempted away |
| Borrowing | Use idle capacity above quota when the fleet is slack |
| Reclaim | Borrowed capacity is preempted first when the owner returns |
| Fair-share | Time-based fairshare prevents one tenant hoarding idle capacity |

The key invariant: borrowed capacity is reclaimable. When the rightful owner submits, the scheduler preempts the borrower — so borrowing is always safe to grant and never a liability.
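
The borrow/reclaim invariant can be sketched as a small accountant. This is a toy model under stated assumptions: a flat set of teams rather than the full org/team/project tree, a single GPU count per team, and hypothetical names (`Team`, `admit`). Fair-share weighting is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    guaranteed: int  # GPUs always honored for this team
    used: int = 0    # GPUs currently held (in-quota + borrowed)

    @property
    def borrowed(self) -> int:
        return max(0, self.used - self.guaranteed)


def admit(teams: dict, capacity: int, name: str, gpus: int):
    """Try to admit `gpus` for team `name`.

    Returns a list of (borrower, gpus_reclaimed) pairs on success (possibly
    empty), or None if the request must queue. Mutates `teams` only on success.
    """
    team = teams[name]
    free = capacity - sum(t.used for t in teams.values())
    within_quota = team.used + gpus <= team.guaranteed
    shortfall = max(0, gpus - free)
    reclaims = []
    if shortfall:
        if not within_quota:
            return None  # borrowing never preempts anyone
        # The owner is back: reclaim from borrowers, largest borrower first.
        for other in sorted(teams.values(), key=lambda t: -t.borrowed):
            if other is team or shortfall == 0:
                continue
            take = min(other.borrowed, shortfall)
            if take:
                reclaims.append((other.name, take))
                shortfall -= take
        if shortfall:
            return None  # only possible if guarantees exceed capacity
    for victim, take in reclaims:
        teams[victim].used -= take
    team.used += gpus
    return reclaims


teams = {"a": Team("a", guaranteed=8), "b": Team("b", guaranteed=8)}
print(admit(teams, 16, "b", 12))  # b borrows 4 idle GPUs, nobody is preempted
print(admit(teams, 16, "a", 8))   # a returns and reclaims b's borrowed 4
print(admit(teams, 16, "b", 1))   # fleet full, b is at quota: must queue
```

The asymmetry is the point: an in-quota request may trigger reclaim, while an over-quota request is only ever served from idle capacity.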

Bin-packing objective

Minimize GPU fragmentation while honoring topology: don’t strand capacity that no pending job can use, such as single 1g slices scattered across GPUs when the queue holds only 3g requests. These two goals conflict: the tightest pack may scatter a gang across nodes, and the most topology-clean placement may strand capacity. State the tradeoff explicitly; it’s a tunable objective, not a fixed rule.

DRA is the enabler here: a job can say “4 H100 in the same NVLink clique, ≥80GB HBM each” as a structured claim rather than “4 GPUs,” so the scheduler reasons about topology natively.
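
The fragmentation half of that objective is often approximated with a best-fit rule. The sketch below is one such heuristic, not the method of any particular scheduler: it counts only free compute slices per GPU and ignores both MIG placement rules and topology, and the function name is hypothetical.

```python
from __future__ import annotations


def best_fit(free_slices: dict[str, int], need: int) -> str | None:
    """Pick the GPU whose free compute slices leave the smallest leftover.

    Best-fit keeps large free blocks intact for later large requests.
    Returns None when no single GPU can host the request.
    """
    candidates = [
        (free - need, gpu) for gpu, free in free_slices.items() if free >= need
    ]
    if not candidates:
        return None
    _, gpu = min(candidates)
    return gpu


free = {"gpu0": 7, "gpu1": 3, "gpu2": 4}
choice = best_fit(free, need=3)
print(choice)  # the exact-fit GPU, leaving gpu0 whole for a later 7g request
```

A first-fit rule would have carved the 3g request out of `gpu0` and left no GPU able to host a full 7g instance; a topology-aware scorer would add a second term to the tuple being minimized.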

4

Data model — gang scheduling for distributed jobs

Why all-or-nothing

A tensor-parallel or data-parallel training step needs every rank running simultaneously: rank 3 can’t start the all-reduce until every other rank is alive. If the scheduler admits 200 of 256 pods, those 200 GPUs sit idle and reserved, burning money while their pods spin waiting for the missing 56. Gang scheduling prevents exactly this: nothing starts until the whole group can.

PodGroup / minMember

Volcano and KAI model this with a PodGroup carrying minMember; Kueue reaches the same all-or-nothing behavior by admitting a whole Workload as a unit. The Volcano form:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata: { name: pretrain-256 }
spec:
  minMember: 256          # none start until 256 slots reservable
  minResources:           # reserve to avoid driver/worker deadlock
    nvidia.com/gpu: 256
  queue: team-foundation

minMember: 256 means the scheduler reserves 256 slots atomically; if it can’t, zero pods bind and the gang waits as a unit.
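
The all-or-nothing rule amounts to "plan first, commit only if the plan is complete". A minimal sketch of that shape, assuming a simple per-node free-GPU map and a hypothetical `try_admit_gang` helper (real schedulers also check topology, quota, and priorities at this step):

```python
def try_admit_gang(free_gpus: dict, min_member: int, gpus_per_pod: int = 1):
    """Place exactly `min_member` pods or none at all.

    Returns {node: pod_count} and deducts capacity on success; returns None
    and leaves `free_gpus` untouched if the whole gang cannot fit.
    """
    plan, remaining = {}, min_member
    # Fill the nodes with the most free GPUs first so the gang spans fewer nodes.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        pods = min(free // gpus_per_pod, remaining)
        if pods:
            plan[node] = pods
            remaining -= pods
        if remaining == 0:
            break
    if remaining:
        return None  # the gang waits as a unit; nothing was reserved
    for node, pods in plan.items():
        free_gpus[node] -= pods * gpus_per_pod  # commit the whole plan at once
    return plan


free = {"n1": 8, "n2": 8, "n3": 4}
print(try_admit_gang(free, min_member=24))  # 20 free GPUs: refused, no partial bind
print(try_admit_gang(free, min_member=16))  # fits on two full nodes
print(free)
```

Because the refused 24-pod gang reserved nothing, the 16-pod gang behind it can still be admitted.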

Failure modes to design against

  • Driver/worker deadlock. A classic Spark-style trap: a coordinator pod grabs a GPU and waits for workers it can never schedule because it is holding the last slot. Fix: reserve minResources for the whole group so partial acquisition can't happen, plus a timeout to release a gang stuck reserving.
  • Fragmentation starvation. Many small jobs scatter across nodes and leave no contiguous, topology-aligned block for a 256-GPU gang — so large gangs starve indefinitely even though total free GPUs exceed 256. Mitigate with reservation + backfill and periodic defragmentation.
  • Backfill. While accumulating slots for a big gang, run short, evictable best-effort jobs in the gaps so the reserved-but-not-yet-full capacity still does useful work — then evict them the instant the gang can land.

minMember alone is necessary but not sufficient: the 256 must land on the right NVLink/IB topology, which is the next step.

5

High-level architecture — topology-aware placement & NVLink islands

The bandwidth hierarchy

Placement matters because the interconnect is wildly non-uniform:

| Link | Bandwidth (per GPU) | Scope |
| --- | --- | --- |
| H100 NVLink-4 via NVSwitch | ~900 GB/s bidirectional | Within an 8-GPU node (non-blocking) |
| InfiniBand NDR 400G | ~50 GB/s per 400G NIC (one NIC/GPU) | Across nodes |

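On these headline figures that is an 18x gap per GPU (900 vs 50 GB/s). The NVLink number is bidirectional and the NIC number is per direction, so a like-for-like comparison is closer to 9x; either way the gap is roughly an order of magnitude. Where a job’s GPUs sit changes its throughput dramatically.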

Placement rules

  • Tensor-parallel (TP) groups must stay inside one NVLink island (one node). TP exchanges activations every layer — it is the most communication-heavy parallelism. Spanning a TP group across nodes drops it from ~900 GB/s onto ~50 GB/s per-GPU IB and tanks step throughput.
  • Data-parallel (DP) replicas can span nodes. DP syncs gradients once per step (less frequent), so the cross-node penalty is tolerable.
  • For a 70B TP-8 job: pin all 8 ranks to one node's NVSwitch domain. For a DP-over-TP job: each TP-8 group on its own node, DP across nodes over IB.
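
The cost of breaking the first rule can be estimated with the standard bandwidth-only model of a ring all-reduce, t ≈ 2(N−1)/N × S/B. The sketch below plugs in the headline per-GPU bandwidths from the table above; the 1 GB payload is a hypothetical figure, latency and overlap with compute are ignored, and the absolute times matter less than the ratio.

```python
def ring_allreduce_seconds(size_bytes: float, n_gpus: int, gb_per_s: float) -> float:
    """Bandwidth-only ring all-reduce estimate: 2*(N-1)/N * S / B."""
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / (gb_per_s * 1e9)


payload = 1e9  # assumed: 1 GB exchanged per all-reduce

nvlink = ring_allreduce_seconds(payload, n_gpus=8, gb_per_s=900)
ib = ring_allreduce_seconds(payload, n_gpus=8, gb_per_s=50)

print(f"NVLink: {nvlink * 1e3:.2f} ms")
print(f"IB:     {ib * 1e3:.2f} ms")
print(f"ratio:  {ib / nvlink:.0f}x")
```

Because TP pays this cost on every layer while DP pays it once per step, the same ratio hurts a cross-node TP group far more than a cross-node DP group.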

How the scheduler knows the topology

DRA + GPU Feature Discovery labels nodes with nvidia.com/gpu.clique (NVLink Domain + Clique ID). The scheduler uses node affinity / structured claims to keep a gang within one clique. For DP gangs that must cross nodes, use rail-optimized placement: keep each cross-node all-reduce on the same IB rail / leaf switch so traffic avoids oversubscribed spine hops.
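
Once nodes carry a clique identifier, keeping a gang inside one clique reduces to a group-by. The sketch below assumes each node is a plain dict whose `clique` field stands in for the value of the nvidia.com/gpu.clique label; the field names, the label values, and `pick_clique` are all hypothetical, and rail alignment is not modeled.

```python
from collections import defaultdict


def pick_clique(nodes: list, gpus_needed: int):
    """Return (clique_id, node_names) for the tightest clique that fits, else None.

    "Tightest" means the fitting clique with the fewest free GPUs, so larger
    cliques stay available for larger gangs.
    """
    free_by_clique = defaultdict(int)
    members = defaultdict(list)
    for node in nodes:
        free_by_clique[node["clique"]] += node["free_gpus"]
        members[node["clique"]].append(node["name"])
    fitting = [(free, c) for c, free in free_by_clique.items() if free >= gpus_needed]
    if not fitting:
        return None  # no single clique can host the gang; it waits
    _, clique = min(fitting)
    return clique, members[clique]


nodes = [
    {"name": "node-a", "clique": "dom1.0", "free_gpus": 8},
    {"name": "node-b", "clique": "dom1.1", "free_gpus": 4},
    {"name": "node-c", "clique": "dom1.1", "free_gpus": 2},
]
print(pick_clique(nodes, 8))
print(pick_clique(nodes, 4))
print(pick_clique(nodes, 16))
```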

The tradeoff to name

Strict topology constraints reduce the set of valid placements, which can lower utilization and raise queue time — sometimes a gang waits for the right island while the wrong islands sit idle. Packing density vs communication efficiency is a real dial, and the right setting depends on how comm-bound the workload is.

MIG ≠ NVLink. MIG instances do not participate in NVLink P2P the way full GPUs do. So MIG is for independent small jobs, never for a TP group that needs NVLink — a TP group needs whole GPUs on one island.

6

Deep dive — preemption, isolation & noisy-neighbor defense

WHERE STAFF IS WON

This is where Staff is won. Two hard problems live here: preempting a giant training run for an inference burst without losing days of work, and choosing the isolation tier that contains the right blast radius per tenant.

Preempt training for inference — checkpoint-then-evict

When a latency-critical inference burst arrives and the fleet is full of preemptible training, the naive move — kill a training pod — throws away hours or days of progress. The correct flow:

inference burst (priority=high) needs N GPUs
        |
  1. select victims      -> prefer borrowers, then lowest-priority,
                            smallest-sufficient training gang
  2. signal training     -> "checkpoint now" (cooperative)
  3. wait for checkpoint    (bounded grace period)
  4. evict gang, free GPUs
  5. schedule inference
  ...
  6. later: training resumes FROM CHECKPOINT when capacity returns

The difference between checkpoint-then-evict and kill is the difference between losing minutes (the time since the last checkpoint) and losing days. A grace period bounds how long we wait for the checkpoint before forcing eviction.
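
Steps 2 through 4 of the flow can be sketched as a bounded wait. This is an illustration, not a real scheduler hook: the three callables are hypothetical stand-ins for "signal the job", "has the checkpoint landed", and "evict the gang", and the demo fakes them in-process.

```python
import time
from typing import Callable


def checkpoint_then_evict(
    request_checkpoint: Callable[[], None],
    checkpoint_done: Callable[[], bool],
    evict: Callable[[], None],
    grace_seconds: float,
    poll_seconds: float = 0.01,
) -> bool:
    """Signal, wait up to `grace_seconds`, then evict.

    Returns True if the checkpoint landed before eviction, False if the
    grace period expired and the eviction was forced.
    """
    request_checkpoint()
    deadline = time.monotonic() + grace_seconds
    saved = checkpoint_done()
    while not saved and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        saved = checkpoint_done()
    evict()  # eviction happens either way; the grace period bounds the wait
    return saved


# Demo with fakes: the "job" reports its checkpoint done on the third poll.
events, polls = [], {"n": 0}


def fake_done() -> bool:
    polls["n"] += 1
    return polls["n"] >= 3


saved = checkpoint_then_evict(
    request_checkpoint=lambda: events.append("signal"),
    checkpoint_done=fake_done,
    evict=lambda: events.append("evict"),
    grace_seconds=1.0,
)
print(saved, events)
```

The return value is worth recording per eviction: a high rate of forced evictions suggests the grace period is shorter than the jobs' real checkpoint time.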

Bound the blast radius

The cardinal sin is evicting a whole 256-GPU training gang to free one MIG slice for one inference pod. Rules:

  • Smallest-sufficient victim. Free exactly the capacity needed, from the cheapest-to-reclaim source.
  • Reclaim borrowers first. Capacity borrowed over-quota is the first to go — that's the contract that made borrowing safe.
  • Gang-granularity preemption only when necessary. Because a gang is all-or-nothing, preempting any member kills the whole gang — so only preempt a gang when no smaller victim suffices, and pick the smallest / lowest-priority gang.
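
One way to encode these rules is a sort key. The sketch below follows the ordering in the flow above (borrowers first, then lowest priority, then smallest gang) and picks a single victim; that ordering is one reading of the rules, multi-victim combinations are not considered, and `Gang` and `pick_victim` are hypothetical names. It also assumes idle capacity has already been checked and found insufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gang:
    name: str
    gpus: int
    priority: int   # lower value = evicted first
    borrowed: bool  # True if running on over-quota capacity


def pick_victim(gangs: list, gpus_needed: int):
    """Choose one gang to preempt, or None if no single gang is large enough."""
    sufficient = [g for g in gangs if g.gpus >= gpus_needed]
    if not sufficient:
        return None
    # Tuple order encodes the policy: borrowers, then priority, then size.
    return min(sufficient, key=lambda g: (not g.borrowed, g.priority, g.gpus))


gangs = [
    Gang("pretrain-256", gpus=256, priority=1, borrowed=False),
    Gang("finetune-8", gpus=8, priority=1, borrowed=False),
    Gang("sweep-16", gpus=16, priority=2, borrowed=True),
]
print(pick_victim(gangs, 8).name)      # the borrower goes first
print(pick_victim(gangs[:2], 8).name)  # no borrowers: smallest sufficient gang
```

With no borrower present, the 8-GPU gang is chosen over the 256-GPU gang, which is exactly the blast-radius bound described above.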

Preemption is a policy threshold, not a reflex

Preemption cost (lost training progress, restart/warmup time) must be weighed against the inference SLO gain. If the inference burst can be served from idle or borrowed capacity, don’t preempt training at all. “Always preempt for higher priority” is the Senior answer; “preempt when the SLO gain exceeds the progress-loss cost, and prefer non-disruptive sources first” is the Staff answer.

Noisy-neighbor failure modes per primitive

The isolation tier you chose in Step 2 determines what a misbehaving tenant can do to its neighbors:

| Primitive | OOM blast radius | Runaway kernel | Use when |
| --- | --- | --- | --- |
| Time-slicing | Crashes neighbors (shared memory) | Starves neighbors | Single trusted tenant only |
| MPS | Address-space separated, but a fatal fault can crash co-clients | Can hog SMs | Trusted, same-team inference |
| MIG | Contained to the slice (HW) | Contained to the slice | Untrusted / multi-tenant |

  • Time-slicing isolates neither memory nor faults: one client's OOM can crash everyone sharing the GPU. Only safe single-tenant.
  • MPS gives memory address-space separation but no fault isolation: one client's fatal fault can take down co-clients on the same MPS daemon. Enforce per-client memory limits and accept the shared-fault risk only within one trust domain.
  • MIG contains an OOM or runaway kernel to its hardware slice. It is the strongest tier, and the right choice whenever tenants don't trust each other.

Choose the isolation tier by tenant trust. Same team, cooperative inference → MPS is fine and cheaper. Different teams or untrusted code → MIG, and pay its throughput tax to buy hardware isolation.

7

Rollout — observability: measuring real utilization

You cannot raise utilization you can’t measure, and the default metric lies — so observability is the first thing to roll out, before any packing aggressiveness.

Why nvidia-smi lies

nvidia-smi “GPU util %” reports the share of the sample window during which at least one kernel was executing, regardless of how many SMs that kernel used. A single tiny kernel that keeps one SM busy therefore reads as ~100%. Optimizing against it tells you a GPU handed out to a near-idle notebook is “100% utilized.” It is the single most misleading number in the stack; do not put it on the dashboard that drives decisions.
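
The gap is easy to see in a toy model. The sketch below does not read real counters: it represents each sample tick as "number of SMs busy" and computes a time-based utilization alongside an SM-weighted one. The 132-SM count matches an H100 SXM; the workload trace is invented for illustration.

```python
TOTAL_SMS = 132  # SM count of an H100 SXM


def util_percent(samples: list) -> float:
    """Share of sample ticks in which at least one kernel was running."""
    return 100 * sum(1 for busy_sms in samples if busy_sms > 0) / len(samples)


def sm_active_percent(samples: list, total_sms: int) -> float:
    """Average share of SMs busy across the same ticks."""
    return 100 * sum(samples) / (len(samples) * total_sms)


# Invented trace: a tiny kernel keeps exactly one SM busy on every tick.
samples = [1] * 100

print(f"time-based util: {util_percent(samples):.0f}%")
print(f"SM-weighted:     {sm_active_percent(samples, TOTAL_SMS):.2f}%")
```

Both numbers describe the same GPU over the same window; only the second one says how much of the silicon was doing work.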

The right SLIs

| Metric | What it tells you | Source |
| --- | --- | --- |
| SM_ACTIVE | Fraction of cycles with ≥1 warp resident | DCGM |
| SM_OCCUPANCY | How full the SMs are when active | DCGM |
| Tensor-core active | Are the tensor cores actually doing matmuls | DCGM |
| MFU | Model FLOPS Utilization vs peak (training) | Job-level |
| Allocatable vs used GPU-hours | Are GPUs handed out and actually computing | Scheduler |
| Fragmentation | Free-but-unschedulable slices | Scheduler |

Stack: DCGM-exporter → Prometheus → Grafana, with per-tenant GPU-hour accounting for chargeback.

Two utilization numbers, always reported together

  • Allocation utilization — are GPUs handed out to tenants?
  • Hardware utilization — are the handed-out GPUs actually computing (SM_ACTIVE/MFU)?

The gap between them is idle-but-reserved waste — capacity a tenant holds but doesn’t use. That gap, not the headline number, is where the recoverable utilization lives.
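
Reporting the two numbers together takes only a small calculation. In the sketch below, `busy_gpu_hours` is assumed to be allocated GPU-hours weighted by SM_ACTIVE (or MFU), and the daily figures are hypothetical values for a 4,000-GPU fleet, chosen to land on the perceived-vs-real split discussed earlier.

```python
def utilization_report(total_gpu_hours, allocated_gpu_hours, busy_gpu_hours):
    """Allocation vs hardware utilization, plus the idle-but-reserved gap.

    busy_gpu_hours: allocated hours weighted by SM_ACTIVE or MFU (assumption).
    """
    hardware = busy_gpu_hours / allocated_gpu_hours if allocated_gpu_hours else 0.0
    return {
        "allocation": allocated_gpu_hours / total_gpu_hours,
        "hardware": hardware,
        "real": busy_gpu_hours / total_gpu_hours,
        "idle_but_reserved_gpu_hours": allocated_gpu_hours - busy_gpu_hours,
    }


# Hypothetical day on a 4,000-GPU fleet: 96,000 GPU-hours available.
report = utilization_report(
    total_gpu_hours=96_000,
    allocated_gpu_hours=57_600,
    busy_gpu_hours=14_400,
)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this example the allocation figure is what a team perceives, while the product of allocation and hardware utilization is what the fleet actually delivers.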

Fragmentation metric

Track free-but-unschedulable GPU/slice count: capacity that physically exists but no pending job fits, due to MIG geometry or topology constraints. This is hidden waste — the fleet looks full but isn’t.
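
A first-order version of this metric can be computed from the scheduler's own state. The sketch below counts free compute slices on GPUs too small for even the smallest pending request; it is deliberately simplified (one size per job, no topology constraints, no MIG placement rules), and the function name is hypothetical.

```python
def unschedulable_free(free_per_gpu: dict, pending_sizes: list) -> int:
    """Free compute slices sitting on GPUs where no pending job fits."""
    if not pending_sizes:
        return 0  # with no demand, nothing is stranded relative to the queue
    smallest = min(pending_sizes)
    return sum(free for free in free_per_gpu.values() if 0 < free < smallest)


free = {"gpu0": 1, "gpu1": 2, "gpu2": 1, "gpu3": 3}
pending = [3, 4]  # slice counts requested by queued jobs

stranded = unschedulable_free(free, pending)
print(f"{stranded} of {sum(free.values())} free slices are unschedulable")
```

Here more than half of the nominally free capacity cannot serve the queue, which is the "looks full but isn't" condition the metric is meant to expose.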

Alerting

  • Idle hoarding — a tenant holding allocations at low SM_ACTIVE for a sustained window → notify and/or reclaim.
  • Sustained fragmentation → trigger defragmentation or a (rate-limited) MIG re-geometry.

8

Bottlenecks — failure modes, scaling & trade-off recap

Failure modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Scheduler outage | New scheduling stalls | Running pods keep computing; leader-elected scheduler, persisted quota/PodGroup state |
| Gang deadlock | Reservation never completes | Timeout + release; detect cycles where reservations can't progress |
| MIG re-geometry storm | GPUs drained repeatedly | Rate-limit geometry changes; treat layout as semi-static per node pool |
| Preemption storm | Oscillating preempt/resume | Hysteresis + minimum-run-time guarantee before a job is preemptible |
| Scheduler scale | Slow decisions at thousands of pending pods | Caching, batch admission, per-pool sharding |

HA notes

The control plane is off the data path: a scheduler outage must not kill running jobs. Running pods keep computing on their bound GPUs; only new scheduling pauses. Use a leader-elected scheduler with persisted quota and PodGroup state so a failover resumes cleanly.

Scaling the scheduler

Matching thousands of pending pods against topology constraints is combinatorially expensive. Use a scheduling framework with a cached cluster snapshot, batch admission (admit a gang as a unit, not pod-by-pod), and per-pool sharding so independent node pools schedule in parallel.

Final trade-off matrix

| Knob | Pushes toward | At the cost of |
| --- | --- | --- |
| MIG over MPS | Isolation, no noisy neighbor | Throughput tax, re-geometry cost |
| MPS over MIG | Throughput, packing density | Shared fault risk (trusted only) |
| Strict topology | Communication efficiency (MFU) | Fewer placements, higher queue time |
| Loose topology | Utilization, lower queue time | Slower steps for comm-bound jobs |
| Aggressive preemption | Inference p99 | Lost training progress |
| Lazy preemption | Training throughput | Risk to inference SLO |

Every knob moves the same four forces — utilization ↔ isolation ↔ gang feasibility ↔ SLO — and the right setting depends on tenant trust and workload mix, not on a universal default.

Summary

Four things separate Staff from Senior on this problem:

1. Picks the partitioning primitive by trust/isolation/throughput regime: MIG for hard multi-tenant isolation, MPS for trusted same-team inference, time-slicing only for dev. Accounts for the fractioning throughput tax and the MIG re-geometry cost. Not “just use MIG.”

2. Gets gang + quota + preemption correct under contention: PodGroup/minMember with no partial gangs, reclaimable over-quota borrowing, checkpoint-then-evict instead of kill, and bounded blast radius (smallest-sufficient victim, borrowers first). Not “higher priority wins.”

3. Makes placement topology-aware with real numbers: ~900 GB/s per-GPU NVLink vs ~50 GB/s per-GPU cross-node IB (one 400G NDR NIC/GPU), an order-of-magnitude gap. Knows a TP group must stay on one NVLink island while DP can span nodes, and that MIG slices don’t do NVLink P2P.

4. Measures the right thing: SM_ACTIVE / MFU + allocatable-vs-used + fragmentation, never nvidia-smi “util %”. Ties every decision back to raising real utilization from ~15–20% toward 60%+ without breaking inference p99.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing & SLO classes | Separates training from inference; sets a utilization target | Defines explicit SLO classes (latency-critical inference, best-effort batch, preemptible training), ties them to priority/preemption policy, and frames success as raising real allocatable utilization (~15-20% → 60%+) without breaking inference p99 |
| Partitioning primitive choice | Knows MIG, MPS, and time-slicing exist and roughly what they do | Maps each primitive to a trust/isolation/throughput regime: MIG for hard multi-tenant isolation (hardware-partitioned SMs+memory, ~no noisy neighbor), MPS for trusted same-team inference (spatial, no fault isolation), time-slicing only for dev. Accounts for the fractioning throughput tax and MIG reconfig cost |
| MIG geometry & bin-packing | Treats a GPU as N interchangeable slices | Reasons about valid MIG geometries (A100: 7 compute + 8 memory slices; profiles 1g.5gb…7g.40gb; not all combinations valid), packs jobs to minimize fragmentation, and treats re-partitioning as a disruptive, scheduled operation |
| Gang scheduling & quota/borrowing | Mentions all-or-nothing for distributed jobs | Specifies PodGroup/minMember semantics, reserves to avoid driver/worker deadlock, designs hierarchical quota with reclaimable over-quota borrowing, and prevents fragmentation-induced gang starvation |
| Preemption & checkpointing | Says higher priority preempts lower | Designs checkpoint-then-evict for training, bounds preemption blast radius (don't evict a whole 256-GPU gang for one inference pod), makes borrowed capacity reclaimable, and reasons about preemption cost vs SLO gain |
| Topology-aware placement | Tries to co-locate a job's GPUs on one node | Constrains tensor-parallel groups to a single NVLink island (~900 GB/s per GPU intra-node vs ~50 GB/s per GPU over cross-node IB), uses DRA NVLink-clique labels / GFD, and trades packing density against communication cost explicitly |
| Isolation, observability & failure modes | Adds dashboards for GPU utilization | Names noisy-neighbor failure modes per primitive, measures the right SLI (SM_ACTIVE/MFU + allocatable-vs-used, not nvidia-smi 'util %'), and designs against OOM/runaway-kernel blast radius with the chosen isolation tier |