Design a Fractional / Multi-Tenant GPU Sharing Platform
A GPU control plane is what companies like NVIDIA (Run:ai), CoreWeave, Lambda, and Together AI build and interview on: take a fleet of expensive accelerators that typically sit at ~15-20% real utilization, and drive it up by spatially partitioning GPUs and packing many tenants onto them — without one job's OOM or runaway kernel starving its neighbor. The Staff bar is reasoning about the gang-scheduling-vs-fragmentation tension, when MIG's hardware isolation is worth its throughput tax versus cheap-but-leaky MPS, and how to preempt a 500-GPU training run for a latency-critical inference burst without losing days of work.
Frame — the utilization crisis & who's asking
“A $30k/yr GPU at 15% utilization is ~$25k/yr of waste — the whole job is to safely raise that number without breaking inference p99.”
GPU sharing is a control-plane problem dressed up as a hardware problem. The fleet is the most expensive line item in the company, and most of it is idle most of the time. The job is to drive real utilization up by spatially partitioning GPUs and packing many tenants onto them — while guaranteeing that one tenant’s OOM, runaway kernel, or 3 a.m. training crash never starves the inference replica sitting next to it.
The numbers that motivate the whole design. Most GPU clusters run at roughly 14–20% average real utilization while the teams who own them perceive something closer to 60%. Industry surveys (e.g. the State of AI Infra reports) find that ~83% of organizations admit they underutilize their AI hardware. The gap between perceived and real is the prize: closing it is worth millions per year on a large fleet.
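The waste arithmetic behind these numbers can be sketched in a few lines. This is a minimal illustration, assuming the $30k/yr per-GPU cost and 15% utilization quoted above; the 1,000-GPU fleet size and the `annual_waste` helper are hypothetical.

```python
def annual_waste(cost_per_gpu_year: float, utilization: float, gpus: int = 1) -> float:
    """Dollars per year spent on GPU time that does no useful work."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return cost_per_gpu_year * (1.0 - utilization) * gpus


# One GPU at the figures quoted in the text: $30k/yr at 15% real utilization.
per_gpu = annual_waste(30_000, 0.15)

# Hypothetical 1,000-GPU fleet, before and after raising utilization to 60%.
fleet_before = annual_waste(30_000, 0.15, gpus=1_000)
fleet_after = annual_waste(30_000, 0.60, gpus=1_000)
recovered = fleet_before - fleet_after
```

Under these assumptions the per-GPU waste is about $25.5k/yr, and moving the hypothetical fleet from 15% to 60% recovers roughly $13.5M/yr.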
Who asks this & what they probe
Lead with what transfers, then go deep on the GPU physics. A switcher should say: “I know container scheduling, hierarchical quota, and bin-packing from Borg. The new surface is that a GPU is not a bag of CPU millicores. It has MIG slice geometry (7 compute + 8 memory units, not arbitrary fractions), gang/all-or-nothing semantics, NVLink topology that changes job throughput by an order of magnitude depending on placement, and preempt-with-checkpoint instead of just killing a pod.”
SLO classes drive every policy
Define three classes up front, because they determine priority and preemption:
1. Latency-critical inference — has a p99 SLO, never preemptible, may cause preemption.
2. Best-effort / interactive — notebooks, dev, evals; tolerates queuing.
3. Preemptible batch training — long-running, checkpointable, first to be preempted.
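One way to encode the three classes is a small policy table. This is a sketch, not a scheduler API: the numeric priorities are illustrative, and treating best-effort jobs as evictable (after batch training) is an assumption drawn from "first to be preempted" above.

```python
from dataclasses import dataclass
from enum import Enum


class SLOClass(Enum):
    LATENCY_CRITICAL = "latency-critical-inference"
    BEST_EFFORT = "best-effort-interactive"
    PREEMPTIBLE_BATCH = "preemptible-batch-training"


@dataclass(frozen=True)
class ClassPolicy:
    priority: int      # higher wins admission; values are illustrative
    preemptible: bool  # may jobs of this class be evicted?
    may_preempt: bool  # may jobs of this class trigger eviction of others?


POLICY = {
    SLOClass.LATENCY_CRITICAL: ClassPolicy(priority=100, preemptible=False, may_preempt=True),
    # Assumption: best-effort work is evictable, but only after batch training.
    SLOClass.BEST_EFFORT: ClassPolicy(priority=50, preemptible=True, may_preempt=False),
    SLOClass.PREEMPTIBLE_BATCH: ClassPolicy(priority=10, preemptible=True, may_preempt=False),
}


def preemption_order() -> list[SLOClass]:
    """Evictable classes, cheapest-to-evict first (lowest priority first)."""
    evictable = [c for c in SLOClass if POLICY[c].preemptible]
    return sorted(evictable, key=lambda c: POLICY[c].priority)
```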
The central tension
Four forces pull against each other, and naming them up front is the Staff signal:
- Utilization wants tight packing and aggressive GPU sharing.
- Isolation wants tenants kept apart (no noisy neighbor).
- Gang feasibility wants large contiguous, topology-aligned blocks free at once.
- SLO wants the ability to preempt fast for an inference burst.
Every design knob below moves these four. There is no globally correct setting — only the right setting for a given tenant-trust and workload mix.
Scope. This is the control plane + scheduler. It is not the serving engine (batching, KV-cache, speculative decoding — that’s a model-serving design) and not a generic CPU/memory container scheduler (that’s a Borg/task-scheduler design). We assume containers, networking, and storage exist; we own which accelerator each job lands on and when.
Requirements & scale envelope
Functional requirements
- Submit jobs of arbitrary GPU shape: a fraction of a GPU, 1, 8, or 256 GPUs.
- Per-team quota with borrowing of idle capacity above quota.
- Gang admission: all-or-nothing for distributed jobs.
- Preempt-for-inference: reclaim training capacity for latency-critical bursts.
- Topology constraints: keep a tensor-parallel group on one NVLink island.
- Observability: surface real utilization and fragmentation per tenant.
Non-functional requirements
- Scheduling decision latency: inference admission p99 under ~1–2s; batch may queue for minutes.
- Fairness across tenants; no single tenant monopolizes borrowed idle capacity.
- No partial gangs; bounded preemption blast radius.
- HA control plane: a scheduler outage must not kill running jobs.
Scale envelope
Job-shape examples
Fractioning is not free. Splitting a GPU into 7 MIG slices yields 7 isolated instances, but each gets only its own share of the SMs and of HBM capacity and bandwidth. For bandwidth-bound or large-kernel workloads, aggregate throughput across the 7 slices is less than that of one undivided GPU. Fractioning buys isolation and packing density and pays a throughput tax, so quantify the tax per workload rather than assuming sharing is always a win.
Explicitly out of scope: the inference batching / KV-cache engine, model-weights storage, and the data pipeline.
Estimation — partitioning primitives: MIG vs MPS vs time-slicing
This is the first irreversible choice, so we estimate the cost of each primitive before picking. There are exactly three ways to put more than one job on a single physical GPU.
MIG (Multi-Instance GPU) hardware-partitions one GPU into independent instances, each with dedicated SMs, L2 slices, and memory channels. On an A100-40GB the budget is 7 compute slices + 8 memory slices, exposed as profiles 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb. H100-80GB is analogous (1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, 7g.80gb). Because the partition is in silicon, an OOM or a runaway kernel in one instance cannot touch another. This is the strongest isolation of the three, with near-zero noisy-neighbor effect.
MPS (Multi-Process Service) lets multiple processes submit kernels to one GPU concurrently with no context-switch overhead — great for packing many small kernels that each underfill the GPU. On Volta+ each client gets its own GPU address space (some memory protection), but there is no fault isolation: one client’s fatal fault or runaway can crash co-tenants sharing the MPS daemon. Use it only for trusted, same-team inference.
Time-slicing round-robins the whole GPU between processes in time. It isolates neither memory nor faults and is pure oversubscription. It’s fine for dev/CI where bursty, low-duty-cycle jobs rarely collide — and dangerous for anything else.
When to use which
- MIG → hard multi-tenant isolation: different teams sharing one GPU, or any untrusted co-tenancy.
- MPS → trusted, co-located inference that needs many concurrent small kernels (same team, same trust domain).
- Time-slicing → dev / CI / interactive only.
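The decision rule above can be written as a small function. This is a sketch; the workload labels and the `choose_primitive` name are hypothetical, and falling back to MIG for anything unrecognized is an assumption (default to the strongest isolation).

```python
from enum import Enum


class Primitive(Enum):
    MIG = "mig"
    MPS = "mps"
    TIME_SLICING = "time-slicing"


def choose_primitive(same_trust_domain: bool, workload: str) -> Primitive:
    """Pick a sharing primitive from tenant trust and workload type.

    `workload` uses hypothetical labels: 'inference', 'dev', 'ci', 'interactive'.
    """
    if not same_trust_domain:
        # Different teams or untrusted code: hardware isolation is required.
        return Primitive.MIG
    if workload in {"dev", "ci", "interactive"}:
        return Primitive.TIME_SLICING
    if workload == "inference":
        return Primitive.MPS
    # Assumption: unknown workloads get the strongest isolation.
    return Primitive.MIG
```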
Staff nuance — MIG geometry is a scheduling cost, not a free knob. A MIG layout is set at the GPU level, and changing it requires draining every instance on that GPU and re-partitioning — disruptive to anything running. So “re-geometry this GPU from 7×1g to 1×7g” is a scheduled operation with a cost, not something the bin-packer does per job. And not all geometries are valid: you can’t freely mix arbitrary slice sizes; the scheduler must pick from the supported profile layouts for that GPU.
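A first-pass validity check for a proposed layout can be sketched against the 7-compute / 8-memory slice budget. The slice counts per profile below follow the A100-40GB profile names; the `fits_slice_budget` helper is hypothetical. Passing this check is necessary but not sufficient: the driver additionally restricts where each profile may be placed, and that is not modeled here.

```python
# (compute slices, memory slices) consumed by each A100-40GB profile.
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_BUDGET = 7
MEMORY_BUDGET = 8


def fits_slice_budget(layout: list[str]) -> bool:
    """True if the profiles fit within the GPU's compute and memory slice budget.

    Raises KeyError for a profile name that is not in the table above.
    """
    compute = sum(A100_40GB_PROFILES[p][0] for p in layout)
    memory = sum(A100_40GB_PROFILES[p][1] for p in layout)
    return compute <= COMPUTE_BUDGET and memory <= MEMORY_BUDGET
```

For example, `["3g.20gb", "3g.20gb", "1g.5gb"]` uses exactly 7 compute slices but needs 9 memory slices, so it is rejected even though the compute budget looks fine.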
API design — scheduler architecture & quota/borrowing
The request contract
A job declares its shape and class; the control plane decides where and when. With Dynamic Resource Allocation (DRA), GA in Kubernetes 1.34, a job can request structured resources instead of an opaque integer GPU count.
A small inference replica instead asks for a slice. The replica itself is never preemptible, but any capacity it holds above its quota (borrowed from the idle fleet) can be reclaimed from it.
Build vs buy
The mainstream 2026 answer is Kubernetes-native: a gang scheduler (Volcano, Kueue, or NVIDIA’s KAI Scheduler) plus DRA for structured GPU claims. A hyperscaler with extreme scale may run a Borg-like custom scheduler instead. We’ll describe the K8s+DRA shape; the concepts port to custom.
Components
Hierarchical quota & borrowing
Quota is a tree: org → team → project. Each node has a guaranteed quota (always honored) plus an over-quota borrowing weight (share of idle fleet capacity it may temporarily use).
The key invariant: borrowed capacity is reclaimable. When the rightful owner submits, the scheduler preempts the borrower — so borrowing is always safe to grant and never a liability.
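A minimal sketch of guaranteed quota plus weighted borrowing, flattened to one level of the tree. The `Team` fields and `allocate` function are hypothetical names, and the single-pass split (leftover idle capacity is not redistributed) is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    guaranteed: int  # GPUs always honored
    weight: float    # over-quota borrowing weight
    demand: int      # GPUs currently requested


def allocate(teams: list[Team], fleet_gpus: int) -> tuple[dict[str, int], dict[str, int]]:
    """Return (guaranteed grants, borrowed grants) per team name."""
    # Step 1: honor guaranteed quota, capped by what each team actually wants.
    base = {t.name: min(t.demand, t.guaranteed) for t in teams}
    idle = fleet_gpus - sum(base.values())
    if idle < 0:
        raise ValueError("guaranteed quotas oversubscribe the fleet")

    # Step 2: split idle capacity by weight among teams that still want more.
    borrowed = {t.name: 0 for t in teams}
    hungry = [t for t in teams if t.demand > base[t.name] and t.weight > 0]
    total_weight = sum(t.weight for t in hungry)
    for t in hungry:
        share = int(idle * t.weight / total_weight)
        borrowed[t.name] = min(share, t.demand - base[t.name])
    return base, borrowed
```

Re-running `allocate` after an owner raises its demand shrinks `idle`, which shrinks every borrowed grant. The difference between the old and new borrowed grants is what the scheduler would reclaim by preemption.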
Bin-packing objective
Minimize GPU fragmentation — don’t leave a 6g hole no job fits into — while honoring topology. These two goals conflict: the tightest pack may scatter a gang across nodes, and the most topology-clean placement may strand capacity. State the tradeoff explicitly; it’s a tunable objective, not a fixed rule.
DRA is the enabler here: a job can say “4 H100 in the same NVLink clique, ≥80GB HBM each” as a structured claim rather than “4 GPUs,” so the scheduler reasons about topology natively.
Data model — gang scheduling for distributed jobs
Why all-or-nothing
A tensor-parallel or data-parallel training step needs every rank running simultaneously: rank 3 can’t complete the all-reduce until every other rank in the group is alive. If the scheduler admits 200 of 256 pods, those 200 GPUs sit idle and reserved, burning money while their pods spin waiting for the missing 56. Gang scheduling prevents exactly this: nothing starts until the whole group can.
PodGroup / minMember
Volcano, Kueue, and KAI model this with a PodGroup carrying minMember:
minMember: 256 means the scheduler reserves 256 slots atomically; if it can’t, zero pods bind and the gang waits as a unit.
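The all-or-nothing rule can be sketched as a function that either returns a complete binding plan or nothing. This is an illustration of the semantics, not the Volcano, Kueue, or KAI implementation; `admit_gang` and its arguments are hypothetical, and topology is ignored here.

```python
from __future__ import annotations


def admit_gang(free_slots: dict[str, int], min_member: int) -> dict[str, int] | None:
    """Return node -> number of pods to bind if the whole gang fits, else None.

    Never returns a partial plan: either all `min_member` pods are placed or none.
    """
    if sum(free_slots.values()) < min_member:
        return None  # the gang waits as a unit; zero pods bind

    plan: dict[str, int] = {}
    remaining = min_member
    # Fill the emptiest nodes first so the gang spans as few nodes as possible.
    for node, free in sorted(free_slots.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    return plan
```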
Failure modes to design against
- Driver/worker deadlock. A classic Spark-style trap: a coordinator pod grabs a GPU and waits for workers it can never schedule because it is holding the last slot. Fix: reserve minResources for the whole group so partial acquisition can't happen, plus a timeout to release a gang stuck reserving.
- Fragmentation starvation. Many small jobs scatter across nodes and leave no contiguous, topology-aligned block for a 256-GPU gang — so large gangs starve indefinitely even though total free GPUs exceed 256. Mitigate with reservation + backfill and periodic defragmentation.
- Backfill. While accumulating slots for a big gang, run short, evictable best-effort jobs in the gaps so the reserved-but-not-yet-full capacity still does useful work — then evict them the instant the gang can land.
minMember alone is necessary but not sufficient: the 256 must land on the right NVLink/IB topology, which is the next step.
High-level architecture — topology-aware placement & NVLink islands
The bandwidth hierarchy
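Placement matters because the interconnect is wildly non-uniform: roughly 900 GB/s per GPU over NVLink inside a node versus roughly 50 GB/s per GPU over cross-node InfiniBand (one 400G NDR NIC per GPU).
That’s roughly an order-of-magnitude (~15–18x) gap per GPU. Where a job’s GPUs sit changes its throughput dramatically.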
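To see what the gap means for a collective, here is a bandwidth-only estimate of ring all-reduce time using the per-GPU figures this walkthrough quotes (~900 GB/s NVLink, ~50 GB/s cross-node IB). The 1 GB payload and 8 ranks are hypothetical, and the model ignores latency, protocol overhead, and overlap with compute.

```python
NVLINK_GB_S = 900.0  # per-GPU NVLink bandwidth quoted in the text
IB_GB_S = 50.0       # per-GPU cross-node InfiniBand bandwidth quoted in the text


def ring_allreduce_seconds(payload_gb: float, ranks: int, bandwidth_gb_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each rank moves 2*(N-1)/N of the payload."""
    if ranks < 2:
        return 0.0
    return 2.0 * (ranks - 1) / ranks * payload_gb / bandwidth_gb_s


gap = NVLINK_GB_S / IB_GB_S

# Hypothetical: all-reduce 1 GB across 8 ranks on each fabric.
t_nvlink = ring_allreduce_seconds(1.0, 8, NVLINK_GB_S)
t_ib = ring_allreduce_seconds(1.0, 8, IB_GB_S)
```

Because the payload and rank count cancel, the ratio `t_ib / t_nvlink` equals the bandwidth gap in this model, which is why a collective that runs every layer is so sensitive to which fabric it lands on.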
Placement rules
- Tensor-parallel (TP) groups must stay inside one NVLink island (one node). TP exchanges activations every layer — it is the most communication-heavy parallelism. Spanning a TP group across nodes drops it from ~900 GB/s onto ~50 GB/s per-GPU IB and tanks step throughput.
- Data-parallel (DP) replicas can span nodes. DP syncs gradients once per step (less frequent), so the cross-node penalty is tolerable.
- For a 70B TP-8 job: pin all 8 ranks to one node's NVSwitch domain. For a DP-over-TP job: each TP-8 group on its own node, DP across nodes over IB.
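These rules can be sketched as a placement function for a DP-over-TP job: every TP group must fit entirely inside one node, and the job is placed only if all DP replicas can be. The function name and inputs are hypothetical; a node is treated as one NVLink island.

```python
from __future__ import annotations


def place_tp_groups(
    free_gpus_per_node: dict[str, int], tp_size: int, dp_replicas: int
) -> list[str] | None:
    """Return one node name per TP group, or None if the job cannot be placed.

    A TP group is never split across nodes; DP replicas may land on different nodes.
    """
    slots: list[str] = []
    for node, free in sorted(free_gpus_per_node.items()):
        # A node can host as many whole TP groups as fit in its free GPUs.
        slots.extend([node] * (free // tp_size))
    if len(slots) < dp_replicas:
        return None  # all-or-nothing: no partial placement
    return slots[:dp_replicas]
```

Note that `{"a": 6, "b": 6, "c": 4}` has 16 free GPUs in total but cannot host even one TP-8 group, which is the fragmentation-starvation case described earlier.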
How the scheduler knows the topology
DRA + GPU Feature Discovery labels nodes with nvidia.com/gpu.clique (NVLink Domain + Clique ID). The scheduler uses node affinity / structured claims to keep a gang within one clique. For DP gangs that must cross nodes, use rail-optimized placement: keep each cross-node all-reduce on the same IB rail / leaf switch so traffic avoids oversubscribed spine hops.
The tradeoff to name
Strict topology constraints reduce the set of valid placements, which can lower utilization and raise queue time — sometimes a gang waits for the right island while the wrong islands sit idle. Packing density vs communication efficiency is a real dial, and the right setting depends on how comm-bound the workload is.
MIG ≠ NVLink. MIG instances do not participate in NVLink P2P the way full GPUs do. So MIG is for independent small jobs, never for a TP group that needs NVLink — a TP group needs whole GPUs on one island.
Deep dive — preemption, isolation & noisy-neighbor defense
This is where Staff is won. Two hard problems live here: preempting a giant training run for an inference burst without losing days of work, and choosing the isolation tier that contains the right blast radius per tenant.
Preempt training for inference — checkpoint-then-evict
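When a latency-critical inference burst arrives and the fleet is full of preemptible training, the naive move (kill a training pod) throws away hours or days of progress. The correct flow is checkpoint-then-evict: signal the job to checkpoint, wait up to a bounded grace period for the checkpoint to complete, then evict the pod so the job can later resume from that checkpoint.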
The difference between checkpoint-then-evict and kill is the difference between losing minutes (the time since the last checkpoint) and losing days. A grace period bounds how long we wait for the checkpoint before forcing eviction.
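A minimal sketch of checkpoint-then-evict with a bounded grace period. The three callbacks are hypothetical hooks into the job and the cluster; `clock` and `sleep` are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


def checkpoint_then_evict(
    request_checkpoint: Callable[[], None],
    checkpoint_done: Callable[[], bool],
    evict: Callable[[], None],
    grace_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Ask the job to checkpoint, wait up to the grace period, then evict.

    Returns True if the checkpoint finished before eviction, False if the
    grace period expired and the eviction was forced.
    """
    request_checkpoint()
    deadline = clock() + grace_seconds
    saved = checkpoint_done()
    while not saved and clock() < deadline:
        sleep(poll_seconds)
        saved = checkpoint_done()
    evict()  # always evict: the grace period bounds how long inference waits
    return saved
```

The return value lets the caller record whether the job lost only the time since its last checkpoint or was forced out mid-save.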
Bound the blast radius
The cardinal sin is evicting a whole 256-GPU training gang to free one MIG slice for one inference pod. Rules:
- Smallest-sufficient victim. Free exactly the capacity needed, from the cheapest-to-reclaim source.
- Reclaim borrowers first. Capacity borrowed over-quota is the first to go — that's the contract that made borrowing safe.
- Gang-granularity preemption only when necessary. Because a gang is all-or-nothing, preempting any member kills the whole gang — so only preempt a gang when no smaller victim suffices, and pick the smallest / lowest-priority gang.
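The three rules can be combined into a greedy victim picker. This is a heuristic sketch, not an optimal solver: candidates are ordered borrowers first, then lower priority, then smaller size, and a final pass drops any pick that turned out to be unnecessary. The `Victim` fields are hypothetical; for a gang, `gpus` is the whole gang because evicting one member evicts all of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Victim:
    name: str
    gpus: int       # capacity freed if evicted (the whole gang for gang jobs)
    borrowed: bool  # running on over-quota (borrowed) capacity
    priority: int   # lower means cheaper to evict


def pick_victims(candidates: list[Victim], needed: int) -> list[Victim]:
    """Return a set of victims freeing at least `needed` GPUs, or [] if impossible."""
    ordered = sorted(candidates, key=lambda v: (not v.borrowed, v.priority, v.gpus))
    chosen: list[Victim] = []
    freed = 0
    for v in ordered:
        if freed >= needed:
            break
        chosen.append(v)
        freed += v.gpus
    if freed < needed:
        return []
    # Trim: drop any pick whose capacity is no longer needed, least preferred first.
    for v in reversed(list(chosen)):
        if freed - v.gpus >= needed:
            chosen.remove(v)
            freed -= v.gpus
    return chosen
```

With one 2-GPU borrower, one 4-GPU batch job, and a 256-GPU gang as candidates, a request for 1 GPU evicts only the borrower; the gang is touched only when nothing smaller suffices.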
Preemption is a policy threshold, not a reflex
Preemption cost (lost training progress, restart/warmup time) must be weighed against the inference SLO gain. If the inference burst can be served from idle or borrowed capacity, don’t preempt training at all. “Always preempt for higher priority” is the Senior answer; “preempt when the SLO gain exceeds the progress-loss cost, and prefer non-disruptive sources first” is the Staff answer.
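The threshold can be sketched as a comparison between progress lost and SLO value gained. Expressing both sides in GPU-hours is an assumption made here to get a common unit; how an operator prices an SLO violation is outside this sketch, and both function names are hypothetical.

```python
def progress_loss_gpu_hours(
    gang_gpus: int, minutes_since_checkpoint: float, restart_minutes: float
) -> float:
    """GPU-hours thrown away by evicting a gang: redone work plus restart/warmup."""
    return gang_gpus * (minutes_since_checkpoint + restart_minutes) / 60.0


def should_preempt(
    needed_gpus: int,
    non_disruptive_gpus: int,
    slo_gain_gpu_hours: float,
    loss_gpu_hours: float,
) -> bool:
    """Preempt training only if no non-disruptive source suffices and the gain wins.

    `non_disruptive_gpus` is capacity available without evicting training,
    such as idle GPUs.
    """
    if non_disruptive_gpus >= needed_gpus:
        return False  # serve the burst without touching training
    return slo_gain_gpu_hours > loss_gpu_hours
```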
Noisy-neighbor failure modes per primitive
The isolation tier you chose when picking the partitioning primitive (MIG vs MPS vs time-slicing) determines what a misbehaving tenant can do to its neighbors:
- Time-slicing isolates neither memory nor faults: one client's OOM can crash everyone sharing the GPU. It is only safe when every process on the GPU belongs to the same tenant.
- MPS gives memory address-space separation but no fault isolation: one client's fatal fault can take down co-clients on the same MPS daemon. Enforce per-client memory limits and accept the shared-fault risk only within one trust domain.
- MIG contains an OOM or runaway kernel to its hardware slice — the strongest, and the right choice whenever tenants don't trust each other.
Choose the isolation tier by tenant trust. Same team, cooperative inference → MPS is fine and cheaper. Different teams or untrusted code → MIG, and pay its throughput tax to buy hardware isolation.
Rollout — observability: measuring real utilization
You cannot raise utilization you can’t measure, and the default metric lies — so observability is the first thing to roll out, before any packing aggressiveness.
Why nvidia-smi lies
nvidia-smi “GPU util %” reports the fraction of the sample window during which at least one kernel was executing, regardless of how many SMs that kernel occupied. A single tiny kernel running continuously on one SM therefore reads as ~100% busy. Optimizing against it tells you a GPU handed out to a near-idle notebook is “100% utilized.” It is the single most misleading number in the stack; do not put it on the dashboard that drives decisions.
The right SLIs
Stack: DCGM-exporter → Prometheus → Grafana, with per-tenant GPU-hour accounting for chargeback.
Two utilization numbers, always reported together
- Allocation utilization — are GPUs handed out to tenants?
- Hardware utilization — are the handed-out GPUs actually computing (SM_ACTIVE/MFU)?
The gap between them is idle-but-reserved waste — capacity a tenant holds but doesn’t use. That gap, not the headline number, is where the recoverable utilization lives.
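The two numbers and the gap between them can be computed together. This sketch assumes one averaged SM_ACTIVE reading (0.0 to 1.0) per allocated GPU over the reporting window; the function and key names are hypothetical.

```python
from statistics import fmean


def utilization_report(total_gpus: int, sm_active_by_allocated_gpu: list[float]) -> dict[str, float]:
    """Allocation vs hardware utilization, plus the idle-but-reserved gap."""
    allocated = len(sm_active_by_allocated_gpu)
    allocation_util = allocated / total_gpus
    hardware_util = fmean(sm_active_by_allocated_gpu) if allocated else 0.0
    return {
        "allocation_util": allocation_util,
        "hardware_util": hardware_util,
        # GPU-equivalents that are handed out but not computing.
        "idle_but_reserved_gpus": allocated * (1.0 - hardware_util),
        # Fraction of the whole fleet that is actually computing.
        "fleet_effective_util": allocation_util * hardware_util,
    }
```

A fleet that is 80% allocated with 40% average SM_ACTIVE on the allocated GPUs is only 32% effectively utilized, which is the kind of gap the headline allocation number hides.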
Fragmentation metric
Track free-but-unschedulable GPU/slice count: capacity that physically exists but no pending job fits, due to MIG geometry or topology constraints. This is hidden waste — the fleet looks full but isn’t.
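A simplified version of this metric for whole GPUs: count free GPUs sitting on islands too small for any pending job. The function name and inputs are hypothetical; each pending shape is the number of GPUs a job needs on a single island, and MIG geometry and leftover capacity after a placement are not modeled.

```python
def unschedulable_free_gpus(free_per_island: dict[str, int], pending_shapes: list[int]) -> int:
    """Free GPUs that no pending job can use because every pending job is too big."""
    if not pending_shapes:
        return 0  # nothing is waiting, so nothing is stranded by definition
    smallest = min(pending_shapes)
    return sum(free for free in free_per_island.values() if 0 < free < smallest)
```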
Alerting
- Idle hoarding — a tenant holding allocations at low SM_ACTIVE for a sustained window → notify and/or reclaim.
- Sustained fragmentation → trigger defragmentation or a (rate-limited) MIG re-geometry.
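The idle-hoarding rule can be sketched as a run-length check over SM_ACTIVE samples. The 10% threshold and six-sample window are placeholder values, not recommendations, and the function name is hypothetical.

```python
def is_idle_hoarding(
    sm_active_samples: list[float], threshold: float = 0.10, min_consecutive: int = 6
) -> bool:
    """True if SM_ACTIVE stayed below `threshold` for `min_consecutive` samples in a row."""
    run = 0
    for sample in sm_active_samples:
        run = run + 1 if sample < threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring a consecutive run rather than a low average avoids flagging bursty jobs that idle briefly between batches.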
Bottlenecks — failure modes, scaling & trade-off recap
Failure modes
HA notes
The control plane is off the data path: a scheduler outage must not kill running jobs. Running pods keep computing on their bound GPUs; only new scheduling pauses. Use a leader-elected scheduler with persisted quota and PodGroup state so a failover resumes cleanly.
Scaling the scheduler
Matching thousands of pending pods against topology constraints is combinatorially expensive. Use a scheduling framework with a cached cluster snapshot, batch admission (admit a gang as a unit, not pod-by-pod), and per-pool sharding so independent node pools schedule in parallel.
Final trade-off matrix
Every knob moves the same four forces — utilization ↔ isolation ↔ gang feasibility ↔ SLO — and the right setting depends on tenant trust and workload mix, not on a universal default.
Summary
Four things separate Staff from Senior on this problem:
1. Picks the partitioning primitive by trust/isolation/throughput regime: MIG for hard multi-tenant isolation, MPS for trusted same-team inference, time-slicing only for dev. Accounts for the fractioning throughput tax and the MIG re-geometry cost. Not “just use MIG.”
2. Gets gang + quota + preemption correct under contention: PodGroup/minMember with no partial gangs, reclaimable over-quota borrowing, checkpoint-then-evict instead of kill, and bounded blast radius (smallest-sufficient victim, borrowers first). Not “higher priority wins.”
3. Makes placement topology-aware with real numbers: ~900 GB/s per-GPU NVLink vs ~50 GB/s per-GPU cross-node IB (one 400G NDR NIC/GPU), an order-of-magnitude gap. Knows a TP group must stay on one NVLink island while DP can span nodes, and that MIG slices don’t do NVLink P2P.
4. Measures the right thing: SM_ACTIVE / MFU + allocatable-vs-used + fragmentation, never nvidia-smi “util %”. Ties every decision back to raising real utilization from ~15–20% toward 60%+ without breaking inference p99.
Rubric — Senior vs Staff