Design an LLM Pretraining Data Pipeline

A Staff-level walkthrough of the LLM pretraining data pipeline that Anthropic, OpenAI, and Google DeepMind run to produce frontier training corpora. It is an offline, reproducible, ablation-driven batch system whose core hard problems are trillion-scale fuzzy dedup, learned quality filtering, rigorous benchmark decontamination, deliberate data mixing/curriculum, and a tokenized-shard contract that never starves the training cluster — with every curation choice proven by downstream evals rather than perplexity.

Level: Staff
Category: AI Infrastructure · Data Curation
Interview time: 60 min

WHAT THIS QUESTION TESTS

· Trillion-scale dedup: exact/Bloom first pass then MinHash-LSH + union-find connected components
· Filtering hierarchy: cheap heuristics then a learned model-based quality classifier
· Benchmark decontamination via n-gram (13-gram) overlap against eval sets
· Tokenize + global shuffle into packed mmap shards under a data-loader SLA that never starves the cluster

★ STAFF-LEVEL SIGNALS

★ Validates every curation choice with downstream-eval ablations, not perplexity or held-out loss
★ Materializes and versions every stage as a content-addressed, reproducible dataset 'recipe'
★ Knows naive cross-snapshot dedup and naive extra 'high-quality' mixing can hurt (FineWeb/DCLM findings)
★ Designs data mixing + end-of-training annealing on top-quality data (DoReMi weights, Llama 3 anneal)

Scope & ambiguity

Let me frame what we’re actually building. The goal is the petabyte-scale offline batch system that turns raw web crawls into a clean, deduplicated, decontaminated, mixed, tokenized multi-trillion-token corpus that a training cluster can stream without ever becoming the bottleneck. This is not a serving system and not the training loop — it’s a one-directional ETL DAG that runs for weeks and produces a versioned dataset artifact. The defining property is that success is measured by downstream model evals via controlled ablations, not by SLAs, row counts, or perplexity. So I’ll treat the dataset as a versioned recipe, and every curation choice I make has to be provable by training a small proxy model and seeing benchmarks move.

This is the kind of system frontier labs — Anthropic, OpenAI, Google DeepMind, Meta — build internally to produce their pretraining corpora, and it’s a common AI-infrastructure interview topic because it sits exactly at the seam between classic distributed systems and ML. The open references that anchor current practice are FineWeb / FineWeb-Edu, DCLM (DataComp-LM), Llama 3’s data section, and HuggingFace’s datatrove pipeline library. I’ll cite them throughout because “I know what’s published” is a real signal here.

Scope boundaries:

In scope: ingestion of raw crawls -> text extraction -> filtering -> dedup -> decontamination -> PII/safety scrub -> data mixing -> tokenization + global shuffle + sharding. The output is a set of tokenized, packed, seekable shards plus a manifest.
Out of scope: the training loop itself, optimizer/parallelism, RLHF / post-training / preference data, and the inference stack. I'll mention the data-loader contract because it constrains the output format, but I won't design the trainer.
Sources: CommonCrawl web snapshots (the bulk), code (GitHub-derived, e.g. The Stack), licensed/high-quality text (books, papers, reference), and increasingly synthetic/rephrased data.
Success criterion: a model trained on this corpus scores better on a held-out downstream eval suite (MMLU, HellaSwag, ARC, GSM8K, HumanEval, etc.) than one trained on the previous recipe, at matched compute.

Who asks this & what they probe

| Role | Focus | What they probe |
| --- | --- | --- |
| SDE | Offline batch/ETL system | Orchestration of an idempotent restartable DAG over Slurm/Ray/Spark; object-store I/O and shard design to avoid hot keys; dedup as a distributed shuffle + union-find; content-addressed intermediates; versioning, lineage, checkpoint-restart on preempted nodes; caching that makes ablations affordable |
| MLE | Data quality as the capability driver | The filtering hierarchy (heuristic vs learned); how you train the quality classifier and pick the percentile cutoff; dedup threshold vs diversity; decontamination rigor; data mixing / DoReMi / curriculum; validating with downstream ablations not perplexity; synthetic-data and filtering-bias risks |
| Switcher (SDE to AI) | ETL instincts transfer; learn the ML stages | Maps Spark/Ray/distributed-systems knowledge directly onto the pipeline, then internalizes the new vocabulary: model-based filtering, MinHash fuzzy dedup, benchmark decontamination, DoReMi mixing, tokenize + global shuffle + packed shards; plus the shift that "correct" means a model trains better, provable only by ablation |

Requirements

Functional requirements:

Ingest raw crawls (CommonCrawl WARC/WAT/WET), code repos, and licensed corpora from object storage, respecting robots/licensing.
Extract clean text from raw HTML (boilerplate removal, main-content extraction).
Filter in a hierarchy: language ID, cheap heuristic/quality rules, then a learned model-based quality classifier.
Deduplicate at url / document / line granularity, including trillion-scale fuzzy (near-duplicate) dedup.
Decontaminate against the benchmark eval suite so reported numbers are trustworthy.
Scrub PII and apply safety filters (CSAM hashes, toxicity gates where appropriate).
Mix sources with deliberate weights and a curriculum (upsampling, annealing).
Tokenize, globally shuffle, and pack into shards.

Non-functional requirements:

Reproducibility & versioning: a recipe + seed must reproduce a byte-identical dataset; every artifact is addressable and pinned.
Lineage / provenance: for any token, I can trace back to the source document, snapshot, and which stages touched it — needed for audits and takedowns.
Compliance: robots.txt and license respect, opt-out/takedown handling, PII removal.
Fault tolerance: runs on preemptible/spot nodes for weeks; stages must be idempotent and checkpoint-restartable.
Cost ceiling: the full pipeline is O(1M–10M CPU-hours); ablations must not require recomputing it, so heavy caching of content-addressed intermediates is mandatory.

The two requirements that define this system:

| Requirement | Why it dominates the design |
| --- | --- |
| Data-loader must never starve 10k+ GPUs | The output contract (seekable, packed, mmap shards) is the real "API"; if the loader stalls, millions of dollars of GPUs idle |
| Every curation choice validated by ablation | "Correctness" is downstream eval movement, not row counts or perplexity; this forces small-proxy-model experiments as a first-class workflow |

Back-of-envelope estimation

Input pool. CommonCrawl publishes a snapshot roughly monthly; there are ~90–100 usable snapshots, each ~7 PB of raw WARC. Naively that’s hundreds of PB of raw bytes, but snapshots overlap heavily. After extraction and first-pass filtering, and before fuzzy dedup, the raw text pool lands around 200–240 trillion tokens. DCLM operates over a pool of ~240T tokens, which is the canonical number to anchor on.

| Stage | Surviving fraction | Resulting scale |
| --- | --- | --- |
| Raw WARC (per snapshot) | n/a | ~7 PB x ~95 snapshots |
| Text extraction (WET-quality) | ~10–20% of bytes | hundreds of TB text/snapshot |
| Language + heuristic filters | keep ~8–15% of bytes | ~200–240T token pool |
| MinHash fuzzy dedup | remove ~50–70% | ~70–120T tokens |
| Model-based quality filter | keep ~top 10% of docs | ~10–30T tokens |
| Final mixed/tokenized corpus | + upsampling/synthetic | ~15–30T tokens |

DCLM-Baseline’s headline result is that aggressive model-based filtering keeping roughly the top ~10% of documents beats much larger unfiltered pools — that’s the “better data beats more data” thesis in one number.

Storage budget:

| Tier | Contents | Size |
| --- | --- | --- |
| Raw | Cached WARC/WET | many PB (often streamed, not all retained) |
| Intermediate | Extracted JSONL, MinHash sigs, scores | 10–50 PB across all stages |
| Final | Tokenized packed shards | ~60–120 TB |

The collapse from tens of PB of intermediates to tens of TB of final shards is the whole point. With a 128k-vocab tokenizer the IDs do not fit in 16 bits, so token IDs are stored as uint32 (4 bytes/token): a 15–30T token corpus is ~60–120 TB — still small enough to replicate near the training cluster and mmap. (A smaller vocab that fit in uint16 would halve this to ~30–60 TB, which is the tradeoff if storage near the cluster is tight.)
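The storage arithmetic above is easy to sanity-check. A minimal sketch, assuming decimal terabytes (1 TB = 1e12 bytes); the function names are illustrative, not part of any real pipeline:

```python
def bytes_per_token(vocab_size: int) -> int:
    """Smallest common unsigned width (2 or 4 bytes) that can hold every token id."""
    # Ids run from 0 to vocab_size - 1, so uint16 works only up to 65,536 entries.
    return 2 if vocab_size <= 2**16 else 4


def corpus_size_tb(num_tokens: float, vocab_size: int) -> float:
    """Size of the packed token ids in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_tokens * bytes_per_token(vocab_size) / 1e12


for tokens in (15e12, 30e12):
    size = corpus_size_tb(tokens, 128_000)
    print(f"{tokens / 1e12:.0f}T tokens @ 128k vocab: {size:.0f} TB")
```

Swapping in a vocabulary of 65,536 entries or fewer drops the width to 2 bytes and halves both figures, which is the uint16 tradeoff mentioned above.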

Compute: O(1M–10M CPU-hours), dominated by extraction and the dedup shuffle, with weeks of wall-clock on a large CPU pool. The quality-classifier inference pass over hundreds of T tokens is the other big cost — it’s why the classifier is kept small (fastText-class).

API design

The “APIs” here are stage contracts, the declarative recipe, the loader contract, and the lineage catalog — not RPC endpoints.

Stage contract — every stage is idempotent: input prefix in object store -> output prefix + manifest.

from typing import Protocol

class Stage(Protocol):
    name: str
    version: str  # pinned; bump => cache invalidation

    def run(self, in_prefix: str, out_prefix: str,
            cfg: "StageConfig") -> "Manifest": ...
    # Idempotent: re-running with the same (inputs, cfg, version)
    # yields content-identical outputs, so it is safe to restart.
    # StageConfig and Manifest are the pipeline's own types (not defined here).
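One way to make "same (inputs, cfg, version) gives the same output" concrete is to derive the output location from a hash of exactly those three things. This is a sketch under my own assumptions: `stage_cache_key`, `out_prefix_for`, and the bucket layout are hypothetical names, not something the referenced tools prescribe.

```python
import hashlib
import json


def stage_cache_key(stage_name: str, stage_version: str,
                    cfg: dict, input_hashes: list[str]) -> str:
    """Derive a deterministic key from everything that can change a stage's output."""
    payload = json.dumps(
        {
            "stage": stage_name,
            "version": stage_version,
            "cfg": cfg,
            "inputs": sorted(input_hashes),  # order of inputs must not matter
        },
        sort_keys=True,             # dict key order must not matter either
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def out_prefix_for(bucket: str, stage_name: str, key: str) -> str:
    """Hypothetical layout: one output prefix per cache key."""
    return f"s3://{bucket}/{stage_name}/{key[:16]}/"


key = stage_cache_key("minhash", "1.2.0", {"perms": 112, "bands": 14}, ["abc", "def"])
print(out_prefix_for("corpus-intermediates", "minhash", key))
```

If the prefix already exists with a complete manifest, the stage can be skipped; bumping the stage version or touching any config value yields a new key, which is the cache invalidation the contract describes.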

Recipe API — the dataset is a declarative config under version control. Changing it and re-running is how you produce a new dataset version.

recipe:
  version: "corpus-2026.06-rc3"
  seed: 1234567
  sources:
    - {name: cc, snapshots: {from: "2024-10", to: "2025-12"}, weight: 0.62}
    - {name: code, repo: the-stack-v2, weight: 0.17}
    - {name: books, license: licensed, weight: 0.06}
    - {name: multilingual, langs: [es,de,zh,...], weight: 0.10}
    - {name: synthetic, gen: rephrase-v2, weight: 0.05}
  extract: {tool: trafilatura}
  filters:
    lang_id: {model: fasttext-lid, min_conf: 0.65}
    heuristics: {gopher: true, c4_rules: true}
    # keep docs scoring at or above the 90th percentile (top ~10%)
    quality: {model: dclm-fasttext, keep_percentile: 0.90}
  dedup:
    exact: {granularity: [url, doc]}
    minhash: {ngram: 5, perms: 112, bands: 14, rows: 8,
              jaccard_thresh: 0.85, scope: per_snapshot}
  decontam: {ngram: 13, eval_suite: "evals-v7"}
  tokenizer: {name: bpe-v5, vocab: 128000}
  shuffle: {global: true}
  shard: {seq_len: 8192, format: packed-mmap}

Loader contract — the only thing the trainer sees. Deterministic, seekable, version-pinned.

# shard_00042.bin  (uint32 token ids, packed to seq_len)
# shard_00042.idx  (offsets => O(1) seek to sample i)

def get(shard_id: int, i: int) -> Tensor[seq_len]: ...  # mmap, zero-copy

# tokenizer hash + recipe version pinned in dataset header;
# resuming a run reproduces the exact sample order from (seed, step).
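To show why fixed-length packing makes seeks O(1), here is a minimal standard-library sketch of a shard writer and reader. It is an illustration under stated assumptions, not the loader itself: `write_shard` and `ShardReader` are hypothetical names, the tiny `SEQ_LEN` is for demonstration, the `.idx` file is omitted because a fixed sample length lets the offset be computed arithmetically, and `struct.unpack_from` copies values out of the mapping, whereas a real trainer would typically wrap the mapped buffer in an array or tensor view to stay zero-copy.

```python
import mmap
import os
import struct
import tempfile

SEQ_LEN = 8        # tiny for illustration; the recipe above uses 8192
BYTES_PER_ID = 4   # uint32


def write_shard(path: str, samples: list[list[int]]) -> None:
    """Write fixed-length samples back to back as little-endian uint32."""
    with open(path, "wb") as f:
        for sample in samples:
            assert len(sample) == SEQ_LEN
            f.write(struct.pack(f"<{SEQ_LEN}I", *sample))


class ShardReader:
    def __init__(self, path: str):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.num_samples = len(self._mm) // (SEQ_LEN * BYTES_PER_ID)

    def get(self, i: int) -> tuple[int, ...]:
        """O(1) seek: with fixed-length samples the offset is pure arithmetic."""
        if not 0 <= i < self.num_samples:
            raise IndexError(i)
        start = i * SEQ_LEN * BYTES_PER_ID
        return struct.unpack_from(f"<{SEQ_LEN}I", self._mm, start)

    def close(self) -> None:
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "shard_00000.bin")
    write_shard(path, [list(range(k, k + SEQ_LEN)) for k in (0, 100, 70_000)])
    reader = ShardReader(path)
    print(reader.num_samples, reader.get(2)[:3])
    reader.close()
```

The third sample uses ids above 65,535 on purpose: they round-trip only because the ids are stored as 4-byte values.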

Lineage / catalog API — provenance and dataset versions as the source of truth.

catalog.resolve("corpus-2026.06-rc3") -> {manifests, source_hashes}
catalog.lineage(doc_id)               -> [snapshot, extract_v, filter_v, dedup_cluster, ...]
catalog.takedown(url_or_hash)         -> affected_versions  # audit/compliance

Data model

The artifact evolves stage-by-stage; each representation is content-addressed (hash of content + producing-stage version) so identical inputs are never recomputed.

WARC (raw HTML)
  -> extracted doc: {doc_id, url, text, lang?, ts, src_snapshot}
  -> filtered doc:  {... , lang, quality_score, heuristic_flags}
  -> minhash sig:   {doc_id, sig[112], band_hashes[14]}
  -> dedup cluster: {cluster_id, member_doc_ids[], keep_doc_id}
  -> decontam flag: {doc_id, contaminated: bool, hit_ngrams[]}
  -> mixed stream:  {doc_id, src, sample_weight, epoch_copies}
  -> packed shard:  uint32[] token ids + .idx offsets

Key stores:

| Store | Purpose | Notes |
| --- | --- | --- |
| Content-addressed object store | All intermediates | Avoids recomputation; enables cheap ablation branches |
| MinHash signature store | Fuzzy dedup | Banded LSH buckets, sharded by band hash |
| Contamination n-gram index | Decontamination | 13-gram set built from the eval suite |
| Versioned manifest/catalog | Source of truth | Maps recipe version -> exact artifact hashes |

Three-granularity dedup (following Llama 3): url-level (drop re-crawls of the same page), document-level (MinHash near-dup), and line-level (strip boilerplate lines — nav bars, cookie banners — that recur across many docs). Each catches a different class of redundancy.
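Line-level dedup is the least familiar of the three, so here is a minimal in-memory sketch of the idea. The function names and the `min_doc_count` threshold are my own illustrative assumptions; at real scale the line counts come from a distributed aggregation over hashed lines rather than a Python `Counter`.

```python
from collections import Counter


def find_boilerplate_lines(docs: list[str], min_doc_count: int) -> set[str]:
    """Lines that appear in at least `min_doc_count` distinct documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each line once per document, so one repetitive page cannot dominate.
        doc_freq.update({line.strip() for line in text.splitlines() if line.strip()})
    return {line for line, n in doc_freq.items() if n >= min_doc_count}


def strip_boilerplate(text: str, boilerplate: set[str]) -> str:
    """Drop every line whose stripped form is in the boilerplate set."""
    kept = [line for line in text.splitlines() if line.strip() not in boilerplate]
    return "\n".join(kept)


docs = [
    "Home | About\nArticle one body.\nAccept cookies",
    "Home | About\nArticle two body.\nAccept cookies",
    "Home | About\nA third page.",
]
boilerplate = find_boilerplate_lines(docs, min_doc_count=2)
print([strip_boilerplate(d, boilerplate) for d in docs])
```

The threshold is itself an ablation knob: set it too low and legitimately common sentences disappear along with the nav bars.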

High-level architecture

A linear DAG. Most stages are embarrassingly parallel per-snapshot/per-shard; two stages — cross-shard dedup and the global shuffle — require global coordination and are where the systems difficulty concentrates.

[ object store: S3 / GCS ]
  -> Ingestion (stream WARC, robots/license gate)
  -> Extraction (Trafilatura: HTML -> clean main text)
  -> Language ID (fastText LID) + heuristic filters (Gopher/C4)
  -> Model-based quality classifier (fastText, keep ~top 10%)
  -> Exact dedup (url/doc) + MinHash fuzzy dedup (banding + union-find)   <== GLOBAL
  -> Decontamination (13-gram vs eval suite)
  -> PII / safety scrub (regex+NER PII, CSAM/toxicity gates)
  -> Data mixing / weighting (DoReMi weights, upsample code/ML)
  -> Tokenize + GLOBAL shuffle + pack   <== GLOBAL
  -> [ tokenized mmap shards + manifest ]

Spanning all stages: manifest/lineage catalog + experiment tracking + per-stage metrics

Execution. Orchestrated by Slurm/Ray/Spark executors over object storage — datatrove is the open reference for exactly this (pluggable executors, streaming readers, per-stage stats). Per-snapshot work fans out across thousands of CPU workers reading WARC ranges directly from S3/GCS. Output is written under content-addressed prefixes so a preempted node’s partial output is simply overwritten on retry without corruption.

Where global coordination is unavoidable: near-duplicate dedup must compare documents across snapshots, so it’s a distributed shuffle (group by LSH band) followed by connected-components/union-find — it cannot stay embarrassingly parallel. The final global shuffle likewise must interleave samples from every source so the trainer never sees long homogeneous runs.

Deep dives

WHERE STAFF IS WON

I’ll go deep on the three hardest subsystems — trillion-scale dedup, learned quality filtering, and data mixing/curriculum — then briefly cover decontamination and the tokenize/shuffle/shard contract. The recurring Staff theme: every choice is validated by downstream ablation, and the naive version of each blows up at scale.

Deep dive A: Trillion-scale fuzzy dedup

Exact dedup is easy: hash documents, drop collisions. The hard problem is near-duplicates — boilerplate-heavy pages, mirrored content, slight reformatting — at hundreds of T tokens. All-pairs comparison is O(n^2) and impossible. The standard solution is MinHash + LSH banding + union-find.

Algorithm:

1. Shingle each document into 5-grams (word-level).

2. Compute 112 MinHash permutations per document -> a signature.

3. Split the 112 perms into 14 bands of 8 rows each. Two docs share a candidate bucket if any band matches. The candidate probability follows the S-curve 1-(1-s^r)^b, where s is the Jaccard similarity, r the rows per band, and b the number of bands. For 14x8 the usual threshold estimate (1/b)^(1/r) is ~0.72, where roughly 65% of pairs become candidates; the exact 50% crossover is slightly lower, ~0.68. The transition is steep, so by ~0.85 nearly all pairs (~99%) are caught and by ~0.5 only about 5% are. You tune (bands, rows) to place that steep transition near the similarity you want to call a “duplicate.”

4. Shuffle by band hash so all candidates land on the same reducer (this is the global step).

5. Run distributed union-find / connected-components over candidate edges to form clusters; keep one representative per cluster.
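The five steps above can be sketched end to end on a single machine. This is a toy illustration under stated assumptions, not datatrove's or any lab's implementation: hash "permutations" are simulated with seeded universal hashing of the form (a*x + b) mod p, shingles are lowercased whitespace tokens, the in-memory `buckets` dict stands in for the distributed shuffle, and all names are hypothetical.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERMS, BANDS, ROWS = 112, 14, 8   # 14 bands x 8 rows = 112
PRIME = (1 << 61) - 1                 # Mersenne prime for universal hashing
_rng = random.Random(0)               # fixed seed keeps signatures reproducible
PERMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(NUM_PERMS)]


def shingles(text: str, n: int = 5) -> set[str]:
    """Word-level n-grams; short texts yield a single (shorter) shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text: str) -> list[int]:
    """Signature: the minimum of each simulated permutation over all shingles."""
    base = [int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
            for s in shingles(text)]
    return [min((a * x + b) % PRIME for x in base) for a, b in PERMS]


def find(parent: dict, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedup_clusters(docs: dict[str, str]) -> list[set[str]]:
    """Group documents that share at least one LSH band into clusters."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)       # at scale: the shuffle by band hash
    parent = {doc_id: doc_id for doc_id in docs}
    for members in buckets.values():
        for other in members[1:]:             # star edges: n - 1 unions, not n^2 pairs
            parent[find(parent, other)] = find(parent, members[0])
    clusters = defaultdict(set)
    for doc_id in docs:
        clusters[find(parent, doc_id)].add(doc_id)
    return list(clusters.values())
```

A small usage example: two documents that differ in one trailing word land in the same cluster, and keeping one representative per cluster is then a matter of picking any member (for instance the smallest `doc_id`).

```python
a = " ".join(f"w{i}" for i in range(200))
docs = {"a": a, "b": a + " changed", "c": " ".join(f"x{i}" for i in range(200))}
print(sorted(sorted(c) for c in dedup_clusters(docs)))
```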

| Parameter | Value | Effect |
| --- | --- | --- |
| Shingle | 5-grams | Balance: too short = false matches, too long = misses |
| Permutations | 112 | Signature precision vs cost |
| Bands x rows | 14 x 8 | Steep S-curve; threshold estimate (1/b)^(1/r) is ~0.72 |
| Jaccard threshold | ~0.85 | Operating point where nearly all dups are caught |
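The S-curve is worth evaluating numerically before committing to a (bands, rows) choice. A minimal sketch of the formula 1-(1-s^r)^b from step 3; the function names are illustrative:

```python
def candidate_probability(s: float, bands: int = 14, rows: int = 8) -> float:
    """P(two docs with Jaccard similarity s share at least one LSH band)."""
    return 1.0 - (1.0 - s ** rows) ** bands


def threshold_estimate(bands: int = 14, rows: int = 8) -> float:
    """Common rule of thumb for where the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


for s in (0.5, 0.6, 0.7, 0.8, 0.85, 0.9):
    print(f"s={s:.2f}  P(candidate)={candidate_probability(s):.3f}")
print(f"threshold estimate: {threshold_estimate():.3f}")
```

Re-running the loop with other settings, such as 20 bands of 5 rows, shows how fewer rows per band shift the transition toward lower similarity, which is the tuning knob the table describes.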

Staff insight — global is not always better. FineWeb’s experiments found that global (all-snapshots) MinHash dedup actually hurt downstream evals versus dedup within each snapshot independently. The reason: aggressive global dedup preferentially removes content that recurs across crawls — which is disproportionately high-quality, frequently-cited material — leaving a relatively over-represented tail of low-quality unique junk. So FineWeb deduped per-snapshot. The decision is empirical, settled by ablation, not by “more dedup = better.”

Trap: treating dedup as an unbounded global all-pairs / connected-components graph. The union-find graph itself can OOM at trillion-token scale if a single LSH bucket explodes (a popular boilerplate band can pull in millions of docs). Mitigations: cap bucket sizes, salt/partition hot bands, and process union-find in a streaming/partitioned fashion rather than materializing the full edge set in memory.

Deep dive B: Learned quality filtering

The biggest lever on capability. The hierarchy is cheap-to-expensive: language ID -> heuristics (Gopher rules: symbol ratios, bullet fraction, mean word length; C4 rules: drop pages without terminal punctuation, “lorem ipsum”, etc.) -> a model-based quality classifier.
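To give a feel for the heuristic layer, here is a sketch loosely modeled on the Gopher and C4 rule families named above. The specific thresholds and the function name are illustrative assumptions, not the published rule sets, which contain more rules and apply some of them per line rather than per page.

```python
def passes_heuristics(text: str) -> bool:
    """Cheap page-level checks; thresholds are illustrative assumptions."""
    words = text.split()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not words or not lines:
        return False

    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:              # implausible word lengths
        return False

    symbols = text.count("#") + text.count("...")
    if symbols / len(words) > 0.1:                # symbol-to-word ratio
        return False

    bullets = sum(ln.lstrip().startswith(("-", "*", "•")) for ln in lines)
    if bullets / len(lines) > 0.9:                # page is almost all bullets
        return False

    if "lorem ipsum" in text.lower():             # placeholder text
        return False

    terminal = sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines)
    return terminal / len(lines) >= 0.5           # most lines end like sentences


print(passes_heuristics("The pipeline reads raw crawl data.\nIt then removes duplicate pages."))
print(passes_heuristics("- deduplicate documents.\n- tokenize everything."))
```

Rules like these are cheap enough to run on every extracted page, which is why they sit in front of the classifier pass.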

How you train the classifier (DCLM / FineWeb-Edu pattern):

1. Positives = a proxy for “good” text: documents linked from curated sources, ELI5/OpenHermes-style instruction data, Wikipedia-referenced pages, or (FineWeb-Edu) pages an LLM rated as educational.

2. Negatives = random web text.

3. Train a fastText linear classifier (cheap enough to score hundreds of T tokens) — or distill an LLM’s educational rating into a small embedding-plus-regressor model as FineWeb-Edu does.

4. Pick the percentile cutoff (e.g. keep top 10%) by ablation, not by classifier accuracy: train proxy models at several cutoffs, evaluate downstream.
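Step 4 amounts to producing one candidate subset per cutoff and letting proxy-model evals decide between them. A minimal sketch of the subset construction, assuming classifier scores are already attached to each document; `score_cutoff` and `candidate_subsets` are hypothetical helper names, and the proxy training and evaluation are out of scope here.

```python
import math


def score_cutoff(scores: list[float], keep_fraction: float) -> float:
    """Score threshold such that roughly the top `keep_fraction` of docs survive."""
    if not scores or not 0 < keep_fraction <= 1:
        raise ValueError("need scores and 0 < keep_fraction <= 1")
    ranked = sorted(scores, reverse=True)
    n_keep = max(1, math.floor(len(ranked) * keep_fraction))
    return ranked[n_keep - 1]


def candidate_subsets(scored_docs: list[tuple[str, float]],
                      keep_fractions: list[float]) -> dict[float, list[str]]:
    """One candidate training subset per cutoff, each to be judged by a proxy-model run."""
    scores = [s for _, s in scored_docs]
    subsets = {}
    for frac in keep_fractions:
        cut = score_cutoff(scores, frac)
        subsets[frac] = [doc_id for doc_id, s in scored_docs if s >= cut]
    return subsets


scored = [(f"d{i}", i / 100) for i in range(100)]
for frac, ids in candidate_subsets(scored, [0.05, 0.1, 0.2]).items():
    print(frac, len(ids))
```

Ties at the cutoff score are all kept, so a subset can be slightly larger than the requested fraction; at corpus scale the threshold would come from an approximate quantile sketch rather than a full sort.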

Staff insight — validate on downstream evals, never perplexity. A filter tuned to minimize held-out perplexity does not transfer to capability; perplexity rewards predictable text and can be minimized by selecting bland, repetitive content. The discipline DCLM/FineWeb codified is: train a small proxy model on each candidate subset and measure MMLU/ARC/HellaSwag/etc. FineWeb-Edu’s single educational-quality classifier produced one of the strongest small-data corpora precisely because the cutoff was eval-validated.

| Filter layer | Cost | Mechanism |
| --- | --- | --- |
| Language ID | very cheap | fastText LID, confidence threshold |
| Heuristics | cheap | Gopher + C4 hand rules |
| Model-based quality | moderate (the big inference pass) | fastText / distilled LLM rating, keep top ~10% |

Trap — filtering bias. Aggressive learned filtering narrows the distribution toward whatever your positives look like (often Wikipedia-/English-/formal-leaning). That can erase dialects, low-resource languages, code styles, and niche domains, capping multilingual and long-tail capability. Mitigations: per-domain/per-language cutoffs rather than one global threshold, monitoring the distribution of kept data, and treating “what got removed” as a first-class metric.

Deep dive C: Data mixing & curriculum

Given clean sources, how much of each and in what order materially changes the model. Two layers: static mixture weights, and a curriculum over training.

Mixture weights — DoReMi. Rather than hand-tuning weights, DoReMi trains a small (~280M) proxy model with Group DRO to find domain weights that minimize worst-case excess loss, then applies those weights to train the large model. Practically you also deliberately upsample code and multilingual data beyond their natural web frequency because they’re high-value and scarce.
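The core of the DoReMi reweighting is an exponentiated-gradient step on the domain weights: domains where the proxy model lags a reference model gain weight. The sketch below shows one such step as I understand it from the paper; the step size, the smoothing constant, and the function name are illustrative assumptions. In the full method this update is interleaved with proxy-model training and the final weights are an average over steps, neither of which is shown.

```python
import math


def doremi_weight_step(weights: list[float], proxy_loss: list[float],
                       ref_loss: list[float], step_size: float = 1.0,
                       smoothing: float = 1e-3) -> list[float]:
    """One exponentiated-gradient update of the domain weights."""
    # Excess loss: how much worse the proxy is than the reference, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_loss, ref_loss)]
    raw = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(raw)
    uniform = 1.0 / len(weights)
    # Mix with uniform so no domain's weight collapses to zero.
    return [(1 - smoothing) * r / total + smoothing * uniform for r in raw]


weights = [0.5, 0.3, 0.2]   # e.g. web, code, books (hypothetical domains)
new = doremi_weight_step(weights, proxy_loss=[2.0, 3.0, 2.5], ref_loss=[2.0, 2.5, 2.6])
print([round(w, 3) for w in new])
```

Only the second domain has positive excess loss in this example, so it gains weight at the expense of the other two while the weights still sum to one.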

Curriculum / annealing. Llama 3 (and others) anneal on top-quality data at the end: during the final phase of pretraining, decay the learning rate toward zero while shifting the mixture toward the highest-quality, highest-value sources (curated math, code, reference). The model spends its last, most-retained gradient steps on the best data.
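An annealing phase couples two schedules: the learning rate and the mixture. The sketch below is a deliberately simple version, assuming a constant rate followed by linear decay and a linear mixture shift over the last 10% of steps; it is not Llama 3's published schedule, and `mixture_at`, `lr_at`, and the example weights are hypothetical. Both mixtures must use the same source names.

```python
def mixture_at(step: int, total_steps: int, base: dict[str, float],
               anneal: dict[str, float], anneal_frac: float = 0.1) -> dict[str, float]:
    """Base mixture for most of training, then a linear shift to the anneal mixture."""
    start = int(total_steps * (1 - anneal_frac))
    if step < start:
        return dict(base)
    t = (step - start) / max(1, total_steps - start)   # 0 -> 1 across the anneal phase
    return {k: (1 - t) * base[k] + t * anneal[k] for k in base}


def lr_at(step: int, total_steps: int, peak_lr: float, anneal_frac: float = 0.1) -> float:
    """Constant LR, then linear decay to zero over the anneal phase (illustrative)."""
    start = int(total_steps * (1 - anneal_frac))
    if step < start:
        return peak_lr
    return peak_lr * (1 - (step - start) / max(1, total_steps - start))


base = {"web": 0.7, "code": 0.2, "math": 0.1}
anneal = {"web": 0.3, "code": 0.35, "math": 0.35}
for step in (0, 900, 950, 1000):
    print(step, lr_at(step, 1000, 3e-4), mixture_at(step, 1000, base, anneal))
```

Because the mixture is a pure function of the step, the data loader can resume mid-anneal and still reproduce the same sampling weights.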

| Knob | Technique | Reference |
| --- | --- | --- |
| Domain weights | DoReMi proxy weights (280M) | DoReMi |
| Scarce high-value data | Upsample code / multilingual | Llama 3, common practice |
| End-of-training | LR -> 0 anneal on top-quality data | Llama 3 |
| Source selection | Fewer-but-better can beat more | DCLM |

Staff insight — more sources can hurt. DCLM found that naively adding more data sources sometimes lowered downstream scores — a noisy or off-distribution source dilutes the mixture. Source inclusion is itself an ablation decision, not a default-yes.

Briefly: decontamination

Build a set of 13-grams from every example in the eval suite, then flag/remove any training document containing those n-grams (Llama-3-style). Trap: skipping or under-powering decontamination silently inflates benchmark numbers — the model has memorized the test set — and you only discover it when the model fails to generalize. Decontamination must run against the exact eval suite version used for reporting, and that version is pinned in the recipe.
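The 13-gram check reduces to set intersection. A minimal sketch, assuming lowercased whitespace tokens (real pipelines normalize more aggressively and may use tokenizer ids); the function names are illustrative. Note that any eval example shorter than 13 tokens contributes no n-grams under this scheme, so such examples need a separate shorter-n or exact-match rule.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of the text, as tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_examples: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Union of the n-grams of every eval example."""
    index = set()
    for example in eval_examples:
        index |= ngrams(example, n)
    return index


def contamination_hits(doc: str, index: set[tuple[str, ...]],
                       n: int = 13) -> set[tuple[str, ...]]:
    """N-grams the document shares with the eval suite; empty set means clean."""
    return ngrams(doc, n) & index


eval_q = ("what is the capital of france and which river flows "
          "through the middle of it")
index = build_eval_index([eval_q])
leaked = "forum post: " + eval_q + " answer paris"
print(len(contamination_hits(leaked, index)), len(contamination_hits("an unrelated page", index)))
```

The returned hits map directly onto the `hit_ngrams[]` field in the data model, which makes the per-benchmark contamination report a by-product of the filter.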

Briefly: tokenize, global shuffle, shard

Tokenize with a pinned BPE tokenizer (vocab ~128k), pack token streams into fixed seq_len sequences (concatenating documents with separators to avoid padding waste), then globally shuffle so consecutive samples don’t come from the same source/snapshot. Write .bin (packed uint32) + .idx (offsets) shards for O(1) seekable mmap reads. Trap: recomputing this — or the whole pipeline — per ablation burns millions of CPU-hours; content-addressed caching means an ablation that only changes the mix weights reuses every upstream artifact and only re-runs mixing + tokenize.
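Packing and the seeded shuffle can be shown on toy data. A minimal in-memory sketch: `SEP_ID`, `pack`, and `global_shuffle` are hypothetical names, the trailing partial sample is simply dropped, and a real global shuffle over trillions of tokens would be an external multi-pass shuffle rather than a Python list permutation.

```python
import random

SEP_ID = 0  # hypothetical end-of-document token id


def pack(token_docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Concatenate documents with a separator and cut into fixed-length samples."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(SEP_ID)
    n_full = len(stream) // seq_len          # the trailing partial sample is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def global_shuffle(samples: list[list[int]], seed: int) -> list[list[int]]:
    """Deterministic permutation: the same seed always yields the same order."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    return [samples[i] for i in order]


samples = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4)
print(samples)
print(global_shuffle(samples, seed=1234567))
```

Documents may straddle sample boundaries, which is the price of zero padding; the separator id lets the trainer mask attention across documents if it chooses to.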

Multi-team rollout

Operate it like production data infrastructure, with the model-eval ablation as the release gate.

Per-stage quality gates and validation:

Per-stage data-quality metrics: survival rate, language distribution, mean quality score, dedup cluster-size distribution, token counts — tracked per stage and alerted on drift.
Schema/format validation between stages: every stage asserts its input schema so a malformed upstream output fails fast rather than silently poisoning the corpus.
Canary sampling + human spot-checks: sample N docs per stage for human review; catches extraction regressions (e.g. a Trafilatura version that starts keeping nav boilerplate) that metrics miss.

Reliability:

Checkpoint-restart on preempted/spot nodes: stages are idempotent and write to content-addressed prefixes, so a killed worker's work is safely redone.
Cost/throughput dashboards: CPU-hours per stage, $/T-tokens, queue depth — so you catch a stage that's quietly 5x over budget.

The release gate — ablation workflow. No recipe change ships on intuition. The workflow: change the recipe -> build (mostly cached) the new dataset version -> train a small proxy model -> evaluate on the downstream suite -> compare to the current recipe. Only an improvement (or neutral-with-other-benefit) ships. This is the analog of CI/CD for data.

Compliance & reproducibility:

Dataset versions and recipe configs live in source control; the manifest makes any version byte-reproducible from (recipe, seed).
Audit/takedown handling: lineage lets you answer "is this URL in version X?" and remove it, producing a new compliant version — required for licensing disputes, PII requests, and opt-outs.

Bottlenecks & evolution

Current limits:

| Limit | Why it bites | Direction |
| --- | --- | --- |
| Unique-token scarcity | High-quality web text is finite vs growing token budgets | Synthetic/rephrased data, multi-epoch, multilingual expansion |
| Dedup connected-components scaling | Union-find graph/hot buckets OOM at trillion scale | Partitioned/streaming union-find, bucket caps |
| Quality-classifier bias | Aggressive filtering narrows the distribution | Per-domain cutoffs, distribution monitoring |
| Loader as train-time bottleneck | If shards can't saturate 10k+ GPUs, compute idles | Prefetch, sharding to avoid hot keys, near-cluster replication |

Observability: the through-line is that the metric that matters is downstream eval movement, so the most important “monitor” is a tight quality-to-eval feedback loop — cheap proxy-model ablations run continuously as the recipe evolves, not just at release.

Evolution — the shift from “more data” to “better data”:

Model-based filtering everywhere (FineWeb-Edu, DCLM) is now the default, not an enhancement.
Synthetic and rephrased data to escape unique-token scarcity and to densify scarce skills (math, code, reasoning).
Multimodal corpora (image/audio/video-text) as the same pipeline generalizes beyond text.
Continual ingestion of new crawls to keep the corpus fresh.
Scaling-law-driven token budgets: the token count is chosen from compute-optimal scaling laws, and the bet has decisively moved from accumulating more tokens to curating better ones.

Summary

1. Data quality is the dominant lever on model capability — for pretraining, the corpus, not the architecture, is most of the story. Aggressive eval-validated filtering (keep ~top 10%, per DCLM) beats far larger unfiltered pools.

2. It’s an offline, reproducible, ablation-driven batch pipeline whose core hard problems are trillion-scale fuzzy dedup (MinHash banding + distributed union-find), learned quality filtering (fastText/distilled classifier with an eval-chosen cutoff), rigorous decontamination (13-gram vs the pinned eval suite), deliberate mixing/curriculum (DoReMi weights, code/multilingual upsampling, end-of-training anneal), and a tokenized-shard contract that never starves the cluster.

3. Treat the dataset as a versioned recipe with content-addressed intermediates, manifests/lineage, and idempotent checkpoint-restart stages — so ablations are cheap and any version is byte-reproducible and auditable. Strong answers cite FineWeb/FineWeb-Edu, DCLM, Llama 3, and datatrove as current practice.

4. Prove every choice with downstream evals, not perplexity. Perplexity-tuned filters don’t transfer; global dedup can hurt; more sources can hurt; skipping decontamination silently inflates benchmarks. “Correct” here means a model trains measurably better — provable only with controlled small-model experiments.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Pipeline & orchestration | Lists ingest→filter→tokenize stages. | Designs a DAG of idempotent, restartable, content-addressed stages over Slurm/Ray/Spark with checkpoint-restart on preempted nodes. |
| Deduplication | Removes exact duplicates by hash. | Layers Bloom/exact then MinHash-LSH (5-grams, ~112 hashes, 0.85 Jaccard) + union-find; per-snapshot first, knows global dedup can hurt. |
| Quality filtering | Applies length/symbol heuristics. | Adds a learned classifier (top ~10%), validates on downstream benchmarks not perplexity, and watches for distribution bias. |
| Decontamination | Mentions removing test data. | Runs 13-gram overlap against all reported eval sets with detect-then-confirm, logged per-benchmark and tied to the recipe version. |
| Data mixing & curriculum | Mixes sources roughly equally. | Sets DoReMi proxy weights, upsamples code/multilingual, anneals on top-quality data at the end; knows more sources can hurt. |
| Reproducibility & cost | Notes the pipeline is expensive. | Versions every artifact as a recipe and caches intermediates so ablations re-run only the cheap tail, saving millions of CPU-hours. |
| Loader contract | Tokenizes the corpus. | Pre-tokenizes, globally shuffles, and writes packed indexed mmap shards so the loader saturates 10k+ GPUs with <1% idle. |
