Design an LLM Pretraining Data Pipeline
A Staff-level walkthrough of the LLM pretraining data pipeline that Anthropic, OpenAI, and Google DeepMind run to produce frontier training corpora. It is an offline, reproducible, ablation-driven batch system whose core hard problems are trillion-scale fuzzy dedup, learned quality filtering, rigorous benchmark decontamination, deliberate data mixing/curriculum, and a tokenized-shard contract that never starves the training cluster — with every curation choice proven by downstream evals rather than perplexity.
Scope & ambiguity
Let me frame what we’re actually building. The goal is the petabyte-scale offline batch system that turns raw web crawls into a clean, deduplicated, decontaminated, mixed, tokenized multi-trillion-token corpus that a training cluster can stream without ever becoming the bottleneck. This is not a serving system and not the training loop — it’s a one-directional ETL DAG that runs for weeks and produces a versioned dataset artifact. The defining property is that success is measured by downstream model evals via controlled ablations, not by SLAs, row counts, or perplexity. So I’ll treat the dataset as a versioned recipe, and every curation choice I make has to be provable by training a small proxy model and seeing benchmarks move.
This is the kind of system frontier labs — Anthropic, OpenAI, Google DeepMind, Meta — build internally to produce their pretraining corpora, and it’s a common AI-infrastructure interview topic because it sits exactly at the seam between classic distributed systems and ML. The open references that anchor current practice are FineWeb / FineWeb-Edu, DCLM (DataComp-LM), Llama 3’s data section, and HuggingFace’s datatrove pipeline library. I’ll cite them throughout because “I know what’s published” is a real signal here.
Scope boundaries:
- In scope: ingestion of raw crawls -> text extraction -> filtering -> dedup -> decontamination -> PII/safety scrub -> data mixing -> tokenization + global shuffle + sharding. The output is a set of tokenized, packed, seekable shards plus a manifest.
- Out of scope: the training loop itself, optimizer/parallelism, RLHF / post-training / preference data, and the inference stack. I'll mention the data-loader contract because it constrains the output format, but I won't design the trainer.
- Sources: CommonCrawl web snapshots (the bulk), code (GitHub-derived, e.g. The Stack), licensed/high-quality text (books, papers, reference), and increasingly synthetic/rephrased data.
- Success criterion: a model trained on this corpus scores better on a held-out downstream eval suite (MMLU, HellaSwag, ARC, GSM8K, HumanEval, etc.) than one trained on the previous recipe, at matched compute.
Who asks this & what they probe
Requirements
Functional requirements:
- Ingest raw crawls (CommonCrawl WARC/WAT/WET), code repos, and licensed corpora from object storage, respecting robots/licensing.
- Extract clean text from raw HTML (boilerplate removal, main-content extraction).
- Filter in a hierarchy: language ID, cheap heuristic/quality rules, then a learned model-based quality classifier.
- Deduplicate at url / document / line granularity, including trillion-scale fuzzy (near-duplicate) dedup.
- Decontaminate against the benchmark eval suite so reported numbers are trustworthy.
- Scrub PII and apply safety filters (CSAM hashes, toxicity gates where appropriate).
- Mix sources with deliberate weights and a curriculum (upsampling, annealing).
- Tokenize, globally shuffle, and pack into shards.
Non-functional requirements:
- Reproducibility & versioning: a recipe + seed must reproduce a byte-identical dataset; every artifact is addressable and pinned.
- Lineage / provenance: for any token, I can trace back to the source document, snapshot, and which stages touched it — needed for audits and takedowns.
- Compliance: robots.txt and license respect, opt-out/takedown handling, PII removal.
- Fault tolerance: runs on preemptible/spot nodes for weeks; stages must be idempotent and checkpoint-restartable.
- Cost ceiling: a full pipeline run costs on the order of 1M–10M CPU-hours; ablations must not require recomputing it, so heavy caching of content-addressed intermediates is mandatory.
The two requirements that define this system:
Back-of-envelope estimation
Input pool. CommonCrawl publishes a snapshot roughly monthly; there are ~90–100 usable snapshots, each a few billion pages and on the order of hundreds of TB of uncompressed content. Naively that’s tens of PB of raw bytes, but snapshots overlap heavily. After text extraction the raw pool lands around 200–240 trillion tokens: DCLM-Pool is ~240T tokens, which is the canonical number to anchor on, and dedup plus filtering shrink it substantially from there.
DCLM-Baseline’s headline result is that aggressive model-based filtering keeping roughly the top ~10% of documents beats much larger unfiltered pools — that’s the “better data beats more data” thesis in one number.
Storage budget:
The collapse from tens of PB of intermediates to on the order of 100 TB of final shards is the whole point. With a 128k-vocab tokenizer the IDs do not fit in 16 bits, so token IDs are stored as uint32 (4 bytes/token): a 15–30T token corpus is ~60–120 TB, still small enough to replicate near the training cluster and mmap. (A smaller vocab that fit in uint16 would halve this to ~30–60 TB, which is the tradeoff if storage near the cluster is tight.)
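The storage arithmetic above is easy to re-derive. A minimal sketch, assuming fixed-width unsigned token IDs and decimal terabytes; `shard_bytes` is a hypothetical helper name, not part of any library:

```python
def shard_bytes(num_tokens: float, vocab_size: int) -> float:
    """Bytes needed to store token IDs in the narrowest fixed-width unsigned type."""
    # IDs 0..vocab_size-1 fit in uint16 only if vocab_size <= 65536.
    bytes_per_token = 2 if vocab_size <= 2 ** 16 else 4
    return num_tokens * bytes_per_token


TB = 1e12  # decimal terabyte

for tokens in (15e12, 30e12):
    for vocab in (65_536, 128_000):
        size_tb = shard_bytes(tokens, vocab) / TB
        print(f"{tokens / 1e12:.0f}T tokens, vocab {vocab}: {size_tb:.0f} TB")
```

Running the loop for both vocab sizes makes the uint16 versus uint32 tradeoff explicit before anyone commits to a tokenizer.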
Compute: on the order of 1M–10M CPU-hours, dominated by extraction and the dedup shuffle, with weeks of wall-clock on a large CPU pool. The quality-classifier inference pass over hundreds of trillions of tokens is the other big cost, which is why the classifier is kept small (fastText-class).
API design
The “APIs” here are stage contracts, the declarative recipe, the loader contract, and the lineage catalog — not RPC endpoints.
Stage contract — every stage is idempotent: input prefix in object store -> output prefix + manifest.
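To make the stage contract concrete, here is a minimal sketch that uses a local directory as a stand-in for an object-store prefix. The function name `run_stage`, the `MANIFEST.json` file name, and the key layout are illustrative assumptions, not the API of datatrove or any other library:

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Callable

MANIFEST = "MANIFEST.json"  # hypothetical manifest file name


def run_stage(name: str, version: str, in_dir: Path, out_root: Path,
              transform: Callable[[bytes], bytes]) -> Path:
    """Run one stage idempotently and return its content-addressed output dir."""
    inputs = sorted(p for p in in_dir.iterdir()
                    if p.is_file() and not p.name.startswith(MANIFEST))
    # Address = hash(stage name + stage version + every input's name and bytes).
    key = hashlib.sha256(f"{name}:{version}".encode())
    for p in inputs:
        key.update(p.name.encode())
        key.update(hashlib.sha256(p.read_bytes()).digest())
    out_dir = out_root / f"{name}-{key.hexdigest()[:16]}"
    manifest = out_dir / MANIFEST
    if manifest.exists():
        return out_dir  # a previous attempt finished; nothing to redo
    out_dir.mkdir(parents=True, exist_ok=True)
    files = {}
    for p in inputs:
        data = transform(p.read_bytes())
        (out_dir / p.name).write_bytes(data)  # a retry overwrites partial output
        files[p.name] = hashlib.sha256(data).hexdigest()
    # The manifest is written last and renamed atomically: present means complete.
    tmp = out_dir / (MANIFEST + ".tmp")
    tmp.write_text(json.dumps(
        {"stage": name, "version": version, "files": files}, sort_keys=True))
    os.replace(tmp, manifest)
    return out_dir
```

Because the output location is derived from the inputs and the stage version, re-running an unchanged stage is a no-op, while bumping the version or changing any input produces a fresh prefix and leaves the old one intact.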
Recipe API — the dataset is a declarative config under version control. Changing it and re-running is how you produce a new dataset version.
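One way to picture the recipe is as an immutable config whose canonical serialization is hashed into a dataset version ID. This is a sketch under assumptions: every field name and value below is hypothetical and chosen only to mirror the stages discussed in this walkthrough.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical declarative recipe; field names are illustrative only."""
    snapshots: tuple             # crawl snapshot IDs to ingest
    dedup_scope: str             # e.g. "per_snapshot" or "global"
    quality_keep_fraction: float
    mix_weights: tuple           # ((source, weight), ...) pairs
    tokenizer: str               # pinned tokenizer identifier
    eval_suite_version: str      # pinned for decontamination
    seed: int

    def version_id(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so equal recipes hash equally.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]


baseline = Recipe(
    snapshots=("CC-MAIN-2023-50",),
    dedup_scope="per_snapshot",
    quality_keep_fraction=0.10,
    mix_weights=(("web", 0.8), ("code", 0.2)),
    tokenizer="bpe-128k-v1",
    eval_suite_version="evals-v3",
    seed=0,
)
print(baseline.version_id())
```

Any change to a field, including the seed or the pinned eval suite version, yields a different ID, which is what lets a manifest refer to exactly one dataset build.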
Loader contract — the only thing the trainer sees. Deterministic, seekable, version-pinned.
Lineage / catalog API — provenance and dataset versions as the source of truth.
Data model
The artifact evolves stage-by-stage; each representation is content-addressed (hash of content + producing-stage version) so identical inputs are never recomputed.
Key stores:
Three-granularity dedup (following Llama 3): url-level (drop re-crawls of the same page), document-level (MinHash near-dup), and line-level (strip boilerplate lines — nav bars, cookie banners — that recur across many docs). Each catches a different class of redundancy.
High-level architecture
A linear DAG. Most stages are embarrassingly parallel per-snapshot/per-shard; two stages — cross-shard dedup and the global shuffle — require global coordination and are where the systems difficulty concentrates.
Execution. Orchestrated by Slurm/Ray/Spark executors over object storage — datatrove is the open reference for exactly this (pluggable executors, streaming readers, per-stage stats). Per-snapshot work fans out across thousands of CPU workers reading WARC ranges directly from S3/GCS. Output is written under content-addressed prefixes so a preempted node’s partial output is simply overwritten on retry without corruption.
Where global coordination is unavoidable: near-duplicate dedup must compare documents across snapshots, so it’s a distributed shuffle (group by LSH band) followed by connected-components/union-find — it cannot stay embarrassingly parallel. The final global shuffle likewise must interleave samples from every source so the trainer never sees long homogeneous runs.
Deep dives
Where Staff is won: I’ll go deep on the three hardest subsystems (trillion-scale dedup, learned quality filtering, and data mixing/curriculum), then briefly cover decontamination and the tokenize/shuffle/shard contract. The recurring Staff theme: every choice is validated by downstream ablation, and the naive version of each blows up at scale.
Deep dive A: Trillion-scale fuzzy dedup
Exact dedup is easy: hash documents, drop collisions. The hard problem is near-duplicates — boilerplate-heavy pages, mirrored content, slight reformatting — at hundreds of T tokens. All-pairs comparison is O(n^2) and impossible. The standard solution is MinHash + LSH banding + union-find.
Algorithm:
1. Shingle each document into 5-grams (word-level).
2. Compute 112 MinHash permutations per document -> a signature.
3. Split the 112 perms into 14 bands of 8 rows each. Two docs share a candidate bucket if any band matches. The candidate probability follows the S-curve 1-(1-s^r)^b. For 14x8 the 50% crossover sits at a Jaccard of ~0.68 (the common (1/b)^(1/r) rule of thumb gives ~0.72, where the probability is already ~65%); the transition is steep, so by ~0.85 nearly all pairs are caught and well below ~0.7 most are dropped. You tune (bands, rows) to place that steep transition near the similarity you want to call a “duplicate.”
4. Shuffle by band hash so all candidates land on the same reducer (this is the global step).
5. Run distributed union-find / connected-components over candidate edges to form clusters; keep one representative per cluster.
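The S-curve in step 3 can be evaluated directly, which is a cheap way to sanity-check a (bands, rows) choice before running anything at scale. A small sketch; the function names are hypothetical, and the crossover formula is just the S-curve solved for s:

```python
def candidate_probability(s: float, bands: int = 14, rows: int = 8) -> float:
    """P(two docs with Jaccard similarity s share at least one LSH band)."""
    return 1.0 - (1.0 - s ** rows) ** bands


def crossover(bands: int = 14, rows: int = 8, target: float = 0.5) -> float:
    """Similarity at which the candidate probability equals `target`."""
    # Invert 1 - (1 - s^r)^b = target  =>  s = (1 - (1 - target)^(1/b))^(1/r)
    return (1.0 - (1.0 - target) ** (1.0 / bands)) ** (1.0 / rows)


for s in (0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9):
    print(f"s={s:.2f}  P(candidate)={candidate_probability(s):.3f}")
print(f"50% crossover for 14x8: {crossover():.3f}")
```

Sweeping `bands` and `rows` with the total number of permutations held fixed shows how the transition moves: more rows per band pushes it toward higher similarity, more bands pulls it lower.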
Staff insight — global is not always better. FineWeb’s experiments found that global (all-snapshots) MinHash dedup actually hurt downstream evals versus dedup within each snapshot independently. The reason: aggressive global dedup preferentially removes content that recurs across crawls — which is disproportionately high-quality, frequently-cited material — leaving a relatively over-represented tail of low-quality unique junk. So FineWeb deduped per-snapshot. The decision is empirical, settled by ablation, not by “more dedup = better.”
Trap: treating dedup as an unbounded global all-pairs / connected-components graph. The union-find graph itself can OOM at trillion-token scale if a single LSH bucket explodes (a popular boilerplate band can pull in millions of docs). Mitigations: cap bucket sizes, salt/partition hot bands, and process union-find in a streaming/partitioned fashion rather than materializing the full edge set in memory.
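A single-process sketch of the algorithm, with the bucket cap from the mitigation list, might look like the following. It is an illustration under assumptions, not a production implementation: salted BLAKE2b stands in for the permutation family, everything lives in memory, and the names `minhash`, `dedup`, and `max_bucket` are hypothetical. At real scale the bucket grouping is the distributed shuffle and the union-find is partitioned.

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS, ROWS = 112, 14, 8


def shingles(text: str, n: int = 5) -> set:
    """Word-level n-gram shingles; short docs yield a single shingle."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text: str) -> list:
    """One min-hash per 'permutation', simulated with a per-seed salt."""
    sh = shingles(text)
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in sh))
    return sig


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)  # smallest ID is the representative


def dedup(docs: dict, max_bucket: int = 1000) -> list:
    """Return the sorted doc IDs kept, one representative per near-dup cluster."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(doc_id)
    uf = UnionFind()
    for members in buckets.values():
        # Cap hot buckets; docs past the cap are not merged via this bucket.
        capped = members[:max_bucket]
        # Star edges (first member to each other) instead of all pairs.
        for other in capped[1:]:
            uf.union(capped[0], other)
    return sorted({uf.find(d) for d in docs})
```

Linking each bucket as a star keeps the edge count linear in bucket size, and the cap bounds the damage from one boilerplate-heavy band; the cost is that some true duplicates inside an oversized bucket survive, which is the conservative failure mode.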
Deep dive B: Learned quality filtering
The biggest lever on capability. The hierarchy is cheap-to-expensive: language ID -> heuristics (Gopher rules: symbol ratios, bullet fraction, mean word length; C4 rules: drop lines without terminal punctuation, drop pages containing “lorem ipsum”, etc.) -> a model-based quality classifier.
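The heuristic tier is just a handful of cheap per-document checks that run before any model. A minimal sketch in the spirit of the Gopher and C4 rules named above; the specific thresholds and the function name `passes_heuristics` are illustrative assumptions, and a real pipeline would take its values from the published rule sets and from ablations:

```python
def passes_heuristics(text: str, min_words: int = 50) -> bool:
    """Cheap rule-based gate; thresholds here are illustrative, not canonical."""
    words = text.split()
    if len(words) < min_words:
        return False
    # Mean word length outside a plausible range suggests gibberish or markup.
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # Symbol-to-word ratio: hashes and ellipses relative to word count.
    symbols = text.count("#") + text.count("...")
    if symbols / len(words) > 0.1:
        return False
    # Pages that are almost entirely bullet lines are usually menus or link lists.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    bullets = sum(line.startswith(("-", "*", "•")) for line in lines)
    if bullets / len(lines) > 0.9:
        return False
    # Placeholder text marks template pages.
    if "lorem ipsum" in text.lower():
        return False
    return True
```

Each rule is cheap enough to run on every document, which is why this tier sits ahead of the learned classifier: it removes obvious junk so the more expensive scoring pass sees less data.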
How you train the classifier (DCLM / FineWeb-Edu pattern):
1. Positives = a proxy for “good” text: documents linked from curated sources, ELI5/OpenHermes-style instruction data, Wikipedia-referenced pages, or (FineWeb-Edu) pages an LLM rated as educational.
2. Negatives = random web text.
3. Train a fastText linear classifier (cheap enough to score hundreds of T tokens) — or distill an LLM’s educational rating into a small embedding-plus-regressor model as FineWeb-Edu does.
4. Pick the percentile cutoff (e.g. keep top 10%) by ablation, not by classifier accuracy: train proxy models at several cutoffs, evaluate downstream.
Staff insight — validate on downstream evals, never perplexity. A filter tuned to minimize held-out perplexity does not transfer to capability; perplexity rewards predictable text and can be minimized by selecting bland, repetitive content. The discipline DCLM/FineWeb codified is: train a small proxy model on each candidate subset and measure MMLU/ARC/HellaSwag/etc. FineWeb-Edu’s single educational-quality classifier produced one of the strongest small-data corpora precisely because the cutoff was eval-validated.
Trap — filtering bias. Aggressive learned filtering narrows the distribution toward whatever your positives look like (often Wikipedia-/English-/formal-leaning). That can erase dialects, low-resource languages, code styles, and niche domains, capping multilingual and long-tail capability. Mitigations: per-domain/per-language cutoffs rather than one global threshold, monitoring the distribution of kept data, and treating “what got removed” as a first-class metric.
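The per-language or per-domain cutoff mitigation amounts to computing the score threshold separately inside each group instead of once globally. A small sketch, assuming classifier scores are already attached to documents; `per_group_thresholds` is a hypothetical helper, and the keep fraction itself would still be chosen by ablation:

```python
import math
from collections import defaultdict


def per_group_thresholds(scored, keep_fraction: float) -> dict:
    """Map each group to the minimum score that keeps its top `keep_fraction`.

    `scored` is an iterable of (group, score) pairs, e.g. (language, quality score).
    """
    by_group = defaultdict(list)
    for group, score in scored:
        by_group[group].append(score)
    thresholds = {}
    for group, scores in by_group.items():
        scores.sort(reverse=True)
        # Always keep at least one document per group.
        k = max(1, math.ceil(keep_fraction * len(scores)))
        thresholds[group] = scores[k - 1]
    return thresholds


def keep(group, score, thresholds) -> bool:
    return score >= thresholds[group]
```

With a single global threshold, a group whose scores run systematically lower (for example a low-resource language the classifier saw few positives for) can be removed almost entirely; per-group thresholds keep the same fraction of every group, and the removed set can then be inspected group by group.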
Deep dive C: Data mixing & curriculum
Given clean sources, how much of each and in what order materially changes the model. Two layers: static mixture weights, and a curriculum over training.
Mixture weights — DoReMi. Rather than hand-tuning weights, DoReMi trains a small (~280M) proxy model with Group DRO to find domain weights that minimize worst-case excess loss, then applies those weights to train the large model. Practically you also deliberately upsample code and multilingual data beyond their natural web frequency because they’re high-value and scarce.
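Once weights are chosen (by DoReMi or by hand), the bookkeeping that matters is how many passes each source gets under a fixed token budget. The sketch below is not DoReMi itself, only that downstream arithmetic; the function name and the example token counts are made up for illustration:

```python
def epochs_per_source(available_tokens: dict, weights: dict,
                      budget_tokens: float) -> dict:
    """Passes over each source implied by mixture weights and a token budget."""
    total_weight = sum(weights.values())
    return {
        src: (weights[src] / total_weight) * budget_tokens / available_tokens[src]
        for src in weights
    }


# Illustrative numbers only: a large web pool and a much smaller code pool.
available = {"web": 10e12, "code": 1e12}
weights = {"web": 0.8, "code": 0.2}
for src, epochs in epochs_per_source(available, weights, budget_tokens=10e12).items():
    print(f"{src}: {epochs:.2f} epochs")
```

A value above 1.0 means the source is repeated, which is exactly what upsampling scarce code or multilingual data implies; how many repeats are tolerable before returns diminish is itself something to settle by ablation.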
Curriculum / annealing. Llama 3 (and others) anneal on top-quality data at the end: during the final phase of pretraining, decay the learning rate toward zero while shifting the mixture toward the highest-quality, highest-value sources (curated math, code, reference). The model spends its last, most-retained gradient steps on the best data.
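The annealing phase can be sketched as a schedule that returns a learning rate and a mixture for each step. The shapes here are assumptions made for illustration: a constant peak rate before the anneal (real schedules add warmup and usually cosine decay), a linear decay to zero, and a linear interpolation between the base and anneal mixtures. None of this is claimed to match the exact schedule of any published model.

```python
def anneal_schedule(step: int, total_steps: int, anneal_steps: int,
                    peak_lr: float, base_mix: dict, anneal_mix: dict):
    """Return (learning_rate, mixture_weights) for a training step."""
    start = total_steps - anneal_steps
    if step < start:
        return peak_lr, dict(base_mix)
    t = (step - start) / anneal_steps  # 0.0 at anneal start, 1.0 at the last step
    lr = peak_lr * (1.0 - t)
    sources = set(base_mix) | set(anneal_mix)
    mix = {
        s: (1.0 - t) * base_mix.get(s, 0.0) + t * anneal_mix.get(s, 0.0)
        for s in sources
    }
    return lr, mix


base = {"web": 0.8, "code": 0.15, "math": 0.05}
final = {"web": 0.2, "code": 0.4, "math": 0.4}
for step in (0, 80, 90, 100):
    print(step, anneal_schedule(step, 100, 20, 3e-4, base, final))
```

Because the interpolated weights are a convex combination of two mixtures that each sum to one, they also sum to one at every step, so the data loader never has to renormalize.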
Staff insight — more sources can hurt. DCLM found that naively adding more data sources sometimes lowered downstream scores — a noisy or off-distribution source dilutes the mixture. Source inclusion is itself an ablation decision, not a default-yes.
Briefly: decontamination
Build a set of 13-grams from every example in the eval suite, then flag/remove any training document containing those n-grams (the n-gram-overlap approach used for GPT-3; published recipes differ in the exact n and overlap threshold). Trap: skipping or under-powering decontamination silently inflates benchmark numbers because the model has memorized test items, and you often only discover it when the model fails to generalize. Decontamination must run against the exact eval suite version used for reporting, and that version is pinned in the recipe.
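The n-gram check is small enough to sketch in full. This version assumes lowercase word tokens from a simple regex and an in-memory Python set; a real pipeline would use the pinned tokenizer or a fixed normalization and a hashed or sharded index. The function names are hypothetical.

```python
import re


def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams after lowercasing; empty if the text is too short."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_examples, n: int = 13) -> set:
    """Union of n-grams over every example in the pinned eval suite."""
    index = set()
    for example in eval_examples:
        index |= ngrams(example, n)
    return index


def is_contaminated(doc: str, eval_index: set, n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the eval suite."""
    return not ngrams(doc, n).isdisjoint(eval_index)
```

Whether a flagged document is dropped whole or only has the overlapping span removed is a policy choice; either way, the index should be rebuilt whenever the pinned eval suite version changes.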
Briefly: tokenize, global shuffle, shard
Tokenize with a pinned BPE tokenizer (vocab ~128k), pack token streams into fixed seq_len sequences (concatenating documents with separators to avoid padding waste), then globally shuffle so consecutive samples don’t come from the same source/snapshot. Write .bin (packed uint32) + .idx (offsets) shards for O(1) seekable mmap reads. Trap: recomputing this — or the whole pipeline — per ablation burns millions of CPU-hours; content-addressed caching means an ablation that only changes the mix weights reuses every upstream artifact and only re-runs mixing + tokenize.
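A toy version of the shard format shows why reads are O(1). The layout is an assumption for illustration: a `.bin` of little-endian uint32 token IDs and an `.idx` of little-endian uint64 token offsets. It uses only the standard library (`struct`, `mmap`) where a real loader would likely use NumPy, and it shuffles the read order rather than physically rewriting shards.

```python
import mmap
import random
import struct


def write_shard(docs, seq_len: int, sep_id: int, prefix: str) -> int:
    """Pack tokenized docs into fixed-length sequences; return the sequence count."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)  # separator between concatenated documents
    n_seq = len(stream) // seq_len  # the ragged tail is dropped in this sketch
    with open(prefix + ".bin", "wb") as f:
        f.write(struct.pack(f"<{n_seq * seq_len}I", *stream[:n_seq * seq_len]))
    with open(prefix + ".idx", "wb") as f:
        offsets = [i * seq_len for i in range(n_seq + 1)]
        f.write(struct.pack(f"<{n_seq + 1}Q", *offsets))
    return n_seq


def read_sequence(prefix: str, i: int) -> list:
    """Read sequence i with one index lookup and one mmap slice."""
    with open(prefix + ".idx", "rb") as f:
        f.seek(8 * i)
        start, end = struct.unpack("<2Q", f.read(16))
    with open(prefix + ".bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{end - start}I", mm, 4 * start))


def shuffled_order(n_seq: int, seed: int) -> list:
    """Deterministic visiting order for a given seed."""
    order = list(range(n_seq))
    random.Random(seed).shuffle(order)
    return order
```

Because sequence boundaries are fixed and indexed, a trainer that resumes at step k can jump straight to the right sequence instead of replaying the stream, which is the property the loader contract depends on.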
Multi-team rollout
Operate it like production data infrastructure, with the model-eval ablation as the release gate.
Per-stage quality gates and validation:
- Per-stage data-quality metrics: survival rate, language distribution, mean quality score, dedup cluster-size distribution, token counts — tracked per stage and alerted on drift.
- Schema/format validation between stages: every stage asserts its input schema so a malformed upstream output fails fast rather than silently poisoning the corpus.
- Canary sampling + human spot-checks: sample N docs per stage for human review; catches extraction regressions (e.g. a Trafilatura version that starts keeping nav boilerplate) that metrics miss.
Reliability:
- Checkpoint-restart on preempted/spot nodes: stages are idempotent and write to content-addressed prefixes, so a killed worker's work is safely redone.
- Cost/throughput dashboards: CPU-hours per stage, $/T-tokens, queue depth — so you catch a stage that's quietly 5x over budget.
The release gate — ablation workflow. No recipe change ships on intuition. The workflow: change the recipe -> build (mostly cached) the new dataset version -> train a small proxy model -> evaluate on the downstream suite -> compare to the current recipe. Only an improvement (or neutral-with-other-benefit) ships. This is the analog of CI/CD for data.
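The comparison step of that workflow can be reduced to a small decision function. This is a sketch under assumptions: the thresholds, the function name `release_gate`, and the benchmark keys are hypothetical, and it ignores run-to-run noise, which in practice calls for multiple seeds per recipe before trusting a small delta.

```python
def release_gate(baseline: dict, candidate: dict,
                 min_mean_gain: float = 0.0, max_regression: float = 0.01):
    """Compare proxy-model eval scores; return (ship, per-benchmark deltas)."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no benchmarks in common")
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    mean_gain = sum(deltas.values()) / len(deltas)
    worst = min(deltas.values())
    # Ship only if the average improves and no single benchmark regresses too far.
    ship = mean_gain > min_mean_gain and worst >= -max_regression
    return ship, deltas


ship, deltas = release_gate(
    baseline={"mmlu": 0.30, "arc": 0.40},
    candidate={"mmlu": 0.33, "arc": 0.41},
)
print(ship, deltas)
```

Requiring both an average gain and a bound on the worst regression keeps a recipe from buying one benchmark at the expense of another; the "neutral-with-other-benefit" case would need an explicit override rather than a looser threshold.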
Compliance & reproducibility:
- Dataset versions and recipe configs live in source control; the manifest makes any version byte-reproducible from (recipe, seed).
- Audit/takedown handling: lineage lets you answer "is this URL in version X?" and remove it, producing a new compliant version — required for licensing disputes, PII requests, and opt-outs.
Bottlenecks & evolution
Current limits:
Observability: the through-line is that the metric that matters is downstream eval movement, so the most important “monitor” is a tight quality-to-eval feedback loop — cheap proxy-model ablations run continuously as the recipe evolves, not just at release.
Evolution — the shift from “more data” to “better data”:
- Model-based filtering everywhere (FineWeb-Edu, DCLM) is now the default, not an enhancement.
- Synthetic and rephrased data to escape unique-token scarcity and to densify scarce skills (math, code, reasoning).
- Multimodal corpora (image/audio/video-text) as the same pipeline generalizes beyond text.
- Continual ingestion of new crawls to keep the corpus fresh.
- Scaling-law-driven token budgets: the token count is chosen from compute-optimal scaling laws, and the bet has decisively moved from accumulating more tokens to curating better ones.
Summary
1. Data quality is the dominant lever on model capability — for pretraining, the corpus, not the architecture, is most of the story. Aggressive eval-validated filtering (keep ~top 10%, per DCLM) beats far larger unfiltered pools.
2. It’s an offline, reproducible, ablation-driven batch pipeline whose core hard problems are trillion-scale fuzzy dedup (MinHash banding + distributed union-find), learned quality filtering (fastText/distilled classifier with an eval-chosen cutoff), rigorous decontamination (13-gram vs the pinned eval suite), deliberate mixing/curriculum (DoReMi weights, code/multilingual upsampling, end-of-training anneal), and a tokenized-shard contract that never starves the cluster.
3. Treat the dataset as a versioned recipe with content-addressed intermediates, manifests/lineage, and idempotent checkpoint-restart stages — so ablations are cheap and any version is byte-reproducible and auditable. Strong answers cite FineWeb/FineWeb-Edu, DCLM, Llama 3, and datatrove as current practice.
4. Prove every choice with downstream evals, not perplexity. Perplexity-tuned filters don’t transfer; global dedup can hurt; more sources can hurt; skipping decontamination silently inflates benchmarks. “Correct” here means a model trains measurably better — provable only with controlled small-model experiments.
Rubric — Senior vs Staff