
Design an LLM Pretraining Data Pipeline

A Staff-level walkthrough of the LLM pretraining data pipeline that Anthropic, OpenAI, and Google DeepMind run to produce frontier training corpora. It is an offline, reproducible, ablation-driven batch system whose core hard problems are trillion-scale fuzzy dedup, learned quality filtering, rigorous benchmark decontamination, deliberate data mixing/curriculum, and a tokenized-shard contract that never starves the training cluster — with every curation choice proven by downstream evals rather than perplexity.

Level: Staff
Category: AI Infrastructure · Data Curation
Interview time: 60 min
WHAT THIS QUESTION TESTS

  • Trillion-scale dedup: exact/Bloom first pass then MinHash-LSH + union-find connected components
  • Filtering hierarchy: cheap heuristics then a learned model-based quality classifier
  • Benchmark decontamination via n-gram (13-gram) overlap against eval sets
  • Tokenize + global shuffle into packed mmap shards under a data-loader SLA that never starves the cluster

★ STAFF-LEVEL SIGNALS

  • Validates every curation choice with downstream-eval ablations, not perplexity or held-out loss
  • Materializes and versions every stage as a content-addressed, reproducible dataset 'recipe'
  • Knows naive cross-snapshot dedup and naive extra 'high-quality' mixing can hurt (FineWeb/DCLM findings)
  • Designs data mixing + end-of-training annealing on top-quality data (DoReMi weights, Llama 3 anneal)
0

Scope & ambiguity

Let me frame what we’re actually building. The goal is the petabyte-scale offline batch system that turns raw web crawls into a clean, deduplicated, decontaminated, mixed, tokenized multi-trillion-token corpus that a training cluster can stream without ever becoming the bottleneck. This is not a serving system and not the training loop — it’s a one-directional ETL DAG that runs for weeks and produces a versioned dataset artifact. The defining property is that success is measured by downstream model evals via controlled ablations, not by SLAs, row counts, or perplexity. So I’ll treat the dataset as a versioned recipe, and every curation choice I make has to be provable by training a small proxy model and seeing benchmarks move.

This is the kind of system frontier labs (Anthropic, OpenAI, Google DeepMind, Meta) build internally to produce their pretraining corpora, and it’s a common AI-infrastructure interview topic because it sits exactly at the seam between classic distributed systems and ML. The open references that anchor current practice are FineWeb / FineWeb-Edu, DCLM (DataComp-LM), Llama 3’s data section, and Hugging Face’s datatrove pipeline library. I’ll cite them throughout because “I know what’s published” is a real signal here.

Scope boundaries:

  • In scope: ingestion of raw crawls -> text extraction -> filtering -> dedup -> decontamination -> PII/safety scrub -> data mixing -> tokenization + global shuffle + sharding. The output is a set of tokenized, packed, seekable shards plus a manifest.
  • Out of scope: the training loop itself, optimizer/parallelism, RLHF / post-training / preference data, and the inference stack. I'll mention the data-loader contract because it constrains the output format, but I won't design the trainer.
  • Sources: CommonCrawl web snapshots (the bulk), code (GitHub-derived, e.g. The Stack), licensed/high-quality text (books, papers, reference), and increasingly synthetic/rephrased data.
  • Success criterion: a model trained on this corpus scores better on a held-out downstream eval suite (MMLU, HellaSwag, ARC, GSM8K, HumanEval, etc.) than one trained on the previous recipe, at matched compute.

Who asks this & what they probe

| Role | Focus | What they probe |
|---|---|---|
| SDE | Offline batch/ETL system | Orchestration of an idempotent restartable DAG over Slurm/Ray/Spark; object-store I/O and shard design to avoid hot keys; dedup as a distributed shuffle + union-find; content-addressed intermediates; versioning, lineage, checkpoint-restart on preempted nodes; caching that makes ablations affordable |
| MLE | Data quality as the capability driver | The filtering hierarchy (heuristic vs learned); how you train the quality classifier and pick the percentile cutoff; dedup threshold vs diversity; decontamination rigor; data mixing / DoReMi / curriculum; validating with downstream ablations not perplexity; synthetic-data and filtering-bias risks |
| Switcher (SDE to AI) | ETL instincts transfer; learn the ML stages | Maps Spark/Ray/distributed-systems knowledge directly onto the pipeline, then internalizes the new vocabulary: model-based filtering, MinHash fuzzy dedup, benchmark decontamination, DoReMi mixing, tokenize + global shuffle + packed shards; and the shift that "correct" means a model trains better, provable only by ablation |
1

Requirements

Functional requirements:

  • Ingest raw crawls (CommonCrawl WARC/WAT/WET), code repos, and licensed corpora from object storage, respecting robots/licensing.
  • Extract clean text from raw HTML (boilerplate removal, main-content extraction).
  • Filter in a hierarchy: language ID, cheap heuristic/quality rules, then a learned model-based quality classifier.
  • Deduplicate at url / document / line granularity, including trillion-scale fuzzy (near-duplicate) dedup.
  • Decontaminate against the benchmark eval suite so reported numbers are trustworthy.
  • Scrub PII and apply safety filters (CSAM hashes, toxicity gates where appropriate).
  • Mix sources with deliberate weights and a curriculum (upsampling, annealing).
  • Tokenize, globally shuffle, and pack into shards.

Non-functional requirements:

  • Reproducibility & versioning: a recipe + seed must reproduce a byte-identical dataset; every artifact is addressable and pinned.
  • Lineage / provenance: for any token, I can trace back to the source document, snapshot, and which stages touched it — needed for audits and takedowns.
  • Compliance: robots.txt and license respect, opt-out/takedown handling, PII removal.
  • Fault tolerance: runs on preemptible/spot nodes for weeks; stages must be idempotent and checkpoint-restartable.
  • Cost ceiling: the full pipeline is O(1M–10M CPU-hours); ablations must not require recomputing it, so heavy caching of content-addressed intermediates is mandatory.

The two requirements that define this system:

| Requirement | Why it dominates the design |
|---|---|
| Data-loader must never starve 10k+ GPUs | The output contract (seekable, packed, mmap shards) is the real "API"; if the loader stalls, millions of dollars of GPUs idle |
| Every curation choice validated by ablation | "Correctness" is downstream eval movement, not row counts or perplexity; this forces small-proxy-model experiments as a first-class workflow |
2

Back-of-envelope estimation

Input pool. CommonCrawl publishes a snapshot roughly monthly; there are ~90–100 usable snapshots, each ~7 PB of raw WARC. Naively that’s hundreds of PB of raw bytes, but snapshots overlap heavily. After extraction and the cheap language/heuristic filters, and before fuzzy dedup, the text pool lands around 200–240 trillion tokens. DCLM operates over a pool of ~240T tokens, which is the canonical number to anchor on.

| Stage | Surviving fraction | Resulting scale |
|---|---|---|
| Raw WARC (per snapshot) |  | ~7 PB x ~95 snapshots |
| Text extraction (WET-quality) | ~10–20% of bytes | hundreds of TB text/snapshot |
| Language + heuristic filters | keep ~8–15% of bytes | ~200–240T token pool |
| MinHash fuzzy dedup | remove ~50–70% | ~70–120T tokens |
| Model-based quality filter | keep ~top 10% of docs | ~10–30T tokens |
| Final mixed/tokenized corpus | + upsampling/synthetic | ~15–30T tokens |

DCLM-Baseline’s headline result is that aggressive model-based filtering keeping roughly the top ~10% of documents beats much larger unfiltered pools — that’s the “better data beats more data” thesis in one number.

Storage budget:

| Tier | Contents | Size |
|---|---|---|
| Raw | Cached WARC/WET | many PB (often streamed, not all retained) |
| Intermediate | Extracted JSONL, MinHash sigs, scores | 10–50 PB across all stages |
| Final | Tokenized packed shards | ~60–120 TB |

The collapse from tens of PB of intermediates to roughly 100 TB of final shards is the whole point. With a 128k-vocab tokenizer the IDs do not fit in 16 bits, so token IDs are stored as uint32 (4 bytes/token): a 15–30T token corpus is ~60–120 TB, still small enough to replicate near the training cluster and mmap. (A smaller vocab that fit in uint16 would halve this to ~30–60 TB, which is the tradeoff if storage near the cluster is tight.)
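The storage arithmetic is easy to sanity-check in a few lines. This is a minimal sketch; the helper names `bytes_per_token` and `corpus_terabytes` are mine, and it assumes a flat array of token IDs with no index or header overhead.

```python
def bytes_per_token(vocab_size: int) -> int:
    """Smallest standard unsigned width (2 or 4 bytes) that holds every token id."""
    return 2 if vocab_size <= 2**16 else 4


def corpus_terabytes(num_tokens: float, vocab_size: int) -> float:
    """Decimal terabytes (1 TB = 1e12 bytes) for a flat array of token ids."""
    return num_tokens * bytes_per_token(vocab_size) / 1e12


for tokens in (15e12, 30e12):
    tb = corpus_terabytes(tokens, vocab_size=128_000)
    print(f"{tokens / 1e12:.0f}T tokens at 128k vocab -> {tb:.0f} TB")
```

Swapping in a vocab of 65,536 or fewer drops the width to 2 bytes and halves both figures, which is the uint16 tradeoff mentioned above.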

Compute: O(1M–10M CPU-hours), dominated by extraction and the dedup shuffle, with weeks of wall-clock on a large CPU pool. The quality-classifier inference pass over hundreds of T tokens is the other big cost — it’s why the classifier is kept small (fastText-class).

3

API design

The “APIs” here are stage contracts, the declarative recipe, the loader contract, and the lineage catalog — not RPC endpoints.

Stage contract — every stage is idempotent: input prefix in object store -> output prefix + manifest.

from typing import Protocol

class Stage(Protocol):
    name: str
    version: str  # pinned; bump => cache invalidation

    def run(self, in_prefix: str, out_prefix: str,
            cfg: "StageConfig") -> "Manifest":
        # Idempotent: re-running with the same (inputs, cfg, version)
        # yields content-identical outputs, so it is safe to restart.
        ...
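One way to make that idempotency concrete is to derive the output prefix from a hash of everything that determines the output. The sketch below is an assumption about how such a key could be built, not a description of any particular lab's system; `stage_cache_key`, `output_prefix`, and the path layout are hypothetical names.

```python
import hashlib
import json


def stage_cache_key(stage_name: str, stage_version: str,
                    input_hashes: list[str], cfg: dict) -> str:
    """Deterministic key: same (inputs, cfg, version) always maps to the same key."""
    payload = json.dumps(
        {
            "stage": stage_name,
            "version": stage_version,
            "inputs": sorted(input_hashes),  # independent of listing order
            "cfg": cfg,
        },
        sort_keys=True,                      # independent of dict insertion order
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def output_prefix(bucket: str, key: str) -> str:
    """Hypothetical layout: fan out on the first two hex chars to avoid hot prefixes."""
    return f"{bucket}/stages/{key[:2]}/{key}/"


key = stage_cache_key("minhash", "1.3.0", ["abc", "def"], {"bands": 14, "rows": 8})
print(output_prefix("s3://corpus-intermediates", key))
```

If the prefix already exists with a complete manifest, the stage can be skipped; if a worker was preempted, rerunning writes to the same place.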

Recipe API — the dataset is a declarative config under version control. Changing it and re-running is how you produce a new dataset version.

recipe:
  version: "corpus-2026.06-rc3"
  seed: 1234567
  sources:
    - {name: cc, snapshots: {from: "2024-10", to: "2025-12"}, weight: 0.62}
    - {name: code, repo: the-stack-v2, weight: 0.17}
    - {name: books, license: licensed, weight: 0.06}
    - {name: multilingual, langs: [es, de, zh, ...], weight: 0.10}
    - {name: synthetic, gen: rephrase-v2, weight: 0.05}
  extract: {tool: trafilatura}
  filters:
    lang_id: {model: fasttext-lid, min_conf: 0.65}
    heuristics: {gopher: true, c4_rules: true}
    # keep_percentile is the score threshold: 0.90 keeps the top ~10% of docs
    quality: {model: dclm-fasttext, keep_percentile: 0.90}
  dedup:
    exact: {granularity: [url, doc]}
    minhash: {ngram: 5, perms: 112, bands: 14, rows: 8,
              jaccard_thresh: 0.85, scope: per_snapshot}
  decontam: {ngram: 13, eval_suite: "evals-v7"}
  tokenizer: {name: bpe-v5, vocab: 128000}
  shuffle: {global: true}
  shard: {seq_len: 8192, format: packed-mmap}

Loader contract — the only thing the trainer sees. Deterministic, seekable, version-pinned.

# shard_00042.bin  (uint32 token ids, packed to seq_len)
# shard_00042.idx  (offsets => O(1) seek to sample i)
def get(shard_id: int, i: int) -> "Tensor":  # shape [seq_len]; mmap, zero-copy
    ...
# tokenizer hash + recipe version pinned in dataset header;
# resuming a run reproduces the exact sample order from (seed, step).
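To make the .bin/.idx contract tangible, here is a minimal stdlib-only sketch of a writer and an mmap-backed reader. The on-disk layout (little-endian uint32 token IDs, one uint64 byte offset per sample) and the names `write_shard` and `ShardReader` are my assumptions for illustration. Note that `struct.unpack_from` copies values out of the mapping; a real loader would more likely wrap the mapping in an array view to stay zero-copy.

```python
import mmap
import os
import struct
import tempfile

SEQ_LEN = 4  # tiny for illustration; the recipe above uses 8192


def write_shard(path_prefix, samples):
    """Write fixed-length samples as <prefix>.bin plus <prefix>.idx."""
    with open(path_prefix + ".bin", "wb") as bin_f, \
         open(path_prefix + ".idx", "wb") as idx_f:
        for sample in samples:
            assert len(sample) == SEQ_LEN
            idx_f.write(struct.pack("<Q", bin_f.tell()))       # byte offset of sample
            bin_f.write(struct.pack(f"<{SEQ_LEN}I", *sample))  # uint32 token ids


class ShardReader:
    def __init__(self, path_prefix):
        self._bin_f = open(path_prefix + ".bin", "rb")
        self._idx_f = open(path_prefix + ".idx", "rb")
        self._bin = mmap.mmap(self._bin_f.fileno(), 0, access=mmap.ACCESS_READ)
        self._idx = mmap.mmap(self._idx_f.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._idx) // 8  # one uint64 offset per sample

    def get(self, i):
        """O(1) seek: look up the offset, then decode seq_len ids at that offset."""
        (offset,) = struct.unpack_from("<Q", self._idx, i * 8)
        return struct.unpack_from(f"<{SEQ_LEN}I", self._bin, offset)

    def close(self):
        self._bin.close()
        self._idx.close()
        self._bin_f.close()
        self._idx_f.close()


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "shard_00000")
    write_shard(prefix, [[1, 2, 3, 4], [5, 6, 7, 8]])
    reader = ShardReader(prefix)
    print(len(reader), reader.get(1))  # 2 (5, 6, 7, 8)
    reader.close()
```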

Lineage / catalog API — provenance and dataset versions as the source of truth.

catalog.resolve("corpus-2026.06-rc3") -> {manifests, source_hashes}
catalog.lineage(doc_id) -> [snapshot, extract_v, filter_v, dedup_cluster, ...]
catalog.takedown(url_or_hash) -> affected_versions # audit/compliance
4

Data model

The artifact evolves stage-by-stage; each representation is content-addressed (hash of content + producing-stage version) so identical inputs are never recomputed.

WARC (raw HTML)
  -> extracted doc: {doc_id, url, text, lang?, ts, src_snapshot}
  -> filtered doc:  {... , lang, quality_score, heuristic_flags}
  -> minhash sig:   {doc_id, sig[112], band_hashes[14]}
  -> dedup cluster: {cluster_id, member_doc_ids[], keep_doc_id}
  -> decontam flag: {doc_id, contaminated: bool, hit_ngrams[]}
  -> mixed stream:  {doc_id, src, sample_weight, epoch_copies}
  -> packed shard:  uint32[] token ids + .idx offsets

Key stores:

| Store | Purpose | Notes |
|---|---|---|
| Content-addressed object store | All intermediates | Dedups recompute; enables cheap ablation branches |
| MinHash signature store | Fuzzy dedup | Banded LSH buckets, sharded by band hash |
| Contamination n-gram index | Decontamination | 13-gram set built from the eval suite |
| Versioned manifest/catalog | Source of truth | Maps recipe version -> exact artifact hashes |

Three-granularity dedup (following Llama 3): url-level (drop re-crawls of the same page), document-level (MinHash near-dup), and line-level (strip boilerplate lines — nav bars, cookie banners — that recur across many docs). Each catches a different class of redundancy.
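The line-level pass is the least familiar of the three, so here is a minimal sketch of the idea: count how many distinct documents each line appears in, then strip lines that recur too often. The function names and the `max_doc_count` threshold are hypothetical; at real scale the counting would be a distributed aggregation per bucket of documents rather than an in-memory `Counter`.

```python
from collections import Counter


def frequent_lines(docs, max_doc_count):
    """Lines that appear in more than `max_doc_count` distinct documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each line once per document so one repetitive page cannot dominate.
        doc_freq.update({line.strip() for line in text.splitlines() if line.strip()})
    return {line for line, n in doc_freq.items() if n > max_doc_count}


def strip_boilerplate(text, boilerplate):
    """Drop lines flagged as cross-document boilerplate; keep everything else."""
    return "\n".join(
        line for line in text.splitlines() if line.strip() not in boilerplate
    )


docs = [
    "Accept all cookies\nAn article about MinHash.",
    "Accept all cookies\nA recipe for sourdough.",
    "Accept all cookies\nRelease notes for v2.",
]
boilerplate = frequent_lines(docs, max_doc_count=2)
print(strip_boilerplate(docs[0], boilerplate))  # An article about MinHash.
```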

5

High-level architecture

A linear DAG. Most stages are embarrassingly parallel per-snapshot/per-shard; two stages — cross-shard dedup and the global shuffle — require global coordination and are where the systems difficulty concentrates.

[ object store: S3 / GCS ]
        |
Ingestion (stream WARC, robots/license gate)
        |
Extraction (Trafilatura: HTML -> clean main text)
        |
Language ID (fastText LID) + heuristic filters (Gopher/C4)
        |
Model-based quality classifier (fastText, keep ~top 10%)
        |
Exact dedup (url/doc) + MinHash fuzzy dedup (banding + union-find)   <== GLOBAL
        |
Decontamination (13-gram vs eval suite)
        |
PII / safety scrub (regex+NER PII, CSAM/toxicity gates)
        |
Data mixing / weighting (DoReMi weights, upsample code/ML)
        |
Tokenize + GLOBAL shuffle + pack                                     <== GLOBAL
        |
[ tokenized mmap shards + manifest ]

Spanning all stages: manifest/lineage catalog + experiment tracking + per-stage metrics

Execution. Orchestrated by Slurm/Ray/Spark executors over object storage — datatrove is the open reference for exactly this (pluggable executors, streaming readers, per-stage stats). Per-snapshot work fans out across thousands of CPU workers reading WARC ranges directly from S3/GCS. Output is written under content-addressed prefixes so a preempted node’s partial output is simply overwritten on retry without corruption.

Where global coordination is unavoidable: near-duplicate dedup must compare each document against every other document in its dedup scope (all shards of one snapshot under a per-snapshot scope, or every snapshot if run globally), so it’s a distributed shuffle (group by LSH band) followed by connected-components/union-find; it cannot stay embarrassingly parallel. The final global shuffle likewise must interleave samples from every source so the trainer never sees long homogeneous runs.

6

Deep dives

WHERE STAFF IS WON

I’ll go deep on the three hardest subsystems — trillion-scale dedup, learned quality filtering, and data mixing/curriculum — then briefly cover decontamination and the tokenize/shuffle/shard contract. The recurring Staff theme: every choice is validated by downstream ablation, and the naive version of each blows up at scale.

Deep dive A: Trillion-scale fuzzy dedup

Exact dedup is easy: hash documents, drop collisions. The hard problem is near-duplicates — boilerplate-heavy pages, mirrored content, slight reformatting — at hundreds of T tokens. All-pairs comparison is O(n^2) and impossible. The standard solution is MinHash + LSH banding + union-find.

Algorithm:

1. Shingle each document into 5-grams (word-level).

2. Compute 112 MinHash permutations per document -> a signature.

3. Split the 112 perms into 14 bands of 8 rows each. Two docs share a candidate bucket if any band matches. The candidate probability follows the S-curve 1-(1-s^r)^b. For 14x8 the usual threshold approximation (1/b)^(1/r) gives ~0.72, and the exact 50% point sits slightly lower, around a Jaccard of ~0.68. The transition is steep, so by ~0.85 nearly all pairs are caught (~99%) and well below ~0.7 most are dropped. You tune (bands, rows) to place that steep transition near the similarity you want to call a “duplicate.”

4. Shuffle by band hash so all candidates land on the same reducer (this is the global step).

5. Run distributed union-find / connected-components over candidate edges to form clusters; keep one representative per cluster.
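The five steps above fit in a short single-process sketch. This is illustrative only: the `(a*h + b) mod p` hash family, the BLAKE2 base hash, and all function names are my choices, and a production pipeline would run steps 4 and 5 as a distributed shuffle plus partitioned connected components rather than in-memory dictionaries.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERMS, BANDS, ROWS = 112, 14, 8   # 14 bands x 8 rows = 112 hashes
PRIME = (1 << 61) - 1                  # modulus for the (a*h + b) hash family
_rng = random.Random(0)                # fixed seed => reproducible signatures
HASH_PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME))
               for _ in range(NUM_PERMS)]


def shingles(text, n=5):
    """Step 1: word-level n-grams (case and whitespace normalized)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def base_hash(shingle):
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text):
    """Step 2: one minimum per hash function over all shingles."""
    hashes = [base_hash(s) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in HASH_PARAMS]


def band_keys(sig):
    """Step 3: each band of ROWS values becomes one bucket key."""
    return [(band, tuple(sig[band * ROWS:(band + 1) * ROWS])) for band in range(BANDS)]


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def dedup_clusters(docs):
    buckets = defaultdict(list)               # step 4: group doc ids by band key
    for doc_id, text in docs.items():
        for key in band_keys(minhash_signature(text)):
            buckets[key].append(doc_id)
    uf = UnionFind()                          # step 5: connected components
    for members in buckets.values():
        for other in members[1:]:
            uf.union(members[0], other)
    clusters = defaultdict(list)
    for doc_id in docs:
        clusters[uf.find(doc_id)].append(doc_id)
    return sorted(sorted(c) for c in clusters.values())


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick  brown fox jumps over the lazy dog near the river bank",
    "c": "Completely unrelated text about tokenizers and shard formats for training runs",
}
print(dedup_clusters(docs))  # [['a', 'b'], ['c']]
```

Documents "a" and "b" differ only in case and spacing, so they produce identical shingle sets and land in every bucket together. Genuine near-duplicates are caught probabilistically, according to the S-curve in step 3.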

| Parameter | Value | Effect |
|---|---|---|
| Shingle | 5-grams | Balance: too short = false matches, too long = misses |
| Permutations | 112 | Signature precision vs cost |
| Bands x rows | 14 x 8 | Steep S-curve, transition centered near ~0.7 |
| Jaccard threshold | ~0.85 | Operating point where nearly all dups are caught |
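The effect of the (bands, rows) choice is easiest to see by evaluating the S-curve directly. A minimal sketch; `candidate_probability` is my name for the standard banding formula 1-(1-s^r)^b.

```python
def candidate_probability(s, bands=14, rows=8):
    """P(two docs with Jaccard similarity s share at least one LSH band)."""
    return 1.0 - (1.0 - s ** rows) ** bands


for s in (0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9):
    print(f"jaccard={s:.2f}  p(candidate)={candidate_probability(s):.3f}")
```

Rerunning the loop with other settings, for example `bands=20, rows=5`, shows how fewer rows per band pull the transition toward lower similarity, at the cost of more false candidates.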

Staff insight — global is not always better. FineWeb’s experiments found that global (all-snapshots) MinHash dedup actually hurt downstream evals versus dedup within each snapshot independently. The reason: aggressive global dedup preferentially removes content that recurs across crawls — which is disproportionately high-quality, frequently-cited material — leaving a relatively over-represented tail of low-quality unique junk. So FineWeb deduped per-snapshot. The decision is empirical, settled by ablation, not by “more dedup = better.”

Trap: treating dedup as an unbounded global all-pairs / connected-components graph. The union-find graph itself can OOM at trillion-token scale if a single LSH bucket explodes (a popular boilerplate band can pull in millions of docs). Mitigations: cap bucket sizes, salt/partition hot bands, and process union-find in a streaming/partitioned fashion rather than materializing the full edge set in memory.
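A minimal sketch of the bucket-cap mitigation, under my own assumptions: edges are emitted as a star from the first member (enough for connected components) and anything past a hypothetical `max_bucket_size` is skipped for that bucket, on the expectation that other bands will still connect true duplicates. Whether that recall loss is acceptable is itself something to check by ablation.

```python
def bucket_edges(members, max_bucket_size=10_000):
    """Yield candidate edges for one LSH bucket as a star, never all pairs."""
    capped = members[:max_bucket_size]  # members past the cap are skipped here
    for other in capped[1:]:
        yield (capped[0], other)        # k-1 edges instead of k*(k-1)/2


hot_bucket = [f"doc{i}" for i in range(1_000_000)]
print(sum(1 for _ in bucket_edges(hot_bucket)))  # 9999
```

Because the function is a generator, edges stream to the union-find step without the full edge set ever being held in memory.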

Deep dive B: Learned quality filtering

The biggest lever on capability. The hierarchy is cheap-to-expensive: language ID -> heuristics (Gopher rules: symbol ratios, bullet fraction, mean word length; C4 rules: drop pages without terminal punctuation, “lorem ipsum”, etc.) -> a model-based quality classifier.
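The heuristic layer is plain string arithmetic, which is why it is so cheap. The sketch below is modeled loosely on the Gopher and C4 rules named above; the exact thresholds, the stop-word list, and applying the terminal-punctuation check to the whole document (C4 applies it per line) are simplifying assumptions of mine.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_heuristics(text):
    """Return True if the document survives a handful of cheap quality rules."""
    words = text.split()
    lines = [line for line in text.splitlines() if line.strip()]
    if not words or not lines:
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    symbol_ratio = (text.count("#") + text.count("...")) / len(words)
    bullet_frac = sum(
        line.lstrip().startswith(("-", "*", "•")) for line in lines
    ) / len(lines)
    stop_hits = len(STOP_WORDS & {w.lower().strip(".,;:!?") for w in words})
    ends_with_punct = text.rstrip().endswith((".", "!", "?", '"'))
    return (
        3 <= mean_word_len <= 10          # Gopher-style mean word length bounds
        and symbol_ratio <= 0.1           # too many '#' or '...' per word
        and bullet_frac <= 0.9            # page is almost entirely bullets
        and stop_hits >= 2                # looks like natural-language prose
        and "lorem ipsum" not in text.lower()
        and ends_with_punct               # simplified C4 terminal-punctuation rule
    )


print(passes_heuristics("The model is trained on text that we have cleaned with care."))
print(passes_heuristics("### ### ### buy now ###"))
```

The first call prints True and the second False.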

How you train the classifier (DCLM / FineWeb-Edu pattern):

1. Positives = a proxy for “good” text: documents linked from curated sources, ELI5/OpenHermes-style instruction data, Wikipedia-referenced pages, or (FineWeb-Edu) pages an LLM rated as educational.

2. Negatives = random web text.

3. Train a fastText linear classifier (cheap enough to score hundreds of T tokens) — or distill an LLM’s educational rating into a small embedding-plus-regressor model as FineWeb-Edu does.

4. Pick the percentile cutoff (e.g. keep top 10%) by ablation, not by classifier accuracy: train proxy models at several cutoffs, evaluate downstream.
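Step 4 amounts to materializing one filtered subset per candidate cutoff and handing each to a proxy-model run. A minimal sketch of the subset side, assuming the classifier scores already exist; `percentile_threshold`, `candidate_subsets`, and the default fractions are illustrative names and values, and the proxy training and eval are out of scope here.

```python
def percentile_threshold(scores, keep_fraction):
    """Score such that roughly the top `keep_fraction` of docs are at or above it."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep_n = max(1, round(len(ranked) * keep_fraction))
    return ranked[keep_n - 1]


def candidate_subsets(scored_docs, keep_fractions=(0.05, 0.10, 0.20)):
    """One filtered subset per cutoff; each would feed its own proxy-model ablation."""
    scores = [score for _, score in scored_docs]
    subsets = {}
    for frac in keep_fractions:
        threshold = percentile_threshold(scores, frac)
        subsets[frac] = [doc_id for doc_id, score in scored_docs if score >= threshold]
    return subsets


scored = [(f"doc{i}", i / 100) for i in range(100)]  # stand-in classifier scores
for frac, kept in candidate_subsets(scored).items():
    print(f"keep top {frac:.0%}: {len(kept)} docs")
```

The cutoff that ships is whichever subset's proxy model scores best downstream, not the one with the highest classifier accuracy.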

Staff insight — validate on downstream evals, never perplexity. A filter tuned to minimize held-out perplexity does not transfer to capability; perplexity rewards predictable text and can be minimized by selecting bland, repetitive content. The discipline DCLM/FineWeb codified is: train a small proxy model on each candidate subset and measure MMLU/ARC/HellaSwag/etc. FineWeb-Edu’s single educational-quality classifier produced one of the strongest small-data corpora precisely because the cutoff was eval-validated.

| Filter layer | Cost | Mechanism |
|---|---|---|
| Language ID | very cheap | fastText LID, confidence threshold |
| Heuristics | cheap | Gopher + C4 hand rules |
| Model-based quality | moderate (the big inference pass) | fastText / distilled LLM rating, keep top ~10% |

Trap — filtering bias. Aggressive learned filtering narrows the distribution toward whatever your positives look like (often Wikipedia-/English-/formal-leaning). That can erase dialects, low-resource languages, code styles, and niche domains, capping multilingual and long-tail capability. Mitigations: per-domain/per-language cutoffs rather than one global threshold, monitoring the distribution of kept data, and treating “what got removed” as a first-class metric.

Deep dive C: Data mixing & curriculum

Given clean sources, how much of each and in what order materially changes the model. Two layers: static mixture weights, and a curriculum over training.

Mixture weights — DoReMi. Rather than hand-tuning weights, DoReMi trains a small (~280M) proxy model with Group DRO to find domain weights that minimize worst-case excess loss, then applies those weights to train the large model. Practically you also deliberately upsample code and multilingual data beyond their natural web frequency because they’re high-value and scarce.
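The domain-weight update at the heart of this is a small exponentiated-gradient step. The sketch below reflects my reading of the published algorithm with simplifications: excess loss is clipped per domain rather than per token, the proxy and reference models are not shown, and `doremi_step`, `lr`, and `smoothing` are illustrative names and values.

```python
import math


def doremi_step(weights, excess_loss, lr=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of domain weights.

    weights:     dict domain -> current weight (sums to 1)
    excess_loss: dict domain -> proxy loss minus reference loss on this batch
    """
    # Upweight domains where the proxy still lags the reference; clip negatives to 0.
    raw = {d: w * math.exp(lr * max(excess_loss[d], 0.0)) for d, w in weights.items()}
    total = sum(raw.values())
    uniform = 1.0 / len(weights)
    # Mix with uniform so no domain weight collapses to zero.
    return {d: (1 - smoothing) * v / total + smoothing * uniform for d, v in raw.items()}


weights = {"web": 0.6, "code": 0.2, "books": 0.2}
weights = doremi_step(weights, {"web": 0.05, "code": 0.40, "books": -0.10})
print({d: round(w, 3) for d, w in weights.items()})
```

In the full method this step runs alongside proxy training, and the weights applied to the large model are the average over all steps rather than the final value.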

Curriculum / annealing. Llama 3 (and others) anneal on top-quality data at the end: during the final phase of pretraining, decay the learning rate toward zero while shifting the mixture toward the highest-quality, highest-value sources (curated math, code, reference). The model spends its last, most-retained gradient steps on the best data.
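A minimal sketch of what the anneal looks like as a schedule. It assumes a constant learning rate before the anneal (real runs typically use warmup plus a decay schedule) and linear interpolation of both the learning rate and the mixture; `anneal_schedule` and all the numbers in the usage example are hypothetical.

```python
def anneal_schedule(step, total_steps, anneal_steps, peak_lr, base_mix, anneal_mix):
    """Return (lr, mixture) at `step`; both move linearly during the final phase."""
    start = total_steps - anneal_steps
    if step < start:
        return peak_lr, dict(base_mix)
    t = (step - start) / anneal_steps  # 0 at anneal start, 1 at the end of training
    lr = peak_lr * (1.0 - t)
    mix = {
        src: (1 - t) * base_mix.get(src, 0.0) + t * anneal_mix.get(src, 0.0)
        for src in set(base_mix) | set(anneal_mix)
    }
    return lr, mix


base = {"web": 0.8, "code": 0.2}
top_quality = {"web": 0.3, "code": 0.3, "math": 0.4}
for step in (0, 900, 950, 1000):
    lr, mix = anneal_schedule(step, 1000, 100, 3e-4, base, top_quality)
    print(step, f"lr={lr:.2e}", {k: round(v, 2) for k, v in sorted(mix.items())})
```

Because both mixtures sum to 1, every interpolated mixture does too, so the sampler never needs renormalizing mid-anneal.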

| Knob | Technique | Reference |
|---|---|---|
| Domain weights | DoReMi proxy weights (280M) | DoReMi |
| Scarce high-value data | Upsample code / multilingual | Llama 3, common practice |
| End-of-training | LR -> 0 anneal on top-quality data | Llama 3 |
| Source selection | Fewer-but-better can beat more | DCLM |

Staff insight — more sources can hurt. DCLM found that naively adding more data sources sometimes lowered downstream scores — a noisy or off-distribution source dilutes the mixture. Source inclusion is itself an ablation decision, not a default-yes.

Briefly: decontamination

Build a set of 13-grams from every example in the eval suite, then flag/remove any training document containing those n-grams (the 13-gram overlap check popularized by GPT-3; Llama 3’s published contamination analysis uses a shorter 8-gram window). Trap: skipping or under-powering decontamination silently inflates benchmark numbers because the model has memorized test items, and you only discover it when the model fails to generalize. Decontamination must run against the exact eval suite version used for reporting, and that version is pinned in the recipe.
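A minimal sketch of the overlap check. It assumes lowercase whitespace tokenization, which is a simplification (real pipelines normalize punctuation and often work on tokenizer IDs), and the function names are mine. At corpus scale the eval index would typically be broadcast to workers or held in a Bloom filter rather than a Python set.

```python
def ngrams(text, n=13):
    """All word-level n-grams in `text`; empty if the text is shorter than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_examples, n=13):
    """Union of n-grams over every example in the pinned eval suite."""
    index = set()
    for example in eval_examples:
        index |= ngrams(example, n)
    return index


def contamination_hits(doc_text, eval_index, n=13):
    """N-grams the training document shares with the eval suite."""
    return ngrams(doc_text, n) & eval_index


eval_suite = [
    "A train leaves the station at nine and travels sixty miles per hour "
    "for three hours how far does it go"
]
index = build_eval_index(eval_suite)
leaked = "Homework help: " + eval_suite[0] + " Answer: 180 miles."
clean = "Trains are a common way to travel between cities in many countries."
print(bool(contamination_hits(leaked, index)), bool(contamination_hits(clean, index)))
```

This prints True for the leaked document and False for the clean one.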

Briefly: tokenize, global shuffle, shard

Tokenize with a pinned BPE tokenizer (vocab ~128k), pack token streams into fixed seq_len sequences (concatenating documents with separators to avoid padding waste), then globally shuffle so consecutive samples don’t come from the same source/snapshot. Write .bin (packed uint32) + .idx (offsets) shards for O(1) seekable mmap reads. Trap: recomputing this — or the whole pipeline — per ablation burns millions of CPU-hours; content-addressed caching means an ablation that only changes the mix weights reuses every upstream artifact and only re-runs mixing + tokenize.
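The pack-then-shuffle step can be sketched in memory. Assumptions worth flagging: `eos_id` as the document separator, carrying leftover tokens forward instead of padding, and a seeded in-memory permutation standing in for what would be an external, multi-pass shuffle at real scale. The function names are illustrative.

```python
import random


def pack_sequences(tokenized_docs, seq_len, eos_id):
    """Concatenate docs with a separator and cut into fixed-length samples."""
    buffer, samples = [], []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            samples.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return samples, buffer  # leftover tokens carry into the next batch, unpadded


def global_shuffle(samples, seed):
    """Deterministic permutation: same (samples, seed) gives the same order."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    return [samples[i] for i in order]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
samples, leftover = pack_sequences(docs, seq_len=4, eos_id=0)
print(samples, leftover)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]] []
print(global_shuffle(samples, seed=1234567))
```

Seeding the permutation from the recipe seed is what lets a resumed run reproduce the exact sample order.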

7

Multi-team rollout

Operate it like production data infrastructure, with the model-eval ablation as the release gate.

Per-stage quality gates and validation:

  • Per-stage data-quality metrics: survival rate, language distribution, mean quality score, dedup cluster-size distribution, token counts — tracked per stage and alerted on drift.
  • Schema/format validation between stages: every stage asserts its input schema so a malformed upstream output fails fast rather than silently poisoning the corpus.
  • Canary sampling + human spot-checks: sample N docs per stage for human review; catches extraction regressions (e.g. a Trafilatura version that starts keeping nav boilerplate) that metrics miss.

Reliability:

  • Checkpoint-restart on preempted/spot nodes: stages are idempotent and write to content-addressed prefixes, so a killed worker's work is safely redone.
  • Cost/throughput dashboards: CPU-hours per stage, $/T-tokens, queue depth — so you catch a stage that's quietly 5x over budget.

The release gate — ablation workflow. No recipe change ships on intuition. The workflow: change the recipe -> build (mostly cached) the new dataset version -> train a small proxy model -> evaluate on the downstream suite -> compare to the current recipe. Only an improvement (or neutral-with-other-benefit) ships. This is the analog of CI/CD for data.

Compliance & reproducibility:

  • Dataset versions and recipe configs live in source control; the manifest makes any version byte-reproducible from (recipe, seed).
  • Audit/takedown handling: lineage lets you answer "is this URL in version X?" and remove it, producing a new compliant version — required for licensing disputes, PII requests, and opt-outs.
8

Bottlenecks & evolution

Current limits:

| Limit | Why it bites | Direction |
|---|---|---|
| Unique-token scarcity | High-quality web text is finite vs growing token budgets | Synthetic/rephrased data, multi-epoch, multilingual expansion |
| Dedup connected-components scaling | Union-find graph/hot buckets OOM at trillion scale | Partitioned/streaming union-find, bucket caps |
| Quality-classifier bias | Aggressive filtering narrows the distribution | Per-domain cutoffs, distribution monitoring |
| Loader as train-time bottleneck | If shards can't saturate 10k+ GPUs, compute idles | Prefetch, sharding to avoid hot keys, near-cluster replication |

Observability: the through-line is that the metric that matters is downstream eval movement, so the most important “monitor” is a tight quality-to-eval feedback loop — cheap proxy-model ablations run continuously as the recipe evolves, not just at release.

Evolution — the shift from “more data” to “better data”:

  • Model-based filtering everywhere (FineWeb-Edu, DCLM) is now the default, not an enhancement.
  • Synthetic and rephrased data to escape unique-token scarcity and to densify scarce skills (math, code, reasoning).
  • Multimodal corpora (image/audio/video-text) as the same pipeline generalizes beyond text.
  • Continual ingestion of new crawls to keep the corpus fresh.
  • Scaling-law-driven token budgets: the token count is chosen from compute-optimal scaling laws, and the bet has decisively moved from accumulating more tokens to curating better ones.

Summary

1. Data quality is the dominant lever on model capability — for pretraining, the corpus, not the architecture, is most of the story. Aggressive eval-validated filtering (keep ~top 10%, per DCLM) beats far larger unfiltered pools.

2. It’s an offline, reproducible, ablation-driven batch pipeline whose core hard problems are trillion-scale fuzzy dedup (MinHash banding + distributed union-find), learned quality filtering (fastText/distilled classifier with an eval-chosen cutoff), rigorous decontamination (13-gram vs the pinned eval suite), deliberate mixing/curriculum (DoReMi weights, code/multilingual upsampling, end-of-training anneal), and a tokenized-shard contract that never starves the cluster.

3. Treat the dataset as a versioned recipe with content-addressed intermediates, manifests/lineage, and idempotent checkpoint-restart stages — so ablations are cheap and any version is byte-reproducible and auditable. Strong answers cite FineWeb/FineWeb-Edu, DCLM, Llama 3, and datatrove as current practice.

4. Prove every choice with downstream evals, not perplexity. Perplexity-tuned filters don’t transfer; global dedup can hurt; more sources can hurt; skipping decontamination silently inflates benchmarks. “Correct” here means a model trains measurably better — provable only with controlled small-model experiments.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
|---|---|---|
| Pipeline & orchestration | Lists ingest→filter→tokenize stages. | Designs a DAG of idempotent, restartable, content-addressed stages over Slurm/Ray/Spark with checkpoint-restart on preempted nodes. |
| Deduplication | Removes exact duplicates by hash. | Layers Bloom/exact then MinHash-LSH (5-grams, ~112 hashes, 0.85 Jaccard) + union-find; per-snapshot first, knows global dedup can hurt. |
| Quality filtering | Applies length/symbol heuristics. | Adds a learned classifier (top ~10%), validates on downstream benchmarks not perplexity, and watches for distribution bias. |
| Decontamination | Mentions removing test data. | Runs 13-gram overlap against all reported eval sets with detect-then-confirm, logged per-benchmark and tied to the recipe version. |
| Data mixing & curriculum | Mixes sources roughly equally. | Sets DoReMi proxy weights, upsamples code/multilingual, anneals on top-quality data at the end; knows more sources can hurt. |
| Reproducibility & cost | Notes the pipeline is expensive. | Versions every artifact as a recipe and caches intermediates so ablations re-run only the cheap tail, saving millions of CPU-hours. |
| Loader contract | Tokenizes the corpus. | Pre-tokenizes, globally shuffles, and writes packed indexed mmap shards so the loader saturates 10k+ GPUs with <1% idle. |