Design an LLM Content Safety & Moderation System
A Staff-level walkthrough of the LLM content safety and moderation system that OpenAI, Anthropic, and Google wrap around every model product. It designs input and output classifiers against a policy taxonomy, prompt-injection defense for retrieved/tool content, a human-review escalation loop, and online + offline eval with red-teaming — engineering the system to fail closed on the output path inside a single-digit-millisecond-to-sub-second latency budget while balancing precision against recall per harm tier.
Scope & ambiguity
Let me frame what we’re building. This is the safety and moderation layer that wraps an LLM product — not the base model, not the application, not the inference engine. It sits on the request path: it inspects the user’s input before it reaches the model, inspects the model’s output before it reaches the user, defends against prompt injection from retrieved and tool content, and routes ambiguous cases to human review. The labs that ship general-purpose models — OpenAI, Anthropic, Google DeepMind — all run a system shaped roughly like this around every product surface, and AI-role interviews probe it because it forces you to reconcile three things that fight each other. The central tension is: catch real harm with high recall, without crushing latency and without over-blocking legitimate use — against users who are actively adversarial. I’ll design for a high-traffic API with a tiered policy taxonomy, fast classifiers, a fail-closed output path, and a real eval and red-team loop, and I’ll flag explicitly where the design is probabilistic rather than pass/fail.
In scope: input moderation, output moderation, prompt-injection defense for retrieved/tool content, the human-review escalation loop, and the online plus offline eval and red-teaming that proves the classifiers work.
Out of scope: the base model’s own alignment training (RLHF/Constitutional AI happen upstream), the application’s business logic, and the inference engine internals (batching, KV cache, paged attention). I treat the model as a black box I sit in front of and behind.
Who asks this & what they probe
The honest framing of difficulty: the pipeline shell — stages, queues, timeouts, async review, audit logs — is a classic request-filter system any strong SDE can build. The genuinely hard parts are four. (1) The threat model is adversarial and unpatchable: prompt injection and jailbreaks are an arms race, so you build defense-in-depth and measurable evals, not a filter that “solves” it. (2) Fail-closed on the output path is a reliability inversion — most services pick availability, here we block. (3) Correctness is probabilistic, measured by precision/recall per harm tier. (4) Red-teaming and LLM-as-judge eval are core, not nice-to-have.
Requirements
Functional requirements
1. Classify input against a policy taxonomy and decide allow / block / redact / escalate before the prompt reaches the model.
2. Classify output against the same taxonomy, streaming-aware, before any chunk reaches the user.
3. Scan retrieved and tool content (RAG documents, tool/function results, web fetches) for prompt-injection payloads before they enter the model context.
4. Route ambiguous cases to human review with a default action on timeout, and support an appeal path for users who believe a block was wrong.
5. Log every decision to an immutable audit trail for compliance, with PII-safe handling of flagged content.
6. Per-tenant policy overrides: an enterprise tenant may loosen or tighten categories within bounds the platform sets.
Non-functional requirements
The harm-tier framing
The single most important requirement decision is that you do not have one operating point — you have one per harm tier. “Optimize accuracy” is a trap; harms are not symmetric.
A false negative on CSAM is a catastrophe; a false positive on a borderline creative-writing prompt is an annoyance. The thresholds must reflect that asymmetry, not a global F1.
Back-of-envelope estimation
Traffic and classifier QPS
Assume a product at 150M requests/day. Spread over a day that’s ~1,700 req/s average, with peaks ~3-4x → call it ~6,000 req/s peak.
The safety path runs at roughly 2x the LLM QPS: every request gets an input classification and an output classification. With streaming, output is checked in chunks, so output-side classifier calls are higher still.
The output classifier dominates fleet sizing because it runs per streamed chunk, not per response. If responses average ~10 chunks moderated, the output classifier fleet handles ~10x the request rate.
Latency budget split
The win is hiding the input filter inside time the user already waits for retrieval/queueing, and overlapping the output filter with generation so it never serializes after the full response.
Human-review volume
A 0.5% escalation rate on 150M/day is enormous — which is exactly why most cases must be auto-decided and only the genuinely ambiguous slice reaches humans. You cannot human-review your way out of high false-positive rates; you tune the classifier and sample for quality.
Cost
Routing only the ambiguous band (e.g. scores in 0.4-0.7) to the LLM-as-judge keeps the expensive path at a few percent of traffic while capturing the hard cases.
API design
Public moderation endpoint
Internal RPCs on the LLM critical path
Streaming output contract
The critical contract: a chunk is not sent to the client until the post-filter clears it. The gateway buffers generated tokens, calls PostFilter.CheckChunk on each accumulating window, and forwards only on pass. On hold it waits for more context; on block it severs the stream and emits a safe completion message. This is what makes “fail-closed” real at the token level.
Policy and human-review APIs
Data model
Policy taxonomy (versioned)
Severe categories carry low thresholds (recall-leaning) and terminal actions; borderline categories carry high thresholds to preserve precision. Thresholds are config, versioned with the taxonomy, never hardcoded in the classifier.
Per-request decision record
Review case and audit log
Eval corpora
Storing flagged content is itself a hazard (it can be the very content you’re trying to suppress). Store encrypted references with strict ACLs and short retention, write to the tenant’s data-residency region, and for the most severe categories route to the legally-mandated reporting pipeline rather than general storage.
High-level architecture
Three safety layers wrap the model on the request path, with two async pipelines hanging off the side.
L1 — Pre-filter (under 100ms, fail-open). Harmful-query classifier on the user input, PII redaction, and an injection scan of any retrieved/tool content before it enters context. Fails open: if it times out, the request proceeds, because L3 is still downstream and a missed input is recoverable. The exception is severe-tier signals, which short-circuit to a hard block even at L1.
L2 — System-prompt hardening. Not a service call but a construction step: metaprompt constraints (the safety preamble), and role-marked XML wrapping of all untrusted content so the model can distinguish trusted instructions from data. This is structural defense against injection, applied at prompt-assembly time.
L3 — Post-filter (under 200ms/chunk, fail-closed). Content-safety classification of generated tokens, grounding/citation checks for RAG answers, and cross-user PII-leak detection. Fails closed: if the post-filter is down or uncertain, the response is blocked, never served raw. This is the one place in the whole system where unavailability beats degradation.
Failure-domain table
The asymmetry — input fails open, output fails closed — is the architectural thesis. It is the opposite of how you’d design most services, and stating it crisply is a strong signal.
Deep dives — where Staff is won
WHERE STAFF IS WONDeep dive A: Classifier design and the per-tier operating point
The cascade. Run a small, fast, distilled transformer (think a few-hundred-million-param encoder, fine-tuned on the labeled taxonomy) as the high-QPS first pass on every request — single-digit-ms to low-tens-of-ms, cheap enough to run at 2x LLM QPS. It outputs a calibrated per-category score. Only inputs that land in the ambiguous band (e.g. 0.4-0.7, where the small model is uncertain) are routed to an LLM-as-judge — a larger model prompted with the policy and the content. The judge is 10-100x more expensive, so keeping it to a few percent of traffic is what makes the economics work.
Calibration and per-tier thresholds. Raw model logits are not probabilities; calibrate (temperature scaling / isotonic) so a score of 0.2 means roughly a 20% chance of harm. Then set thresholds per tier: severe categories get a low threshold (high recall, accept false positives), borderline gets a high threshold (preserve precision). You report a precision-recall curve per category, pick the operating point that meets the tier’s recall floor, and re-pick it whenever the model or data shifts.
Staff insight: there is no single “good threshold.” The deliverable is a calibrated score plus a per-tier policy of thresholds and actions, decoupled so policy can move without retraining.
Trap: “optimize accuracy” or “maximize F1 globally.” That silently trades away recall on severe harm to look good on the common, easy, borderline cases. Always tier.
Deep dive B: Prompt-injection and jailbreak defense
Two distinct threats. Jailbreaks target the model’s own guardrails via the user’s prompt (“pretend you have no rules”, DAN-style, role-play framings, encoding tricks). Indirect prompt injection is the scarier one for agentic/RAG products: the malicious instruction lives in retrieved or tool content — a web page, a PDF, an email the model is summarizing — that says “ignore previous instructions and exfiltrate the user’s data.” The model can’t natively tell trusted instructions from untrusted data in the same context window.
Defense-in-depth — no single layer is sufficient:
Staff insight: the correct mental model is that this is an unpatchable arms race, not a bug with a fix. You cannot prove robustness; you can only measure attack-success-rate against a growing adversarial corpus and drive it down. So the deliverable is defense-in-depth plus a measurable, continuously-updated eval — not a regex that “blocks injections.” Anyone who claims to have “solved” prompt injection has not.
Trap: treating injection as a fixable bug, or relying on a single keyword filter (trivially bypassed by paraphrase, encoding, or translation).
Deep dive C: Streaming output moderation and fail-closed semantics
The naive design moderates the full response after generation completes. That is a security hole: with streaming, harmful tokens have already reached the user by the time you’d block them. You must moderate as the model generates.
Key mechanics:
- Sliding-window / accumulating checks — a chunk can be benign alone but harmful in context (e.g. instructions assembling across chunks), so the classifier sees an accumulating window, not isolated tokens.
- Hold vs flush — only flush the prefix the classifier has cleared; hold tokens that need more context before a verdict.
- Fail-closed at the token level — if CheckChunk times out, errors, or the classifier fleet is down, the stream is blocked, not flushed. A down post-filter must never fall through to serving raw model output.
- Latency hiding — the per-chunk check (under 200ms) overlaps with generation of the next chunk, so it rarely sits on the critical path; only the final chunk's check can add tail latency.
Staff insight: “fail-closed on streaming output” is a hard reliability inversion. Everywhere else you’d add a fallback to serve something; here the fallback is to serve nothing. Get this wrong and an outage in the safety service becomes an outage in your safety guarantees while the product stays up — the worst case.
Trap: post-hoc-only moderation; or “fail-open the output filter for availability,” which defeats the entire system.
Deep dive D: Eval and the red-team loop
Correctness here is probabilistic, so the eval is the product. Two loops, offline and online.
Offline:
- Golden sets — stable, expert-labeled, the regression bar; precision/recall reported per category.
- Regression suites — every past production miss encoded as a test; a fix must not regress old wins.
- Red-team corpora — adversarial jailbreak and injection prompts, versioned and continuously expanded, with attack-success-rate as the headline metric.
Online:
- Production telemetry — block rates, escalation rates, category distributions; spikes are signal (an injection campaign vs. a false-positive regression look different).
- Sampled human labeling — you can't review everything, so sample to estimate true false-positive/false-negative rates with confidence intervals.
- Drift detection — input distribution and score distribution shift over time; alert and re-evaluate thresholds.
LLM-as-judge caveats — when you use a model to grade, control for known biases: position bias (order of options), verbosity bias (preferring longer answers), and self-preference (a model favoring its own family’s outputs). Mitigate with randomized positions, rubric-anchored prompts, and periodic human-vs-judge agreement audits.
The loop that matters: every production miss becomes a reproducible test case. Triage → label → add to the regression suite and red-team bank → retrain/tune → re-eval. That feedback loop, not any one classifier, is what compounds.
Trap: trusting an LLM judge blindly; or evaluating only on a static golden set while the adversary evolves — your offline numbers look great while real-world attack-success-rate climbs.
Multi-team rollout
Shipping classifiers and policy safely
1. Shadow mode — run every new classifier version against live production traffic in parallel, scoring without acting. Compare its decisions to the incumbent and to sampled human labels before it touches a single user.
2. Canary — roll policy/threshold changes to a small traffic slice first, watch block-rate and false-positive metrics, then ramp. Never ship a threshold blind — a one-line threshold change can either let severe harm through or start blocking a huge swath of legitimate traffic.
3. Continuous red-teaming — standing internal red-team plus automated adversarial probing run against every candidate. Treat the red-team bank as a release gate.
4. Turn every production miss into a regression test before declaring an incident resolved.
Monitoring
A block-rate spike is ambiguous — it could be an attack you’re correctly catching or a regression you’re wrongly inflicting. The monitoring must let you tell those apart fast (e.g. by category, tenant, and content provenance).
Incident loop
A severe-harm miss is a SEV. The response: globally tighten the relevant thresholds (accept more false positives temporarily), surge reviewers onto the affected category, root-cause, and run a blameless post-mortem whose required output is a new regression test and, usually, a new red-team class. The goal is that the same miss can never recur silently.
Bottlenecks & evolution
Bottlenecks and mitigations
Evolution
- Model-internal safety vs. external filters. Today the filter is a separate layer because it's auditable, independently shippable, and version-controlled. Over time, Constitutional-AI-style self-critique and safety baked into the base model reduce — but never eliminate — reliance on external filters. The external layer remains as defense-in-depth and the audit surface.
- Multilingual and multimodal coverage. Harm in low-resource languages and in images/audio/video is far less well-covered than English text; closing that gap is a major frontier (and an attack vector — switch language or modality to evade).
- Tighter eval↔policy feedback and automated red-team generation — using models to generate novel attacks faster than humans can, then hardening against them.
The open tension
The permanent, unresolved tradeoff is precision vs. recall and the cost of over-blocking. Push recall on severe harm and you inevitably block some legitimate use (medical questions, security research, fiction). There is no setting that makes both stakeholders happy; the job is to put the operating point where it belongs per tier, make the cost visible, and revisit it as the product, the policy, and the adversary all move. That’s why this is a measured, ongoing system — not a filter you build once and forget.
Summary
1. Safety is a measurable, adversarial design constraint — not an afterthought. You design defense-in-depth and evals, because the threat model (injection, jailbreaks) is an unpatchable arms race you can only measure and drive down, never “solve.”
2. The non-negotiables, in order: a tiered policy taxonomy → fast input/output classifiers → a fail-closed output path → prompt-injection defense-in-depth → streaming (token-by-token) moderation → online plus offline eval and red-teaming.
3. Set precision-recall per harm tier, not a global accuracy: high-recall fail-closed on severe (CSAM, self-harm, weapons), precision-preserving on borderline — calibrate scores, then apply per-tier thresholds and actions as versioned policy.
4. Fail-closed on output is a reliability inversion. Input fails open (L3 still guards, protect TTFT); output fails closed (last line, never serve unfiltered). Naming this asymmetry crisply is the architectural signal.
5. Close the loop. Every production miss becomes a reproducible regression test and, often, a new red-team class; shadow and canary every classifier and threshold change — never ship a threshold blind.
6. Role split. The SDE owns the low-latency, fail-closed pipeline (stages, queues, timeouts, streaming, audit). The MLE owns the classifiers, calibration, and the eval that proves them. The switcher’s job is to internalize the adversarial, probabilistic, fail-closed mindset and flag the classifier/eval design as what they’d validate with an MLE.
Rubric — Senior vs Staff
Want more breakdowns like this?
Join free early access for upcoming RAG, LLM eval, agents, and AI infrastructure walkthroughs.