AI System DesignStaffLLM SafetyModeration

Design an LLM Content Safety & Moderation System

A Staff-level walkthrough of the LLM content safety and moderation system that OpenAI, Anthropic, and Google wrap around every model product. It designs input and output classifiers against a policy taxonomy, prompt-injection defense for retrieved/tool content, a human-review escalation loop, and online + offline eval with red-teaming — engineering the system to fail closed on the output path inside a single-digit-millisecond-to-sub-second latency budget while balancing precision against recall per harm tier.

Level: Staff
Category: AI Infrastructure · Trust & Safety
Interview time: 60 min

100% free · No login required

WHAT THIS QUESTION TESTS

·Policy taxonomy driving both input and output classifiers

·Fail-closed on the output path vs fail-open on the input path, within a latency budget

·Prompt-injection / jailbreak defense for retrieved and tool-returned content

·Online + offline eval with golden sets, red-teaming, and a human-review escalation loop

★ STAFF-LEVEL SIGNALS

★Sets a precision-recall operating point per harm tier — high-recall fail-closed on severe harms, precision-preserving on borderline

★Moderates streaming output token-by-token rather than only after generation completes

★Treats prompt injection as an unpatchable arms race: defense-in-depth + measurable evals, not a 'fix'

★Closes the loop: every production miss becomes a reproducible regression test, with drift monitoring

Scope & ambiguity

Let me frame what we’re building. This is the safety and moderation layer that wraps an LLM product — not the base model, not the application, not the inference engine. It sits on the request path: it inspects the user’s input before it reaches the model, inspects the model’s output before it reaches the user, defends against prompt injection from retrieved and tool content, and routes ambiguous cases to human review. The labs that ship general-purpose models — OpenAI, Anthropic, Google DeepMind — all run a system shaped roughly like this around every product surface, and AI-role interviews probe it because it forces you to reconcile three things that fight each other. The central tension is: catch real harm with high recall, without crushing latency and without over-blocking legitimate use — against users who are actively adversarial. I’ll design for a high-traffic API with a tiered policy taxonomy, fast classifiers, a fail-closed output path, and a real eval and red-team loop, and I’ll flag explicitly where the design is probabilistic rather than pass/fail.

In scope: input moderation, output moderation, prompt-injection defense for retrieved/tool content, the human-review escalation loop, and the online plus offline eval and red-teaming that proves the classifiers work.

Out of scope: the base model’s own alignment training (RLHF/Constitutional AI happen upstream), the application’s business logic, and the inference engine internals (batching, KV cache, paged attention). I treat the model as a black box I sit in front of and behind.

Who asks this & what they probe

Role

Focus

What they probe

SDE

Low-latency fail-closed pipeline

Pre/post-filter stages, fail-open input vs fail-closed output domains, latency budget under 100-200ms without blowing TTFT, async review queue with timeout defaults, versioned policy engine, per-tenant overrides, token-by-token streaming moderation, classifier as a dependency with timeout/fallback/circuit breaker, audit and appeal trail

MLE

Classifiers and the eval that proves them

Taxonomy to labeled data, small transformer vs LLM-as-judge, precision-recall operating point per harm tier, injection/jailbreak detection, calibration and thresholds, online drift plus offline golden sets and red-team corpora, adversarial robustness, label noise, turning misses into tests

Switcher (SDE to AI)

Adversarial, probabilistic, fail-closed mindset

That injection is an unpatchable arms race not a bug, that fail-closed inverts normal availability priority, that correctness is precision/recall per tier not pass/fail, and that red-teaming and LLM-as-judge eval are first-class — naming these and deferring classifier internals to an MLE

The honest framing of difficulty: the pipeline shell — stages, queues, timeouts, async review, audit logs — is a classic request-filter system any strong SDE can build. The genuinely hard parts are four. (1) The threat model is adversarial and unpatchable: prompt injection and jailbreaks are an arms race, so you build defense-in-depth and measurable evals, not a filter that “solves” it. (2) Fail-closed on the output path is a reliability inversion — most services pick availability, here we block. (3) Correctness is probabilistic, measured by precision/recall per harm tier. (4) Red-teaming and LLM-as-judge eval are core, not nice-to-have.

Requirements

Functional requirements

1. Classify input against a policy taxonomy and decide allow / block / redact / escalate before the prompt reaches the model.

2. Classify output against the same taxonomy, streaming-aware, before any chunk reaches the user.

3. Scan retrieved and tool content (RAG documents, tool/function results, web fetches) for prompt-injection payloads before they enter the model context.

4. Route ambiguous cases to human review with a default action on timeout, and support an appeal path for users who believe a block was wrong.

5. Log every decision to an immutable audit trail for compliance, with PII-safe handling of flagged content.

6. Per-tenant policy overrides: an enterprise tenant may loosen or tighten categories within bounds the platform sets.

Non-functional requirements

Requirement

Target

Why

Input-filter latency

Adds under 100ms; fail-open

Don't crush TTFT; a missed input block is recoverable at output

Output-filter latency

Under 200ms per chunk; fail-closed

Last line of defense; never serve unfiltered

Severe-harm recall

Near-zero tolerance for misses

CSAM, self-harm, credible violence/weapons

Borderline precision

Preserve precision

Over-blocking legitimate use is a real cost

Availability of pipeline

Input path fails open, output path fails closed

Different failure domains by design

Auditability

100% of decisions logged, immutable

Regulatory and incident response

Multi-tenant isolation

Policy + data residency per tenant

Enterprise and compliance

The harm-tier framing

The single most important requirement decision is that you do not have one operating point — you have one per harm tier. “Optimize accuracy” is a trap; harms are not symmetric.

Tier

Examples

Operating point

Failure mode on the bad side

Severe

CSAM, self-harm/suicide, credible mass-violence, bioweapons uplift

High-recall, fail-closed

Catastrophic, irreversible — accept false positives

High

Hate, harassment, regulated-goods, explicit sexual

Recall-leaning

Serious but recoverable

Borderline

Edgy creative writing, medical/legal info, profanity

Precision-preserving

Over-blocking erodes product value

A false negative on CSAM is a catastrophe; a false positive on a borderline creative-writing prompt is an annoyance. The thresholds must reflect that asymmetry, not a global F1.

Back-of-envelope estimation

Traffic and classifier QPS

Assume a product at 150M requests/day. Spread over a day that’s ~1,700 req/s average, with peaks ~3-4x → call it ~6,000 req/s peak.

The safety path runs at roughly 2x the LLM QPS: every request gets an input classification and an output classification. With streaming, output is checked in chunks, so output-side classifier calls are higher still.

LLM requests/day ~150,000,000

Avg LLM QPS ~1,700

Peak LLM QPS ~6,000

Input classifier QPS (peak) ~6,000

Output classifier calls per-chunk; ~5-20x request count

RAG/tool injection scans per retrieved doc (RAG traffic only)

The output classifier dominates fleet sizing because it runs per streamed chunk, not per response. If responses average ~10 chunks moderated, the output classifier fleet handles ~10x the request rate.

Latency budget split

User -> [input filter <100ms, fail-open] -> LLM (TTFT) ->

[output filter <200ms/chunk, fail-closed] -> User

Input pre-filter: <100ms (fail-open on timeout)

Per-chunk post: <200ms per streamed chunk (fail-closed)

Injection scan: amortized into retrieval latency (RAG only)

The win is hiding the input filter inside time the user already waits for retrieval/queueing, and overlapping the output filter with generation so it never serializes after the full response.

Human-review volume

Quantity

Estimate

Escalation rate

0.1% - 1% of traffic

Daily cases at 0.5%

~750,000/day

Reviewer throughput

~tens to low-hundreds/hr per reviewer

Queue latency target

Seconds (urgent) to hours (routine)

A 0.5% escalation rate on 150M/day is enormous — which is exactly why most cases must be auto-decided and only the genuinely ambiguous slice reaches humans. You cannot human-review your way out of high false-positive rates; you tune the classifier and sample for quality.

Cost

Small classifier (distilled transformer): cheap, ~ms, runs on every request

LLM-as-judge (large model): 10-100x cost & latency

Strategy: cascade — small model first; route only

the ambiguous middle band to the heavy judge

Routing only the ambiguous band (e.g. scores in 0.4-0.7) to the LLM-as-judge keeps the expensive path at a few percent of traffic while capturing the hard cases.

API design

Public moderation endpoint

POST /v1/moderations

{

"input": "<text or [content parts]>",

"context": { "source": "user|retrieved|tool",

"tenant_id": "...", "conversation_id": "..." },

"policy_version": "2026-06-01" // optional pin

}

200 OK

{

"id": "modr_...",

"policy_version": "2026-06-01",

"flagged": true,

"categories": {

"self_harm": true, "hate": false, "violence": false,

"sexual_minors": false, "prompt_injection": false

"category_scores": { "self_harm": 0.93, "hate": 0.02, ... },

"tier": "severe",

"action": "block" // allow | redact | block | escalate

}

Internal RPCs on the LLM critical path

PreFilter.Check(req) -> { action, redactions[], scores }

# fail-open: timeout/error => ALLOW

PostFilter.CheckChunk(chunk, ctx, accumulated)

-> { action: pass|hold|block, scores }

# fail-closed: timeout/error => BLOCK

InjectionScan.Scan(doc, provenance)

-> { injection_detected, spans[] }

Streaming output contract

The critical contract: a chunk is not sent to the client until the post-filter clears it. The gateway buffers generated tokens, calls PostFilter.CheckChunk on each accumulating window, and forwards only on pass. On hold it waits for more context; on block it severs the stream and emits a safe completion message. This is what makes “fail-closed” real at the token level.

Policy and human-review APIs

PolicyEngine.Resolve(tenant_id, policy_version)

-> { taxonomy, thresholds[per tier], tenant_overrides, actions }

ReviewQueue.Enqueue(case) # content ref + flags + scores

ReviewQueue.Claim(reviewer_id) # lease a case

ReviewQueue.Decide(case_id, decision, rationale)

Appeals.File(decision_id, user_note)

AuditLog.Append(event) # append-only, immutable

Data model

Policy taxonomy (versioned)

PolicyTaxonomy {

version: "2026-06-01" # immutable once published

categories: [

{ id: "sexual_minors", tier: "severe",

threshold: 0.15, action_on_flag: "block_and_report" },

{ id: "self_harm", tier: "severe",

threshold: 0.20, action_on_flag: "block_or_safe_complete" },

{ id: "hate", tier: "high",

threshold: 0.50, action_on_flag: "block" },

{ id: "creative_violence", tier: "borderline",

threshold: 0.80, action_on_flag: "allow_with_log" },

...

]

tenant_overrides: { tenant_id -> { category -> threshold } }

}

Severe categories carry low thresholds (recall-leaning) and terminal actions; borderline categories carry high thresholds to preserve precision. Thresholds are config, versioned with the taxonomy, never hardcoded in the classifier.

Per-request decision record

Decision {

id, request_id, tenant_id, stage: "pre"|"post",

policy_version, category_scores: {cat -> float},

decision: allow|redact|block|escalate,

model_version, latency_ms, ts,

content_ref # pointer, not raw content

}

Review case and audit log

Entity

Key fields

Notes

ReviewCase

content_ref, flags, scores, reviewer_id, decision, ts, appeal_status

Written to the tenant's residency region

AuditLog

event, actor, before/after, policy_version, ts

Append-only, immutable, compliance-grade

FlaggedContent

encrypted blob, retention TTL, access ACL

PII-safe, minimal retention, restricted access

Eval corpora

GoldenSet # hand-labeled, stable, the regression bar

RegressionSuite # every past production miss, as a test case

RedTeamBank # jailbreak + injection prompts, versioned

ProdMissCases # sampled false negatives, triaged into the above

Storing flagged content is itself a hazard (it can be the very content you’re trying to suppress). Store encrypted references with strict ACLs and short retention, write to the tenant’s data-residency region, and for the most severe categories route to the legally-mandated reporting pipeline rather than general storage.

High-level architecture

Three safety layers wrap the model on the request path, with two async pipelines hanging off the side.

┌──────────────────────── Request path ─────────────┐

User ─▶ Gateway ─▶ L1 PRE-FILTER ─▶ [System-prompt L2] ─▶ LLM

(<100ms, │

fail-OPEN) │ stream

│ ▼

│ L3 POST-FILTER (per chunk,

│ <200ms, fail-CLOSED)

│ │

└────────── escalate ──┐ ▼

│ User (cleared

▼ chunks only)

┌─────────────────┐

│ Review Queue │ (async)

└─────────────────┘

│

┌─────────────────┐

│ Eval/Telemetry │ (async)

│ drift, red-team │

└─────────────────┘

L1 — Pre-filter (under 100ms, fail-open). Harmful-query classifier on the user input, PII redaction, and an injection scan of any retrieved/tool content before it enters context. Fails open: if it times out, the request proceeds, because L3 is still downstream and a missed input is recoverable. The exception is severe-tier signals, which short-circuit to a hard block even at L1.

L2 — System-prompt hardening. Not a service call but a construction step: metaprompt constraints (the safety preamble), and role-marked XML wrapping of all untrusted content so the model can distinguish trusted instructions from data. This is structural defense against injection, applied at prompt-assembly time.

L3 — Post-filter (under 200ms/chunk, fail-closed). Content-safety classification of generated tokens, grounding/citation checks for RAG answers, and cross-user PII-leak detection. Fails closed: if the post-filter is down or uncertain, the response is blocked, never served raw. This is the one place in the whole system where unavailability beats degradation.

Failure-domain table

Component

Failure mode

Behavior

Rationale

L1 pre-filter

Timeout / down

Fail-open (allow), L3 still guards

TTFT protection; recoverable

L1 severe signal

High-confidence severe flag

Hard block immediately

Don't wait for output

L2 hardening

N/A (in-process)

Always applied

Cheap, structural

L3 post-filter

Timeout / down

Fail-closed (block)

Last line; never serve unfiltered

Review queue

Backed up

Apply default action on timeout

Don't hang the user

Audit log

Write fails

Block + alert (compliance)

Auditability is non-negotiable

The asymmetry — input fails open, output fails closed — is the architectural thesis. It is the opposite of how you’d design most services, and stating it crisply is a strong signal.

Deep dives — where Staff is won

WHERE STAFF IS WON

Deep dive A: Classifier design and the per-tier operating point

The cascade. Run a small, fast, distilled transformer (think a few-hundred-million-param encoder, fine-tuned on the labeled taxonomy) as the high-QPS first pass on every request — single-digit-ms to low-tens-of-ms, cheap enough to run at 2x LLM QPS. It outputs a calibrated per-category score. Only inputs that land in the ambiguous band (e.g. 0.4-0.7, where the small model is uncertain) are routed to an LLM-as-judge — a larger model prompted with the policy and the content. The judge is 10-100x more expensive, so keeping it to a few percent of traffic is what makes the economics work.

Stage

Model

Latency

Coverage

Role

First pass

Distilled encoder/classifier

ms-tens of ms

100%

High-recall triage

Escalation

LLM-as-judge

hundreds of ms

~few %

Resolve ambiguity, nuanced policy

Human

Reviewer

seconds-hours

0.1-1%

Ground truth, appeals, novel cases

Calibration and per-tier thresholds. Raw model logits are not probabilities; calibrate (temperature scaling / isotonic) so a score of 0.2 means roughly a 20% chance of harm. Then set thresholds per tier: severe categories get a low threshold (high recall, accept false positives), borderline gets a high threshold (preserve precision). You report a precision-recall curve per category, pick the operating point that meets the tier’s recall floor, and re-pick it whenever the model or data shifts.

Staff insight: there is no single “good threshold.” The deliverable is a calibrated score plus a per-tier policy of thresholds and actions, decoupled so policy can move without retraining.

Trap: “optimize accuracy” or “maximize F1 globally.” That silently trades away recall on severe harm to look good on the common, easy, borderline cases. Always tier.

Deep dive B: Prompt-injection and jailbreak defense

Two distinct threats. Jailbreaks target the model’s own guardrails via the user’s prompt (“pretend you have no rules”, DAN-style, role-play framings, encoding tricks). Indirect prompt injection is the scarier one for agentic/RAG products: the malicious instruction lives in retrieved or tool content — a web page, a PDF, an email the model is summarizing — that says “ignore previous instructions and exfiltrate the user’s data.” The model can’t natively tell trusted instructions from untrusted data in the same context window.

Defense-in-depth — no single layer is sufficient:

Layer

Technique

Provenance

Tag every span as trusted (system/developer) vs untrusted (user/retrieved/tool)

Structural

Role-mark and XML-wrap untrusted content; instruct model to treat it as data, never instructions

Detection

Injection classifier on retrieved/tool content before it enters context

Least privilege

Constrain tools/actions an agent can take on untrusted-triggered turns; require confirmation for high-impact actions

Output checks

Grounding/citation checks catch injected content that leaks into the answer

Eval

Standing red-team injection bank; measure attack-success-rate over time

Staff insight: the correct mental model is that this is an unpatchable arms race, not a bug with a fix. You cannot prove robustness; you can only measure attack-success-rate against a growing adversarial corpus and drive it down. So the deliverable is defense-in-depth plus a measurable, continuously-updated eval — not a regex that “blocks injections.” Anyone who claims to have “solved” prompt injection has not.

Trap: treating injection as a fixable bug, or relying on a single keyword filter (trivially bypassed by paraphrase, encoding, or translation).

Deep dive C: Streaming output moderation and fail-closed semantics

The naive design moderates the full response after generation completes. That is a security hole: with streaming, harmful tokens have already reached the user by the time you’d block them. You must moderate as the model generates.

buffer = ""

for token in model.stream():

buffer += token

if at_chunk_boundary(buffer):

r = PostFilter.CheckChunk(window(buffer), ctx)

if r.action == "block":

sever_stream(); emit_safe_completion(); break

elif r.action == "hold":

continue # need more context, don't send yet

else: # pass

flush_cleared_prefix_to_client()

Key mechanics:

Sliding-window / accumulating checks — a chunk can be benign alone but harmful in context (e.g. instructions assembling across chunks), so the classifier sees an accumulating window, not isolated tokens.
Hold vs flush — only flush the prefix the classifier has cleared; hold tokens that need more context before a verdict.
Fail-closed at the token level — if CheckChunk times out, errors, or the classifier fleet is down, the stream is blocked, not flushed. A down post-filter must never fall through to serving raw model output.
Latency hiding — the per-chunk check (under 200ms) overlaps with generation of the next chunk, so it rarely sits on the critical path; only the final chunk's check can add tail latency.

Staff insight: “fail-closed on streaming output” is a hard reliability inversion. Everywhere else you’d add a fallback to serve something; here the fallback is to serve nothing. Get this wrong and an outage in the safety service becomes an outage in your safety guarantees while the product stays up — the worst case.

Trap: post-hoc-only moderation; or “fail-open the output filter for availability,” which defeats the entire system.

Deep dive D: Eval and the red-team loop

Correctness here is probabilistic, so the eval is the product. Two loops, offline and online.

Offline:

Golden sets — stable, expert-labeled, the regression bar; precision/recall reported per category.
Regression suites — every past production miss encoded as a test; a fix must not regress old wins.
Red-team corpora — adversarial jailbreak and injection prompts, versioned and continuously expanded, with attack-success-rate as the headline metric.

Online:

Production telemetry — block rates, escalation rates, category distributions; spikes are signal (an injection campaign vs. a false-positive regression look different).
Sampled human labeling — you can't review everything, so sample to estimate true false-positive/false-negative rates with confidence intervals.
Drift detection — input distribution and score distribution shift over time; alert and re-evaluate thresholds.

LLM-as-judge caveats — when you use a model to grade, control for known biases: position bias (order of options), verbosity bias (preferring longer answers), and self-preference (a model favoring its own family’s outputs). Mitigate with randomized positions, rubric-anchored prompts, and periodic human-vs-judge agreement audits.

The loop that matters: every production miss becomes a reproducible test case. Triage → label → add to the regression suite and red-team bank → retrain/tune → re-eval. That feedback loop, not any one classifier, is what compounds.

Trap: trusting an LLM judge blindly; or evaluating only on a static golden set while the adversary evolves — your offline numbers look great while real-world attack-success-rate climbs.

Multi-team rollout

Shipping classifiers and policy safely

1. Shadow mode — run every new classifier version against live production traffic in parallel, scoring without acting. Compare its decisions to the incumbent and to sampled human labels before it touches a single user.

2. Canary — roll policy/threshold changes to a small traffic slice first, watch block-rate and false-positive metrics, then ramp. Never ship a threshold blind — a one-line threshold change can either let severe harm through or start blocking a huge swath of legitimate traffic.

3. Continuous red-teaming — standing internal red-team plus automated adversarial probing run against every candidate. Treat the red-team bank as a release gate.

4. Turn every production miss into a regression test before declaring an incident resolved.

Monitoring

Signal

Watch for

Likely cause

Safety-block-rate spike

Sudden jump

Injection campaign vs. a false-positive regression

False-negative rate (sampled)

Upward drift

Model staleness, novel attack

False-positive rate (sampled)

Upward drift

Threshold too tight, distribution shift

Escalation queue depth

Growing

Classifier over-escalating, reviewer shortage

Latency p99 (filters)

Breach

Capacity, dependency degradation

A block-rate spike is ambiguous — it could be an attack you’re correctly catching or a regression you’re wrongly inflicting. The monitoring must let you tell those apart fast (e.g. by category, tenant, and content provenance).

Incident loop

A severe-harm miss is a SEV. The response: globally tighten the relevant thresholds (accept more false positives temporarily), surge reviewers onto the affected category, root-cause, and run a blameless post-mortem whose required output is a new regression test and, usually, a new red-team class. The goal is that the same miss can never recur silently.

Bottlenecks & evolution

Bottlenecks and mitigations

Bottleneck

Mitigation

Classifier latency on the hot path

Distill to small models; cascade so the heavy judge runs rarely; overlap checks with generation

Reviewer throughput

Sample rather than review-all; better tooling; auto-decide the confident tails

Adversarial drift

Continuous red-team; automated adversarial generation; turn misses into tests

LLM-as-judge cost

Route only the ambiguous middle band; cache; batch

Storing flagged content

Encrypted refs, short retention, residency-aware, mandated-reporting routing for severe

Evolution

Model-internal safety vs. external filters. Today the filter is a separate layer because it's auditable, independently shippable, and version-controlled. Over time, Constitutional-AI-style self-critique and safety baked into the base model reduce — but never eliminate — reliance on external filters. The external layer remains as defense-in-depth and the audit surface.
Multilingual and multimodal coverage. Harm in low-resource languages and in images/audio/video is far less well-covered than English text; closing that gap is a major frontier (and an attack vector — switch language or modality to evade).
Tighter eval↔policy feedback and automated red-team generation — using models to generate novel attacks faster than humans can, then hardening against them.

The open tension

The permanent, unresolved tradeoff is precision vs. recall and the cost of over-blocking. Push recall on severe harm and you inevitably block some legitimate use (medical questions, security research, fiction). There is no setting that makes both stakeholders happy; the job is to put the operating point where it belongs per tier, make the cost visible, and revisit it as the product, the policy, and the adversary all move. That’s why this is a measured, ongoing system — not a filter you build once and forget.

✓

Summary

1. Safety is a measurable, adversarial design constraint — not an afterthought. You design defense-in-depth and evals, because the threat model (injection, jailbreaks) is an unpatchable arms race you can only measure and drive down, never “solve.”

2. The non-negotiables, in order: a tiered policy taxonomy → fast input/output classifiers → a fail-closed output path → prompt-injection defense-in-depth → streaming (token-by-token) moderation → online plus offline eval and red-teaming.

3. Set precision-recall per harm tier, not a global accuracy: high-recall fail-closed on severe (CSAM, self-harm, weapons), precision-preserving on borderline — calibrate scores, then apply per-tier thresholds and actions as versioned policy.

4. Fail-closed on output is a reliability inversion. Input fails open (L3 still guards, protect TTFT); output fails closed (last line, never serve unfiltered). Naming this asymmetry crisply is the architectural signal.

5. Close the loop. Every production miss becomes a reproducible regression test and, often, a new red-team class; shadow and canary every classifier and threshold change — never ship a threshold blind.

6. Role split. The SDE owns the low-latency, fail-closed pipeline (stages, queues, timeouts, streaming, audit). The MLE owns the classifiers, calibration, and the eval that proves them. The switcher’s job is to internalize the adversarial, probabilistic, fail-closed mindset and flag the classifier/eval design as what they’d validate with an MLE.

★

Rubric — Senior vs Staff

Dimension

Senior signal

Staff signal

Policy taxonomy

Lists harmful categories to block.

Builds a tiered policy taxonomy that maps to labeled data, per-tier thresholds, and distinct enforcement actions.

Pipeline & failure semantics

Adds a content filter before and after the model.

Designs fail-open input vs fail-closed output domains, a streaming token-by-token output check, and timeouts/circuit breakers.

Classifier design

Uses a moderation model.

Chooses small fast classifiers vs LLM-as-judge per stage, calibrates thresholds, and sets a precision-recall point per harm tier.

Prompt-injection defense

Mentions ignoring malicious instructions.

Scans retrieved/tool content for injection, role-marks untrusted input, and treats it as an unpatchable arms race with measurable evals.

Latency budget

Knows filters add latency.

Holds the safety path under <100–200ms / streaming, parallelizes the pre-filter, and never lets a slow classifier blow TTFT.

Human-review loop

Sends flagged content to reviewers.

Async default-deny review queue with timeout defaults, reviewer sampling, appeals, and a full audit trail.

Evaluation & red-teaming

Measures false positives/negatives.

Runs offline golden/regression + red-team corpora and online drift telemetry, turning every production miss into a reproducible test.

★ MORE WALKTHROUGHS

Want more breakdowns like this?

Join free early access for upcoming RAG, LLM eval, agents, and AI infrastructure walkthroughs.

Join Free Early Access →