
Design an Experimentation Platform with a Causal-Inference Stats Engine

An experimentation platform is two systems fused: a low-latency assignment/flag-delivery plane that buckets users deterministically and logs exposures, and a causal-inference stats engine that turns noisy metrics into a defensible ship decision. The companies that build these (Netflix, Airbnb, Booking.com, Spotify's Confidence) interview AI/ML and platform engineers on exactly this seam — the staff bar is owning the trust contract (SRM, variance reduction, peeking, interference) end to end, not just delivering flags.

Level: Staff
Category: AI System Design
Interview time: 60 min
WHAT THIS QUESTION TESTS
·Can you keep assignment deterministic and consistent (sticky bucketing) across thousands of concurrent, layered experiments without cross-contamination?
·Do you stop peeking from inflating false positives — and can you explain mSPRT / always-valid p-values vs group-sequential?
·Do you detect Sample Ratio Mismatch and treat it as a decision-invalidating signal, not a metric?
·Do you reach for variance reduction (CUPED) and explain the sample-size win, plus FDR control for many metrics?
★ STAFF-LEVEL SIGNALS
Frames the platform as a TRUST CONTRACT: every result is guilty until SRM, exposure integrity, and peeking are cleared.
Knows when randomization breaks (marketplace interference / SUTVA) and reaches for switchback, cluster, or quasi-experimental designs.
Owns the org-scale failure mode: thousands of tests => false-discovery inflation, metric sprawl, and interaction effects, and designs guardrails + FDR for it.
Separates the assignment plane (sub-ms, must never fail) from the stats plane (batch, must never lie) with the right consistency model for each.
0

Frame the trust contract & who's asking

“A flag delivery system answers who sees what. An experimentation platform answers did it work, and can I trust the answer? The second question is the entire job.”

Netflix, Airbnb, Booking.com, and Spotify’s Confidence all run internal experimentation platforms, and they interview AI/ML and platform engineers on exactly this seam: the point where a low-latency serving system meets a causal-inference stats engine. The naive version of this design — deliver flags, run a t-test, read the p-value — will fail the interview, because every one of those companies has been burned by a “significant” result that was a peeking artifact, a silently broken split, or a marketplace launch where treatment leaked into control. This walkthrough is honest industry context for the kind of system these teams build, not a claim about any specific company’s interview question.

The real problem

The deliverable is not a flag. The deliverable is a defensible ship/no-ship decision that survives an adversarial review. The right mental model is a trust contract: every result is guilty until proven innocent — invalid until Sample Ratio Mismatch (SRM) is clear, exposures are counted correctly, and the result was read under a valid stopping rule. Staff candidates frame the platform as a decision engine, not a metrics calculator.

The two planes

Name these up front because they have opposite failure semantics and opposite consistency models:

1. Assignment / delivery plane — online, sub-millisecond, on the request path. It must never fail open to a half-rolled-out variant. Eventually-consistent config is fine.

2. Stats / decision plane — offline/batch, off the request path. It must never lie. It is allowed to be slow; it is not allowed to emit a clean-looking but invalid decision.

Who asks this & what they probe

| Role | What they probe |
|---|---|
| SDE (platform) | Assignment determinism, sticky bucketing, exposure logging, exactly-once-ish semantics, sub-ms flag eval, kill-switch latency |
| MLE (stats) | CUPED variance reduction, sequential testing vs peeking, CATE/heterogeneity, SRM as a gate, FDR across many metrics |
| Switcher (SDE to AI) | Whether they see the inference layer as a trust contract, not a t-test bolted onto a flag system |

The switcher framing matters: if you already own flags, hashing, low-latency lookups, and pipelines, you are 60% of the way here. The credential is the inference layer.

Scope cut

| Aspect | This design | Out of scope |
|---|---|---|
| Core novelty | Stats + decision engine (CUPED, sequential, SRM, FDR, CATE) | Generic flag-flip toggles (that's the feature-flag question) |
| Assignment | Deterministic hashing, layers, sticky bucketing | Raw stream aggregation (that's event-analytics / ad-click) |
| Output | Trustworthy ship/no-ship scorecard | A real-time per-event dashboard for its own sake |

Functional requirements: define a treatment in code; allocate a % of traffic; log exposure; compute per-variant metrics; emit ship/no-ship.

Non-functional requirements:

| Requirement | Target |
|---|---|
| Assignment | Deterministic, sticky across sessions/devices, stateless on hot path |
| Scale | Thousands of concurrent experiments; 10^8–10^9 exposure events/day (Statsig's engineering blog cites streaming ~1T+ events/day, now several trillion/day) |
| Flag eval latency | Sub-millisecond, local/in-process |
| Decision trust | SRM-clean, exposure-correct, valid stopping rule |
| Kill-switch | Variant force-off within minutes |

Explicit non-goals: this is not a generic flag-flip system, and it is not a raw stream-aggregation system. The hard, interesting part is the stats and decision engine on top.

1

Assignment, bucketing & layered experiments

The assignment plane must give every unit a deterministic, sticky variant with no per-user state in the hot path and no cross-contamination between concurrent experiments.

Deterministic hashing

Assignment is a pure function of (experiment salt, unit id):

bucket = hash(salt = experiment_id | layer, key = unit_id) mod N
variant = range_map(bucket) # bucket ranges -> variants

MurmurHash3 or MD5 followed by a modulo is the industry-standard primitive (used across Optimizely- and Google-style platforms). The same input always yields the same variant, so no lookup or database round-trip is needed to decide assignment — the function is the source of truth.

Salting (the anti-confounding move)

Salting the hash with the experiment/layer id is not cosmetic. Without it, every concurrent experiment partitions the population the same way, so a unit’s variant in test A is correlated with its variant in test B — the tests confound each other. Salting per experiment makes concurrent randomizations statistically independent (orthogonal).
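A small simulation can make the confounding concrete. This is a minimal sketch, not a production hash: it uses MD5 from Python's standard library, and the helper `in_treatment` and the salts `exp_A` / `exp_B` are hypothetical names chosen for illustration.

```python
import hashlib

def in_treatment(unit_id: str, salt: str = "", n: int = 1000) -> bool:
    """50/50 split: hash (salt, unit_id) into n buckets; the lower half is treatment."""
    digest = hashlib.md5(f"{salt}:{unit_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n < n // 2

units = [f"user-{i}" for i in range(10_000)]

# Unsalted: experiments A and B hash the same input, so they split users identically.
a_unsalted = [in_treatment(u) for u in units]
b_unsalted = [in_treatment(u) for u in units]
unsalted_agreement = sum(a == b for a, b in zip(a_unsalted, b_unsalted)) / len(units)

# Salted: each experiment mixes its own id into the hash input.
a_salted = [in_treatment(u, "exp_A") for u in units]
b_salted = [in_treatment(u, "exp_B") for u in units]
salted_agreement = sum(a == b for a, b in zip(a_salted, b_salted)) / len(units)
```

Without a salt the two experiments agree on every unit, so their effects cannot be separated. With per-experiment salts the agreement should sit close to 50%, which is what independent coin flips would produce.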

Layers / domains: orthogonal vs exclusive

| Mode | Behavior | When to use |
|---|---|---|
| Orthogonal layers | A unit can be in many experiments at once; each layer randomizes independently | Default: most tests don't interact, and you want concurrency |
| Exclusive layers (mutual-exclusion group) | A unit is in at most one experiment in the group | Tests that would interact (e.g., two changes on the same UI surface) |

Name both. Orthogonal layers maximize throughput; exclusive layers buy you isolation when interference between treatments is plausible.
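One way to express both modes in a single structure is to give each layer its own salt and let experiments inside a layer own disjoint bucket ranges. The sketch below assumes that layout; the layer names, experiment names, and the `LAYERS` config shape are all hypothetical.

```python
import hashlib

def bucket(layer_salt: str, unit_id: str, n: int = 1000) -> int:
    """Deterministic bucket in [0, n) for this unit within one layer."""
    digest = hashlib.md5(f"{layer_salt}:{unit_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

# Hypothetical config: the layer name doubles as the salt, and experiments
# inside a layer own disjoint bucket ranges [lo, hi).
LAYERS = {
    "checkout_ui": [(0, 500, "exp_button_color"), (500, 1000, "exp_button_copy")],
    "ranking": [(0, 1000, "exp_new_ranker")],
}

def experiments_for(unit_id: str) -> dict:
    """Return at most one experiment per layer for this unit."""
    active = {}
    for layer, ranges in LAYERS.items():
        b = bucket(layer, unit_id)
        for lo, hi, exp in ranges:
            if lo <= b < hi:
                active[layer] = exp
                break
    return active
```

The two `checkout_ui` experiments are mutually exclusive because their ranges never overlap, while `ranking` randomizes independently of `checkout_ui` because it hashes with a different salt.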

Bucket budget

A common choice is 1,000–10,000 buckets on a number line. More buckets means finer allocation granularity and more concurrent non-overlapping tests, at the cost of bookkeeping. Reusing buckets for a sequential follow-up test requires a re-randomization salt, or residual effects from the prior experiment leak in (carryover bias).

Sticky bucketing

Assignment must be consistent across sessions, devices, and app restarts. The classic correctness trap: a user is assigned by anonymous id pre-login, then logs in and gets re-hashed by user id — flipping their variant mid-experiment. The platform must reconcile anonymous-to-user identity so the unit stays in one arm.

Code sketch

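import hashlib

def assign(exp_id, salt, unit_id, variants, n=10000):
    # MD5 stands in for murmur3 here; any stable, well-mixed hash works.
    digest = hashlib.md5(f"{exp_id}:{salt}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % n
    width = n // len(variants)  # equal split; leftover buckets go to the last variant
    return variants[min(bucket // width, len(variants) - 1)]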

Stateless, O(1), no network call — safe to run on the hot request path millions of times per second.

2

Flag delivery & exposure logging

The serving plane evaluates flags in microseconds and emits the single load-bearing event the entire stats engine depends on: the exposure.

Local vs remote evaluation

| Approach | Latency | Verdict |
|---|---|---|
| Local in-process SDK eval | Sub-ms, often sub-microsecond | Required on the request path |
| Remote per-call eval (~20 flags) | 200ms+ round trips | Unacceptable on the hot path |

Rules are synced to a client/server SDK and evaluated in-process. A control plane publishes the ruleset (config) to a CDN/edge plus an SDK cache, refreshed via periodic poll or stream. Seconds of config staleness is fine; the hot path must never block on the network to decide a variant.

Exposure semantics (triggered analysis)

Log a unit as exposed only when the treatment is actually evaluated — when the code path that branches on the flag runs — not when the unit is merely eligible. This is triggered / exposure-based analysis. Counting eligible-but-unexposed units dilutes the measured effect toward zero, because you’re averaging the treated outcome over people who never saw the treatment.

Exactly-once-ish

Exposures are extremely high volume and lossy by nature. Don’t promise exactly-once delivery; instead dedupe by (unit_id, exp_id, day) and make the downstream join idempotent, so retries and duplicates don’t double-count.

| Concern | Mechanism |
|---|---|
| Duplicate exposures | Dedupe key (unit_id, exp_id, day) |
| Retries | Idempotent join in the metric pipeline |
| Loss | Tolerated statistically; monitored per-arm volume as an SRM tripwire |
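The dedupe step can be sketched as a pure function, which is what makes it safe to re-run. This is an illustration under assumptions: the event shape (a dict with `unit_id`, `exp_id`, `variant`, and `ts` in epoch seconds) and the choice of UTC day boundaries are not from any specific platform.

```python
def dedupe_exposures(events):
    """Keep the earliest exposure per (unit_id, exp_id, day).

    Each event is a dict with keys unit_id, exp_id, variant, ts (epoch seconds, UTC).
    """
    first = {}
    for e in events:
        day = e["ts"] // 86_400  # UTC day index
        key = (e["unit_id"], e["exp_id"], day)
        if key not in first or e["ts"] < first[key]["ts"]:
            first[key] = e
    # Stable output order so repeated runs produce identical results.
    return sorted(first.values(), key=lambda e: (e["ts"], e["unit_id"], e["exp_id"]))
```

Running it twice, or running it on a batch that contains every event twice, gives the same output as running it once. That is the idempotence property a retrying pipeline relies on.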

Kill-switch path

A variant must be force-disable-able within minutes via a config push that the SDK picks up on its next poll. Decouple “turn it off now” from the experiment lifecycle — killing a bad variant should never require ending or recomputing the experiment.

The trap to name

Lazy or late exposure logging biases the denominator and is the #1 way a perfectly correct stats engine still produces a wrong answer. If exposures arrive late or are attributed to the wrong day, the per-variant counts are wrong before any statistics run.

3

Metric pipeline, SRM detection & guardrails

This is where raw events become trustworthy per-variant statistics — and where the trust gate lives.

Pipeline shape

exposures ⋈ metric events
-> per-unit metric aggregates
-> per-variant moments (mean, variance, n)
-> scorecard

Run a batch path (hourly/daily) for the scorecard, plus a faster near-real-time path that feeds guardrails and kill-switches. Store moments (counts, sums, sums-of-squares), not raw rows — they’re sufficient statistics, and the same aggregates cheaply feed t-tests, CUPED, and sequential bounds at thousands-of-tests scale.
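A minimal sketch of the moments idea, assuming a hypothetical `Moments` record per variant: partial aggregates from different shards or hours merge by addition, and mean, variance, and a difference-in-means z-statistic fall out without touching raw rows. Note that the sum-of-squares form can lose precision when the mean is large relative to the spread; a production pipeline might prefer a numerically stabler update.

```python
from dataclasses import dataclass
import math

@dataclass
class Moments:
    n: int = 0
    s: float = 0.0   # sum of values
    ss: float = 0.0  # sum of squared values

    def add(self, x: float) -> None:
        self.n += 1
        self.s += x
        self.ss += x * x

    def merge(self, other: "Moments") -> "Moments":
        # Sufficient statistics combine by simple addition.
        return Moments(self.n + other.n, self.s + other.s, self.ss + other.ss)

    @property
    def mean(self) -> float:
        return self.s / self.n

    @property
    def variance(self) -> float:
        # Unbiased sample variance recovered from n, sum, and sum of squares.
        return (self.ss - self.s * self.s / self.n) / (self.n - 1)

def welch_z(treat: Moments, ctrl: Moments) -> float:
    """Difference in means divided by its (unpooled) standard error."""
    se = math.sqrt(treat.variance / treat.n + ctrl.variance / ctrl.n)
    return (treat.mean - ctrl.mean) / se
```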

SRM as a gate, not a metric

Sample Ratio Mismatch detection runs before any effect is read. It is a chi-squared goodness-of-fit test on observed vs expected unit counts per variant:

chi2 = Σ (observed_i - expected_i)^2 / expected_i

A 50/50 split that lands 50.3/49.7 across millions of units is not noise — it’s almost always a real defect: a logging bug, a redirect that’s slower on one arm, or asymmetric bot filtering. If SRM trips, you invalidate the experiment — you do not footnote it and read the effect anyway, because whatever broke the split almost certainly biased the metric too.
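For the common two-arm case the check needs nothing beyond the standard library, because a chi-squared statistic with one degree of freedom has a closed-form tail probability. The function name `srm_check` and its signature are assumptions for this sketch; a multi-arm version would need a general chi-squared survival function.

```python
import math

def srm_check(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5):
    """Chi-squared goodness-of-fit for a two-arm split. Returns (chi2, p_value)."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With two arms there is 1 degree of freedom, where the chi-squared
    # survival function has the closed form erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_check(n_control=994_000, n_treatment=1_006_000)
```

For a 50.3/49.7 split over 2 million units the statistic is 72, and the p-value is many orders of magnitude below 0.001, so the experiment would be invalidated.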

SRM thresholds (real industry values)

| Threshold | Used by / rationale |
|---|---|
| p < 0.01 | Common default alert |
| p < 0.001 | Used by some tooling |
| p < 0.0005 | Microsoft's conservative bar to suppress false alarms |

Below threshold => decision-invalidating, not advisory.

Common SRM root causes (recite these)

  • Assignment-vs-exposure mismatch (assigned in one arm, logged in another)
  • Asymmetric bot/crawler filtering between arms
  • Redirect or latency differences between treatment and control
  • Sticky-bucketing leaks across the anonymous-to-login boundary

Guardrails + auto kill-switch

Guardrail metrics — latency, error rate, crash-free sessions, revenue — are monitored continuously. Because they’re checked continuously, they must use a sequential / always-valid bound (see Step 5), so an auto kill-switch can trip the moment a guardrail regresses without inflating false alarms from repeated looks.

4

Variance reduction: CUPED & power

This is the single biggest throughput lever the platform has, and it’s where “experimentation” becomes “causal inference.”

Why variance is the enemy

Required sample size scales with variance / MDE² (MDE = minimum detectable effect). So cutting variance by X% cuts the users or days needed by ~X% for the same detectable effect. Lowering variance is mathematically equivalent to making every experiment faster — across thousands of tests, that’s enormous capacity.
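The scaling can be checked with the standard two-sample approximation, n per arm ≈ 2(z_{1-α/2} + z_{power})² · variance / MDE². The sketch below assumes a two-sided z-test on a mean with equal arm sizes; `n_per_arm` is a hypothetical helper name and the input values are made up.

```python
from statistics import NormalDist
import math

def n_per_arm(variance: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sided, two-sample z-test on a mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / mde ** 2)

baseline = n_per_arm(variance=1.0, mde=0.05)
reduced = n_per_arm(variance=0.7, mde=0.05)    # 30% variance reduction
half_mde = n_per_arm(variance=1.0, mde=0.025)  # detect an effect half as large
```

Cutting variance by 30% cuts the required sample by about 30%, while halving the MDE roughly quadruples it. That asymmetry is why variance reduction is usually the cheaper lever.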

CUPED mechanics

CUPED (Controlled-experiment Using Pre-Experiment Data) adjusts each unit’s metric using a pre-period covariate:

Y_adj = Y - θ (X - E[X])
θ = Cov(Y, X) / Var(X)

Here X is the same metric measured in the pre-experiment period. Because X is from before treatment, it cannot be affected by treatment, so subtracting it removes pre-existing between-user noise (some users are just heavier than others) without biasing the treatment effect. It’s a control-variate / regression-adjustment idea: explain away the variation you can predict, and what’s left is a cleaner read on the treatment.
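The adjustment is only a few lines. The sketch below is an illustration on synthetic data, not a production estimator: `cuped_adjust` is a hypothetical helper, θ is estimated on whatever units are passed in (in practice, all arms pooled), and the simulated relationship between pre-period and in-experiment usage is made up.

```python
import random

def cuped_adjust(y, x):
    """Return CUPED-adjusted metrics Y - theta * (X - mean(X)).

    y: in-experiment metric per unit; x: the same metric from the pre-period.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

# Synthetic data (illustrative only): pre-period usage predicts in-experiment usage.
rng = random.Random(0)
x = [rng.gauss(10, 2) for _ in range(5000)]
y = [xi + rng.gauss(0, 2) for xi in x]
y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)
```

The adjusted metric keeps the same mean as the original, so the treatment-effect estimate is not shifted. The fraction of variance removed equals the squared correlation between X and Y, which is about one half for this synthetic setup.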

The sample-size win

| Setting | Variance reduction | Effect |
|---|---|---|
| Typical range | 20–50% | ~20–50% fewer users/days |
| Etsy (reported) | ~7% average, up to ~30% | ~1 day earlier decision on average |
| Microsoft (reported) | 40–50% on queries-per-user | Large power gain on a key metric |

Quote the range, not a single hero number — the reduction depends entirely on how predictive the covariate is. Etsy’s later ML-based successor (predicted control variates / CUPAC) pushed the duration win further (~3 days), but that’s a different method than plain CUPED.

Pitfalls

Dilution caveat: CUPED only helps units that have pre-period data. If 50% of traffic is new users (no history) and the returning-user reduction is 40%, the population-level reduction is only ~20%, assuming both groups contribute similar variance. Choose the covariate and segment honestly; don’t report the returning-user number as if it applied to everyone.

Adjacent levers

  • Stratification / post-stratification — balance or reweight by known segments.
  • Variance-weighted estimators — weight units by inverse variance.
  • Capping / winsorizing heavy-tailed metrics (revenue) before testing, so one whale doesn't dominate the variance.

MLEs own this because CUPED is the cleanest signal that someone actually understands the stats layer — it’s regression adjustment dressed as an experimentation feature.

5

Stopping rules: sequential testing & the peeking problem

This is the dimension that most cleanly separates a calculator from a decision engine.

Why peeking lies

A fixed-horizon test is statistically valid only at its single pre-planned sample size. If you repeatedly check a “95% confident” test and stop the first time it crosses significance, the real false-positive rate is far above 5% — with frequent peeking it climbs toward 20–30%+. You’re effectively running many tests and keeping the one that looks good. A “significant” result read off a peeked fixed-horizon test is, bluntly, a lie.
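The inflation is easy to reproduce with an A/A simulation. This sketch makes simplifying assumptions: variance is known, each look adds one equal-sized batch, and each batch's standardized treatment-minus-control difference is drawn as an independent N(0, 1). The function name and defaults are hypothetical.

```python
import random

def false_positive_rate(n_sims=2000, n_looks=20, z_crit=1.96, peek=True, seed=0):
    """Simulate A/A tests (no true effect) and return the share declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        rejected = False
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0, 1)
            z = running_sum / look ** 0.5  # z-statistic on all data so far
            if peek and abs(z) > z_crit:
                rejected = True            # stop at the first significant look
                break
        if not peek:
            rejected = abs(z) > z_crit     # single pre-planned read at the end
        false_positives += rejected
    return false_positives / n_sims

fpr_fixed = false_positive_rate(peek=False)
fpr_peeked = false_positive_rate(peek=True)
```

Under these assumptions the single pre-planned read should stay near the nominal 5%, while stopping at the first significant result across 20 looks should land several times higher, even though nothing about the data or the test statistic changed.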

Why it matters here specifically

Engineers will watch the dashboard every day, and guardrails must check continuously to power kill-switches. So the platform’s default stopping rule has to be valid under continuous monitoring — otherwise every kill-switch trip and every dashboard glance is a peeking violation built into the product.

Three regimes

| Regime | Valid peeking | Power per sample | Commit in advance? |
|---|---|---|---|
| Fixed-horizon | None | Maximum | Sample size only |
| Group-sequential (O'Brien-Fleming / Pocock) | At pre-set interim looks | High | Yes, the look schedule |
| Always-valid (mSPRT, confidence sequences) | Any time, unlimited | Lower (wider intervals) | No |

Always-valid / anytime-valid inference (mSPRT; Johari et al.) produces confidence sequences whose coverage holds at every time point — peek as often as you like, stop whenever you want. The price is wider intervals (less power per unit of data).

Group-sequential (alpha-spending via O’Brien-Fleming or Pocock) is the middle ground: valid at a pre-committed number of interim looks, tighter than always-valid, but you must fix the look schedule up front.

Fixed-horizon gives maximum power per sample but zero valid peeking — only safe if the org can truly resist looking, which at scale it cannot.
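For the always-valid regime, a minimal mSPRT sketch looks like the following. It assumes the simplest setting: a stream of normally distributed observations (for example, paired treatment-minus-control differences) with known variance `sigma2`, and a N(0, `tau2`) mixing prior over the true effect. The function name and the default `tau2` are assumptions; real platforms tune the mixing variance and estimate the observation variance.

```python
import math

def msprt_p_values(diffs, sigma2, tau2=1.0):
    """Always-valid p-values for H0: mean(diffs) == 0.

    diffs:  stream of observations (e.g., treatment-minus-control differences)
    sigma2: variance of one observation, assumed known here
    tau2:   variance of the N(0, tau2) mixing prior over the true effect
    """
    p, total, out = 1.0, 0.0, []
    for n, d in enumerate(diffs, start=1):
        total += d
        # Log of the mixture likelihood ratio for a normal mean with a normal
        # mixing prior; working in logs keeps a very large ratio from overflowing.
        log_lr = 0.5 * math.log(sigma2 / (sigma2 + n * tau2)) + (
            tau2 * total ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
        )
        # The always-valid p-value is the running minimum of 1 / ratio, capped at 1.
        p = min(p, math.exp(-log_lr))
        out.append(p)
    return out
```

Because the p-value only ever decreases and its validity holds at every n, a dashboard can display it after each batch and a kill-switch can act on it at any time without inflating the false-positive rate.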

The staff move

Make sequential the platform default for ramps and guardrail decisions (where continuous monitoring is unavoidable), and offer fixed-horizon for pre-registered confirmatory tests where a team commits to a horizon. Match the stopping rule to how the result will actually be consumed, not to whichever has the prettiest power curve.

6

Deep dive — when randomization breaks: interference, switchback & quasi-experiments

WHERE STAFF IS WON

A candidate who applies a clean user-level A/B t-test to a marketplace launch has shipped a confidently wrong number. The bar is recognizing which assumption broke and reaching for the right alternative design.

SUTVA & interference

User-level randomization rests on SUTVA (Stable Unit Treatment Value Assumption): one unit’s treatment does not affect another unit’s outcome. In a two-sided marketplace this is routinely false. A treatment that lifts demand in the treatment arm steals finite supply (drivers, listings, inventory) from the control arm — so control’s outcome gets worse because treatment got better. The measured difference now blends the true effect with cannibalization, and the estimate is biased, often badly. Pricing, ranking, matching, and notification experiments are the usual offenders.

Switchback experiments

Instead of randomizing users, randomize treatment over geo × time-window units — the standard design at ride-share, delivery, and dynamic-pricing companies. The unit of randomization is a region-timeslice (e.g., “city X, 7–8pm”), large enough to contain the interference: within a switchback window the whole market is on one arm, so supply isn’t being stolen across arms. The analysis cost is temporal / serial correlation — consecutive windows aren’t independent, so you must account for it (block bootstrap, HAC-style variance, or a model with autocorrelation) or your standard errors lie.
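Assignment for a switchback can reuse the same deterministic-hash idea, keyed on the region-timeslice instead of the user. The sketch below is one simple option under assumptions: a one-hour window, MD5 as the hash, and hypothetical names throughout. Real designs often use balanced or alternating schedules rather than independent coin flips per window.

```python
import hashlib

def switchback_arm(exp_id: str, region: str, ts: int, window_s: int = 3600) -> str:
    """Assign a whole region-timeslice to one arm (ts is epoch seconds)."""
    window = ts // window_s  # index of the time window containing ts
    digest = hashlib.md5(f"{exp_id}:{region}:{window}".encode()).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"
```

Every request in the same region and window resolves to the same arm, so the whole local market is on one treatment at a time. Users never appear in the hash input, which is the point of the design.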

Cluster randomization

Randomize at the level of a connected component — a city, a social cluster, a market — so that interference stays inside a cluster and doesn’t cross the treatment/control boundary. The trade is statistical power: your effective N drops from millions of users to a few dozen clusters, so you need a larger effect or more clusters to detect anything. You buy unbiasedness with variance.

Quasi-experiments (when you can’t randomize)

Some changes can’t be randomized at all: a pricing law applies to everyone, a brand TV campaign hits a whole region, a data migration is one-way.

| Method | Core assumption | Use case |
|---|---|---|
| Difference-in-differences | Parallel trends (treated & control would have moved together absent treatment) | Policy/price change in some geos, not others |
| Synthetic control | A weighted combo of untreated geos reconstructs the treated unit's counterfactual | One treated region, many candidate controls |
| Geo holdouts | A set of regions deliberately untreated | Marketing / brand lift you can't randomize per user |

Difference-in-differences compares the change in treated vs the change in control, canceling out fixed differences — but it dies if trends weren’t parallel pre-treatment, so always plot the pre-period. Synthetic control builds a “fake control” as a weighted blend of donor geos that tracks the treated unit before treatment, then reads the gap after.
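The difference-in-differences point estimate is plain arithmetic, shown here on made-up numbers. The helper name and inputs are hypothetical, and the sketch deliberately omits standard errors and the pre-trend check, both of which a real analysis needs.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Each argument is a list of outcomes (e.g., daily orders per geo)."""
    def mean(v):
        return sum(v) / len(v)
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # Subtracting the control change removes the shared time trend.
    return treated_change - control_change

# Made-up numbers for illustration.
effect = diff_in_diff(
    treated_pre=[100, 102, 98], treated_post=[120, 118, 122],
    control_pre=[90, 91, 89], control_post=[100, 99, 101],
)
```

Here the treated geos rose by 20 and the control geos by 10, so the estimated effect is 10. The fixed 10-unit gap between the groups before treatment cancels out, which is exactly what the method is designed to do.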

Heterogeneity (CATE)

The average treatment effect can hide a harmed segment. Conditional Average Treatment Effect estimation finds who the treatment helps vs harms, using methods like the X-learner or causal forest (commonly benchmarked on uplift datasets such as Criteo Uplift v2.1, ~14M records). This powers targeting (ship the treatment only to the segment it helps) and is a guardrail in its own right: a positive average with a badly harmed subgroup is a no-ship, not a win.

The staff signal

Naming the failure of naivety is the credential. Anyone can run a t-test. The staff-level move is to say: “user-level randomization violates SUTVA here, so the clean A/B number is biased — switch to switchback, or cluster, or fall back to diff-in-diff / synthetic control.” Knowing the assumption broke, and the right alternative, is the bar.

7

Org scale: many tests, FDR, interactions & novelty

At thousands of tests with dozens of metrics each, statistics that are fine in isolation generate a flood of false wins.

False-discovery inflation

At α = 0.05, ~1 in 20 null comparisons is a false positive by construction. Thousands of tests × dozens of metrics each means a steady stream of spurious “significant” results even if nothing is real. Uncorrected, the scorecard becomes a slot machine that always eventually pays out.

FDR control

| Method | Controls | Verdict at this scale |
|---|---|---|
| Bonferroni | Family-wise error rate | Far too conservative; kills real wins |
| Benjamini-Hochberg | False Discovery Rate (expected fraction of declared wins that are false) | Preferred |

Use Benjamini-Hochberg. It controls the FDR, the expected proportion of your declared wins that are actually false, which is the quantity a product org actually cares about. Worked example: if you declare 20 metrics significant under FDR 0.05, you should expect at most about 1 of those 20 to be a false positive. That’s a tolerable, quantified error rate, versus Bonferroni’s posture of declaring almost nothing significant.
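The Benjamini-Hochberg step-up procedure is short enough to sketch in full. The function name and return shape are choices made for this example; the p-values in the usage note are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

decisions = benjamini_hochberg([0.01, 0.045, 0.029, 0.005, 0.20])
```

With five metrics the rank thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05. The three smallest p-values (0.005, 0.01, 0.029) clear theirs and are declared significant, while 0.045 and 0.20 are not.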

Interaction effects

Orthogonal layers assume tests don’t interact. Two flags that touch the same surface (say two competing UI changes) can interact, so their combined effect isn’t the sum of their individual effects. Put such flags in an exclusion group, or run an explicit interaction analysis.

Novelty & primacy

Early metrics mislead: novelty effects inflate early numbers (users click the shiny new thing), while primacy / learning-curve effects deflate them (users need time to adjust). Both fade. Require a minimum run length, and for big launches keep a long-term holdback to measure durable impact rather than the first-week spike.

Decision scorecard

Ship/no-ship gates on all of:

1. Primary metric significant (under the chosen valid stopping rule)

2. No guardrail regression

3. SRM-clean

4. Minimum runtime met

Automated across the entire portfolio — adjudicated by the engine, not by hand, because at thousands of tests hand-adjudication is where false wins sneak through.
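The four gates can be encoded as one small function so the same rule is applied to every experiment. This is a sketch under assumptions: the `min_days` and `srm_threshold` defaults are placeholders, the "keep-running" outcome is an addition for tests that have not yet met minimum runtime, and `primary_significant` is assumed to come from whichever valid stopping rule the test uses.

```python
def decide(primary_significant: bool, guardrail_regressions: list,
           srm_p_value: float, days_run: int,
           min_days: int = 7, srm_threshold: float = 0.001) -> str:
    """Return 'invalid', 'no-ship', 'keep-running', or 'ship'."""
    if srm_p_value < srm_threshold:
        return "invalid"       # broken split: refuse to read the effect at all
    if guardrail_regressions:
        return "no-ship"       # any regressed guardrail blocks the launch
    if days_run < min_days:
        return "keep-running"  # too early to trust novelty-affected numbers
    return "ship" if primary_significant else "no-ship"
```

The SRM check comes first on purpose: when the split is broken, none of the other inputs can be trusted, so nothing downstream is evaluated.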

8

Failure modes, observability & rollout

The two planes fail in opposite directions, and the design has to encode that.

Serving plane fails safe

If an SDK can’t reach config, it serves the last-known-good ruleset or the control default. A flag system must never fail open to a half-rolled-out variant — an outage should look like “everyone’s on control,” never “everyone’s on the untested experimental code path.”

Stats plane fails loud

If SRM trips or exposures are missing, the stats plane refuses to emit a decision rather than showing a clean-looking but invalid result. A blank scorecard with a loud error is correct; a confident-looking number computed on a broken split is a disaster.

| Failure mode | Plane | Mitigation |
|---|---|---|
| Config unreachable | Serving | Serve last-known-good / control (fail safe) |
| Half-rolled-out variant on outage | Serving | Default to control; never fail open |
| SRM trips | Stats | Invalidate; refuse to emit (fail loud) |
| Missing/late exposures | Stats | Block decision; alert on per-arm volume |
| Guardrail regression | Both | Auto-rollback via config push (kill-switch) |
| Bucket reuse carryover | Assignment | Re-randomize with a new salt |

Observability

Tag every trace, metric, and log with the active variant assignments, so an APM latency spike can be attributed to a specific variant. Monitor exposure volume per arm in real time — a sudden imbalance is the earliest SRM tripwire, often visible before the batch chi-squared even runs.

Rollout discipline

Ramp 1% → 5% → 50% with guardrails armed at every step. On a guardrail breach, auto-rollback via the kill-switch path — a config push the SDK picks up on its next poll. The ramp limits blast radius; the kill-switch bounds time-to-recover.

Cost / scale reality

Store sufficient statistics (counts, sums, sums-of-squares), not raw rows. That keeps the stats engine cheap enough to recompute every scorecard for thousands of tests on every batch cycle — which is what makes portfolio-wide automated gating affordable in the first place.

Summary

Thesis: the platform is a trust contract. Every result is invalid until it is SRM-clean, exposure-correct, and read under a valid stopping rule. The stats/decision engine is the product; flag delivery is table stakes.

The four staff signals:

1. Separates the planes — sub-ms fail-safe serving vs fail-loud batch stats, each with the right consistency model.

2. Defaults to sequential inference — so daily dashboard peeking and continuous kill-switches are valid by construction, not violations.

3. Treats SRM as a decision-invalidating gate — a broken split kills the experiment, it doesn’t earn a footnote.

4. Knows when randomization breaks — detects SUTVA/interference in marketplaces and reaches for switchback, cluster, or quasi-experimental (diff-in-diff / synthetic control) designs.

Landmines that fail a candidate:

  • A t-test on peeked data.
  • Ignoring SRM (or treating it as just another metric).
  • Counting eligible-but-not-exposed units (dilution).
  • Applying user-level A/B to a two-sided marketplace.
  • Uncorrected multiple comparisons across thousands of tests.

Switcher takeaway: your flag, pipeline, and low-latency infra fluency is ~60% of this system. Adding the inference layer — CUPED, sequential testing, SRM, FDR, CATE — is the credential that moves you from platform SDE into AI/ML platform roles.

Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
|---|---|---|
| Problem framing & trust contract | Designs flag delivery + a t-test on the metric; treats a significant p-value as the answer. | Frames the platform as a trust contract: every result is invalid until SRM, exposure integrity, and stopping rule are cleared; decision engine, not just a calculator. |
| Assignment & consistency | Hashes user_id to a bucket; knows assignment must be deterministic. | Salts hash with experiment/layer id, designs orthogonal vs exclusive layers, sticky bucketing across sessions/devices, and reasons about bucket count vs concurrency and reuse bias. |
| Exposure & metric integrity | Logs an event when the user enters the experiment; joins to metrics in a batch job. | Triggered analysis (only count exposed units), exactly-once-ish exposure semantics (dedupe plus idempotent joins), and dilution control; knows late/duplicate exposures bias the estimate. |
| Variance reduction & power | Computes sample size from baseline + MDE; runs to fixed horizon. | Applies CUPED with a pre-period covariate for ~20-50% variance reduction (=> ~same % fewer users / days), and reasons about new-user dilution and covariate choice. |
| Stopping rules & peeking | Runs to a fixed sample size; aware peeking is bad. | Implements always-valid inference (mSPRT) or group-sequential (O'Brien-Fleming) so continuous monitoring and kill-switches don't inflate Type-I error; explains the tradeoff vs fixed-horizon power. |
| Causal rigor at scale | Reports average treatment effect per metric. | Controls FDR (Benjamini-Hochberg) across many metrics/tests, estimates CATE (X-learner / causal forest) for heterogeneity, and handles novelty/primacy and interaction effects. |
| When randomization breaks | Assumes user-level randomization always works. | Detects SUTVA/interference in two-sided markets, switches to switchback (geo×time) or cluster randomization, and uses diff-in-diff / synthetic control when a live test is impossible. |
| Operability & org scale | One pipeline; results in a dashboard. | Separates sub-ms assignment plane (must never fail) from batch stats plane (must never lie); guardrail metrics with auto kill-switch, SRM alerting, and a scorecard that gates ship/no-ship across thousands of tests. |