Design a Real-Time Payment Fraud & Risk Scoring System
A hard real-time risk-scoring gate that returns allow / review / block in under 100ms on the money-critical authorization path. The hard parts are asymmetric costs (blocking a good customer vs. eating fraud loss), extreme class imbalance with delayed adversarial labels, and a low-latency online feature store feeding a co-located model with a deterministic rule fallback. This is the kind of system Stripe (Radar), Block, PayPal, and Adyen build and interview AI/ML and infra candidates on.
Frame the gate & the asymmetric objective
“This isn’t a ranking problem — it’s a real-time gate on the money path where a wrong block is an instant, visible business loss and the other side adapts to beat you.”
Most candidates reach for an offline classifier and an AUC number. That answer loses here. The system is a synchronous decision gate that sits inside the card-authorization path: every transaction waits on your verdict before the money moves, the answer is due in under 100ms, and the consequences of being wrong are dollar-denominated and asymmetric. Lead with that framing and the rest of the design follows.
Who asks this & what they probe
Three very different interviewers hide behind this prompt. Read which one you have and lead accordingly.
Three actions, not two. The decision is not block-or-allow
The gate returns one of three verdicts:
- ALLOW — let the authorization proceed.
- BLOCK — decline the transaction outright.
- REVIEW / STEP-UP — route the transaction to a 3D Secure (3DS) challenge or a human review queue. A successfully authenticated 3DS transaction generally shifts liability for fraud chargebacks to the issuing bank, which is the single most important lever juniors miss: it converts a binary, lossy block/allow choice into a graded response that can recover a risky-but-legitimate sale at low cost.
The costs are asymmetric and dollar-denominated. Optimize expected cost, not accuracy
Accuracy is meaningless when one class is a fraction of a percent of traffic. Worse, the two error types cost wildly different amounts, so even AUC is the wrong north star. Frame the four confusion-matrix cells in dollars:
A False Negative (allowed fraud) costs the transaction amount plus a chargeback fee of roughly $15-25 plus, at scale, network monitoring-program fines if your fraud ratio crosses a threshold. A False Positive (blocked good transaction) costs the margin on a lost sale plus eroded trust plus possible permanent churn of a good customer — much harder to measure, easy to ignore, and the failure mode that quietly destroys revenue.
Scale anchors. A large network processes thousands of transactions per second at peak. Fraud is a small fraction of attempts, but losses are concentrated in a few high-value patterns. That combination — rare events, concentrated dollar loss, asymmetric costs — is exactly why the objective is expected dollar cost, not accuracy or even AUC.
Explicitly out of scope. This is the authorization-time risk gate. The weeks-later chargeback dispute / representment system is a separate pipeline — we acknowledge it exists because it is our primary label source, but we are not designing the dispute workflow here.
Requirements, SLOs & the cost function
Functional requirements
- Score every authorization attempt and return a calibrated risk score plus an action (allow / review / block).
- Support per-merchant and per-segment policy overrides (different merchants have different risk appetites and margins).
- Expose a human-readable reason for each decision, for the review queue and for dispute / representment evidence.
Non-functional requirements & SLOs
Two subtleties worth stating out loud. First, p99 is the SLO that matters, not the mean — a mean of 20ms with a fat tail still times out real authorizations. Second, fail-safe, not fail-open: if scoring is unavailable, we do not allow everything (that is a fraudster’s dream); we fall back to a deterministic rule engine that still makes a real decision.
The objective is expected cost
State the cost function explicitly — this is what separates a Staff answer from “pick a threshold around 0.5”:
The optimal decision threshold is the point where the marginal cost of blocking one more transaction equals the marginal cost of allowing it — emphatically not 0.5 on a raw model output. Because C_FN scales with the transaction amount, the threshold itself must scale with amount (more on this in Step 5).
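One plausible way to write the cost function down, offered as a sketch rather than a canonical form. The symbols are assumptions introduced here: p is the calibrated fraud probability, a is the transaction amount, C_FN(a) is the cost of allowed fraud (roughly the amount plus the chargeback fee), and C_FP is the cost of a false decline. The review action would add a third term (challenge friction and analyst cost), which is left out to keep the break-even visible.

```latex
\begin{aligned}
\mathbb{E}[\text{cost} \mid \text{allow}] &= p \cdot C_{FN}(a) \\
\mathbb{E}[\text{cost} \mid \text{block}] &= (1 - p) \cdot C_{FP} \\
\text{block when}\quad p \cdot C_{FN}(a) &> (1 - p) \cdot C_{FP}
\;\Longleftrightarrow\;
p > p^{*}(a) = \frac{C_{FP}}{C_{FP} + C_{FN}(a)}
\end{aligned}
```

With C_FP held roughly fixed, the break-even probability p*(a) falls as C_FN(a) grows with the amount, which is why a single global threshold cannot be optimal.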
Idempotency. Scoring is keyed to the payment-attempt id. Network retries and client re-sends are common; without an idempotency key, a retry inflates velocity counters and can flip a decision between attempts of the same logical payment. Same attempt id → same cached decision and no double-counted state.
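A minimal in-memory sketch of that idempotency rule. The class `IdempotentScorer`, its fields, and the "more than 3 attempts" rule are hypothetical and stand in for a real decision cache and velocity store.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentScorer:
    """Toy stand-in for the decision cache and one velocity counter."""
    decisions: dict = field(default_factory=dict)      # attempt_id -> cached verdict
    card_attempts: dict = field(default_factory=dict)  # card -> attempts counted

    def score(self, attempt_id: str, card: str) -> str:
        # A retry of the same logical payment returns the cached verdict
        # and must not touch any counter.
        if attempt_id in self.decisions:
            return self.decisions[attempt_id]

        count = self.card_attempts.get(card, 0) + 1
        self.card_attempts[card] = count

        # Hypothetical rule: more than 3 distinct attempts on a card -> review.
        decision = "REVIEW" if count > 3 else "ALLOW"
        self.decisions[attempt_id] = decision
        return decision


scorer = IdempotentScorer()
first = scorer.score("attempt-1", "card_x")
retry = scorer.score("attempt-1", "card_x")  # network retry of the same attempt
```

In a shared store the check-and-set would need to be atomic across serving hosts (for example a Redis `SET` with the `NX` option), which this single-process sketch does not model.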
Hot-path serving architecture
The request path
The scoring call is a synchronous gRPC request with a hard deadline. Features are fetched in parallel (fan-out), the model runs in-process, and the full feature vector plus score plus action is emitted asynchronously to a log stream — that log is the raw material for training and for online/offline parity checks (Step 7).
Online feature store
An in-memory key-value store (Redis, or a DynamoDB-class managed KV) holds velocity counters and per-entity state. A Redis GET runs at roughly 0.12ms p50 and 0.45ms p99, so the cost of reading any single feature is trivial, as long as fetches are parallelized rather than issued serially. Ten serial 0.45ms reads add up to 4.5ms of avoidable tail; ten parallel reads cost about 0.45ms. Counters are updated with atomic operations (INCR, hash-field increments such as HINCRBY) so the current attempt is reflected without a read-modify-write race between concurrent transactions on the same card.
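The fan-out can be sketched with `asyncio.gather`. Everything named here (`FAKE_STORE`, `fetch_feature`, the key names, the 5ms simulated latency) is a hypothetical stand-in for a real KV client; the in-flight counter exists only to show that all reads are outstanding at once.

```python
import asyncio

# Hypothetical feature keys and values standing in for the online store.
FAKE_STORE = {"card:42:cnt_1h": 3, "device:9:cards_1h": 14, "ip:7:cnt_1m": 1}

in_flight = 0
max_in_flight = 0


async def fetch_feature(key: str, latency_s: float = 0.005):
    """Simulate one KV read; the sleep plays the role of a network round trip."""
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    await asyncio.sleep(latency_s)
    in_flight -= 1
    return key, FAKE_STORE.get(key, 0)  # missing counters default to 0


async def fetch_all(keys):
    # gather() schedules every read before awaiting any of them, so wall
    # time is roughly the slowest read rather than the sum of all reads.
    pairs = await asyncio.gather(*(fetch_feature(k) for k in keys))
    return dict(pairs)


features = asyncio.run(fetch_all(list(FAKE_STORE) + ["card:42:cnt_24h"]))
```

With a real Redis deployment, a pipeline or a single `MGET` would likely achieve the same effect in one round trip; the point of the sketch is only that reads must not be issued one after another.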
Model co-location
The model runs in the serving process (in-process library, or a sidecar on the same host) so the hot path never makes a network hop to a separate model server. A cross-host round trip to a model-serving cluster would add 1-10ms of tail and a new failure dependency for a model that itself only needs 1-2ms to run. Co-locate it.
Deadline & graceful degradation
Every request carries a hard deadline. If feature fetches or inference threaten to blow the budget, we abort the ML path and fall back to a deterministic rule engine — blocklists, hard velocity caps, amount and geo rules. The payment still resolves with a real decision; it is never left hanging on a slow model. The fallback is also the floor of safety: even when the model is healthy, the rules run as guardrails.
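A rough sketch of the deadline-plus-fallback pattern, assuming an async serving loop. `rule_fallback`, `ml_path`, the rule thresholds, and the 50ms default deadline are all hypothetical placeholders.

```python
import asyncio

SEVERITY = {"ALLOW": 0, "REVIEW": 1, "BLOCK": 2}


def rule_fallback(txn: dict) -> str:
    """Deterministic rules: a hypothetical blocklist, velocity cap, and amount cap."""
    if txn["card"] in {"blocked_card"}:
        return "BLOCK"
    if txn["attempts_1m"] > 5 or txn["amount"] > 10_000:
        return "REVIEW"
    return "ALLOW"


async def ml_path(txn: dict, delay_s: float) -> str:
    """Stand-in for feature fetch + inference; delay_s simulates a slow path."""
    await asyncio.sleep(delay_s)
    return "ALLOW"


async def decide(txn: dict, delay_s: float, deadline_s: float = 0.05):
    rules = rule_fallback(txn)
    try:
        model = await asyncio.wait_for(ml_path(txn, delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        # Budget blown: the payment still resolves with a real decision.
        return rules, "rules"
    # Model healthy: rules still act as guardrails, so take the stricter verdict.
    return max(model, rules, key=SEVERITY.get), "model+rules"


txn = {"card": "c1", "amount": 20.0, "attempts_1m": 9}
slow = asyncio.run(decide(txn, delay_s=0.5, deadline_s=0.05))
fast = asyncio.run(decide(txn, delay_s=0.0, deadline_s=0.5))
```

Taking the stricter of the two verdicts is one simple way to encode "rules as a floor"; a real policy engine may combine them differently.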
p99 latency budget
The lesson of the budget: inference is the cheapest line item. The discipline is bounding the online feature set and parallelizing its fetch, not optimizing the model’s FLOPs.
Feature engineering & the entity graph
Velocity & entity state
The richest cheap signal is velocity — rates of activity across multiple entities at multiple time windows:
- Entities: card, device, IP, email, merchant.
- Windows: 1 minute / 1 hour / 24 hours / 7 days.
- Aggregations: count of attempts, distinct cards seen per device, distinct merchants per card, decline rate, sum of amounts.
A device that just tried fourteen distinct cards in two minutes is a near-certain card-testing attack, and that signal is a handful of counter reads. Network-wide per-card state is the platform’s structural advantage: because we see a card across the entire network, not just one merchant, we know whether this card has been probed elsewhere in the last hour and in what pattern — something no single merchant can see.
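To make the card-testing signal concrete, here is a small in-memory sketch of a per-device sliding window. `VelocityTracker` and its method names are hypothetical; a production version would live in the online store and would more likely use time-bucketed counters or approximate distinct counts than raw event lists.

```python
from collections import defaultdict, deque

MAX_WINDOW_S = 7 * 24 * 3600  # longest window listed above: 7 days


class VelocityTracker:
    """Keeps (timestamp, card) events per device; assumes events arrive in time order."""

    def __init__(self):
        self.events = defaultdict(deque)  # device_id -> deque of (ts, card)

    def record(self, device_id: str, card: str, ts: float) -> None:
        q = self.events[device_id]
        q.append((ts, card))
        # Evict anything older than the longest window we ever query.
        while q and q[0][0] <= ts - MAX_WINDOW_S:
            q.popleft()

    def attempts(self, device_id: str, now: float, window_s: float) -> int:
        return sum(1 for ts, _ in self.events[device_id] if now - window_s < ts <= now)

    def distinct_cards(self, device_id: str, now: float, window_s: float) -> int:
        return len({c for ts, c in self.events[device_id] if now - window_s < ts <= now})


tracker = VelocityTracker()
for i in range(14):  # fourteen different cards, one every 8 seconds
    tracker.record("dev1", f"card{i}", ts=1000.0 + 8 * i)
now = 1000.0 + 8 * 13
cards_2m = tracker.distinct_cards("dev1", now, window_s=120)
cards_1m = tracker.distinct_cards("dev1", now, window_s=60)
```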
Categorical embeddings
High-cardinality categorical fields — issuer BIN, merchant id, country, MCC (merchant category code) — are poorly served by one-hot encoding (millions of sparse columns, no generalization). Learn dense embeddings so the model can express “this issuer+country pair behaves like that one,” generalizing across rare combinations instead of treating each as an isolated indicator.
Graph associations
Organized fraud is a ring: many accounts sharing a few devices, IPs, or cards. Catch it with graph features — neighborhood aggregates over a transaction graph (shared devices/IPs/cards across accounts), computed with GNN-style methods (GraphSAGE, R-GCN). Full graph traversal is far too slow for a sub-100ms hot path, so these neighborhood embeddings are computed offline in batch and cached, then read like any other feature at decision time.
Online-cheap vs precomputed, and point-in-time correctness
Two non-negotiables. Point-in-time correctness: training features must be reconstructed as of the decision time — never leak a counter value or a label that only existed after the transaction, or you train a model that can’t exist in production. Feature parity: the online and offline definitions of every feature must be byte-for-byte identical, or the model sees different inputs in training and serving and silently degrades.
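Point-in-time reconstruction amounts to an "as of" lookup against each feature's history. The sketch below assumes a hypothetical per-key history of `(timestamp, value)` pairs and a strict-before convention; whichever convention is chosen (including or excluding the current attempt), it has to match what serving does.

```python
import bisect


def as_of(history, decision_ts):
    """Return the latest value written strictly before decision_ts.

    history: list of (ts, value) pairs sorted by ts.
    Returns None when the feature did not exist yet at decision time.
    """
    timestamps = [ts for ts, _ in history]
    # bisect_left finds the first entry with ts >= decision_ts, so the
    # entry just before it is the last one written strictly earlier.
    idx = bisect.bisect_left(timestamps, decision_ts)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of a card's 1-hour attempt counter.
card_cnt_1h = [(100, 1), (200, 2), (300, 3)]
value_at_250 = as_of(card_cnt_1h, 250)
```

A training row for a transaction decided at ts=250 would use the value 2; using the value 3 written at ts=300 would leak the future.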
Modeling under extreme imbalance + novelty
Supervised core
For tabular fraud signals, gradient-boosted decision trees (XGBoost-class) are the proven workhorse: strong on heterogeneous tabular features, fast to train, fast to score, insensitive to feature scaling, and able to handle missing values natively. At very large platforms with rich signal, the heavy model has moved toward deep architectures. Stripe’s Radar, for example, moved off pure XGBoost toward a ResNeXt-inspired deep network to scan far more signals per transaction. The right default in an interview is GBDT, with deep nets named as the scale-up path.
Handling extreme imbalance
Fraud is a tiny fraction of attempts, so the modeling problem is dominated by imbalance:
Prefer cost-sensitive learning — weight a false negative far more heavily than a false positive in the loss — over resampling. SMOTE-style synthetic minority oversampling tends to lift recall at the cost of precision, and in this domain lost precision means false declines, the expensive, invisible error.
Unsupervised novelty layer
The supervised model can only catch patterns it has labels for. The adversary’s whole job is to invent patterns you’ve never seen. So run an anomaly layer in parallel — an isolation forest, or an autoencoder scored on reconstruction error — that flags transactions far from the normal manifold even with no fraud label. This is the early-warning sensor for novel attacks.
Why not just one model
Combine three signals into an ensemble, each with a distinct job:
- Supervised score — the calibrated probability of fraud on known patterns.
- Anomaly score — novelty / never-seen-before signal.
- Hard rules — fast, auditable guardrails and the timeout fallback from Step 2.
Metric discipline
Optimize PR-AUC and recall at a fixed precision, plus dollar-weighted loss — never plain accuracy or ROC-AUC. With a negative class that dominates, ROC-AUC looks great while the model is useless at the operating point you actually care about; precision-recall and dollar-weighted metrics tell the truth about the rare, expensive positives.
Calibration & the allow / review / block thresholds
Calibration
A gradient-boosted tree’s raw output is a score, not a probability — a 0.7 does not mean 70% fraud. To compute expected cost you need a real probability, so calibrate: fit Platt scaling (a logistic on the raw score) or isotonic regression on held-out data so that a calibrated 0.02 means a genuine 2% fraud probability. Only after calibration can the threshold be derived from the cost function rather than guessed.
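As an illustration of the isotonic option, here is a dependency-free sketch of the pool-adjacent-violators idea. The function names and the toy data are hypothetical, and the step-function lookup is a simplification; a library implementation such as scikit-learn's `IsotonicRegression` interpolates between points instead.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators on (raw score, 0/1 label) pairs.

    Returns a list of (upper_score_bound, calibrated_probability) blocks
    whose probabilities are non-decreasing in the raw score.
    """
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean is not below this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, n, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += n
            blocks[-1][2] = hi
    return [(hi, total / n) for total, n, hi in blocks]


def isotonic_predict(model, score):
    """Step-function lookup: first block whose upper bound covers the score."""
    for hi, p in model:
        if score <= hi:
            return p
    return model[-1][1]


raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
fraud = [0, 0, 1, 0, 0, 1, 1, 1]
model = isotonic_fit(raw, fraud)
calibrated = isotonic_predict(model, 0.45)
```

On this toy data the raw score 0.45 maps to one third, the observed fraud rate in its pooled block, rather than to 0.45.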
From score to action
Two thresholds carve the calibrated score line into three regions:
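T_high and T_low come from the cost curve, not intuition. The Bayes-optimal decision is to block when the expected cost of allowing exceeds the expected cost of blocking: p × C_FN(amount) > (1 − p) × C_FP, where p is the calibrated fraud probability, C_FN is the cost of allowed fraud, and C_FP is the cost of a false decline.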
Because the left side scales with amount, the threshold moves with the transaction value: you should block a $5,000 transaction at a much lower fraud probability than a $5 one. A single global probability threshold is wrong by construction.
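A worked example of how the break-even probability moves with amount. The numbers are assumptions chosen for illustration: a flat false-decline cost of $5 and a $20 chargeback fee, with the cost of allowed fraud taken as amount plus fee. The block rule is p × C_FN > (1 − p) × C_FP, which rearranges to p > C_FP / (C_FP + C_FN).

```python
def break_even_probability(amount: float, c_fp: float = 5.0,
                           chargeback_fee: float = 20.0) -> float:
    """Fraud probability above which blocking is cheaper than allowing.

    c_fp and chargeback_fee are illustrative assumptions, not measured costs.
    """
    c_fn = amount + chargeback_fee  # allowed fraud loses the amount plus the fee
    return c_fp / (c_fp + c_fn)


def should_block(p: float, amount: float) -> bool:
    return p > break_even_probability(amount)


small = break_even_probability(5.0)      # 5 / 30, about 0.167
large = break_even_probability(5000.0)   # 5 / 5025, about 0.001
```

Under these assumptions a 2% fraud probability is allowed on a $5 purchase and blocked on a $5,000 one. The effect depends on C_FP being roughly independent of amount; if the lost margin is itself proportional to amount, the threshold moves much less, which is one reason the costs should be estimated per segment rather than assumed.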
The review band
The middle region is where step-up earns its keep. Routing the uncertain band to a 3DS challenge inconveniences a good customer minimally (one extra auth step) while, on success, shifting chargeback liability to the issuer. That turns a hard, lossy block/allow decision into a graded response — the third action from Step 0, now operationalized.
Per-segment thresholds. Different merchants, geographies, and product types have different fraud base rates and different margins, so T_low and T_high are policy-driven per segment, not one global pair of numbers.
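The two-threshold mapping with per-segment policy can be sketched in a few lines. The segment names and every threshold value below are made up for illustration; real values would come from each segment's cost curve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    t_low: float   # below this: ALLOW
    t_high: float  # at or above this: BLOCK


# Hypothetical policies; a low-margin, high-fraud segment gets tighter thresholds.
DEFAULT_POLICY = Policy(t_low=0.01, t_high=0.20)
POLICIES = {
    "digital_goods": Policy(t_low=0.005, t_high=0.10),
    "grocery": Policy(t_low=0.03, t_high=0.40),
}


def action(p: float, segment: str) -> str:
    """Map a calibrated fraud probability to one of the three verdicts."""
    policy = POLICIES.get(segment, DEFAULT_POLICY)
    if p >= policy.t_high:
        return "BLOCK"
    if p >= policy.t_low:
        return "REVIEW"
    return "ALLOW"


same_score = [action(0.02, "digital_goods"), action(0.02, "grocery")]
```

The same calibrated score of 0.02 lands in the review band for one segment and is allowed outright in the other.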
Watch both sides. Monitor false-decline rate and approval rate right next to fraud rate. Over-tightening to crush fraud silently destroys revenue — the failure mode juniors never instrument because the fraud dashboard looks fantastic while the business bleeds.
Deep dive — delayed, adversarial & biased labels
This is where a Staff answer is won. Everything above assumes you have clean labels to train on. You do not. Fraud labels are delayed, selection-biased, and adversarial all at once, and handling that is the hardest and most distinctive part of the system.
Label delay & right-censoring
A fraud label arrives as a chargeback, which lands weeks to ~90+ days after the transaction. So a transaction that looks “good” today is really “no chargeback yet” — not confirmed legitimate.
Recent data is right-censored: training naively on the last few weeks under-counts fraud, because the chargebacks simply haven’t posted yet. Treat label maturity explicitly: count a transaction as a confirmed negative only once its dispute window has closed, and model the censoring rather than pretending recent “good” labels are final.
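A minimal sketch of a maturity-aware labeling rule. The 120-day `DISPUTE_WINDOW` is an assumed placeholder, not a network rule, and the function name is hypothetical.

```python
from datetime import date, timedelta

DISPUTE_WINDOW = timedelta(days=120)  # assumption; set from the actual dispute rules


def training_label(txn_date: date, chargeback_date, today: date):
    """Return 1 (fraud), 0 (matured negative), or None (still censored)."""
    if chargeback_date is not None and chargeback_date <= today:
        return 1  # a posted chargeback is a positive regardless of age
    if today - txn_date >= DISPUTE_WINDOW:
        return 0  # window closed with no chargeback: confirmed negative
    return None   # "no chargeback yet" is not the same as legitimate


today = date(2024, 6, 1)
labels = [
    training_label(date(2024, 1, 1), None, today),               # old, clean
    training_label(date(2024, 5, 1), None, today),               # recent, unknown
    training_label(date(2024, 5, 1), date(2024, 5, 20), today),  # recent, charged back
]
```

Rows labeled `None` can be dropped from the slow layer's training set or handled with an explicit censoring model; either way they should not be counted as negatives.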
Selection bias / the feedback loop
The deeper trap: you only observe outcomes for transactions you ALLOWED. A transaction you blocked never reaches the network, so it never produces a chargeback — it has no ground truth at all. Train naively and the model learns from a censored slice of its own past decisions, growing ever more confident about a boundary it can no longer see past.
Mitigations:
- Reject inference — statistically infer the likely outcome of rejected transactions so they don't vanish from the training distribution.
- Randomized exploration budget — deliberately allow through a small, randomized fraction of transactions you would have blocked, accepting bounded fraud loss to buy an unbiased label signal at the boundary. This is the only way to know what's happening in the region you keep blocking.
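The exploration budget can be sketched as a deterministic coin flip on would-be blocks. The 1% rate, the salt, and the function names are assumptions. Hashing the attempt id keeps retries of the same payment consistent, and recording the propensity lets training reweight explored rows by its inverse.

```python
import hashlib

EXPLORE_RATE = 0.01           # assumed budget: 1% of would-be blocks
SALT = "rotate-me-regularly"  # hypothetical secret so attackers cannot predict the draw


def _unit_hash(attempt_id: str, salt: str = SALT) -> float:
    """Map an attempt id to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{attempt_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def apply_exploration(model_action: str, attempt_id: str, rate: float = EXPLORE_RATE):
    """Return (final_action, explored, propensity) for logging alongside the decision."""
    if model_action != "BLOCK":
        return model_action, False, None
    if _unit_hash(attempt_id) < rate:
        # Deliberately let this one through to buy an unbiased label.
        return "ALLOW", True, rate
    return "BLOCK", False, rate


results = [apply_exploration("BLOCK", f"attempt-{i}") for i in range(20000)]
explored_share = sum(1 for _, explored, _ in results if explored) / len(results)
```

In practice the budget would likely be capped by amount as well, so the bounded loss stays bounded in dollars and not just in transaction count.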
Adversarial drift
This is concept drift caused by an adversary, not by nature. Fraudsters actively probe your boundary, adapt, and re-attack. A static model decays fast because the decision boundary is a live target someone is paid to defeat. You cannot retrain quarterly and expect to keep up.
Two-speed architecture
Resolve the tension between “labels take 90 days” and “the adversary moves in hours” by splitting the system into two layers running at different clocks:
The fast layer is rules + a frequently-retrained light model + the anomaly detector from Step 4. It reacts to a new card-testing pattern within hours — long before a single chargeback for that pattern has posted. The slow layer provides the stable, well-calibrated backbone trained only on labels that have fully matured.
Keep a clean evaluation signal
Maintain a small randomized challenge / holdout stream and watch score- and feature-distribution drift. A quiet shift in the input distribution is the early warning that fires before the fraud rate visibly moves — by the time fraud rate spikes, you’ve already lost weeks. Input drift is the smoke detector; fraud rate is the fire alarm.
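One common way to quantify score-distribution drift is the population stability index (PSI), sketched here in plain Python. The choice of PSI, the ten equal-width bins, and the sample data are assumptions; a widely used rule of thumb treats values above roughly 0.25 as a significant shift, but the alert level should be tuned on real traffic.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two samples of scores in [0, 1]."""

    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(xs), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]     # last week's score sample (toy data)
shifted = [0.5 + 0.5 * x for x in baseline]  # scores pushed into the upper half
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

The same function applies to any individual feature once it is scaled or bucketed, which is how input drift can be watched per feature rather than only on the final score.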
Frequent retrain, gated rollout
“The ability to retrain on new data is what enables adaptation to new fraud,” so retrain often and roll models out rapidly. But a bad fraud model is an instant, network-wide loss, so every model ships behind shadow + canary (Step 7). Fast iteration and safe rollout are not in tension — the canary is what lets you iterate fast without betting the network on each new model.
Reliability, deployment & human-in-the-loop
Safe model rollout
Never flip a fraud model globally in one step. Roll out in stages:
1. Shadow — the new model scores live traffic, but its decisions are not applied; compare its would-be decisions against production.
2. Canary — apply it to a small percentage of traffic with automated rollback triggered by any regression in fraud rate or decline rate.
3. Ramp — widen only after the canary holds on both fraud and false-decline metrics.
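The canary gate in stages 2 and 3 reduces to a comparison between control and canary metrics. The sketch below assumes a hypothetical 10% relative-regression tolerance and hypothetical metric names. Because confirmed fraud rate lags by weeks, a real gate would have to rely on faster proxies for the fraud side.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    fraud_rate: float    # or a fast proxy for it, since chargebacks lag
    decline_rate: float


def canary_verdict(control: Metrics, canary: Metrics,
                   max_rel_regression: float = 0.10) -> str:
    """Return 'ROLLBACK' if either metric regresses beyond tolerance, else 'RAMP'."""

    def regressed(new: float, old: float) -> bool:
        return new > old * (1 + max_rel_regression)

    if regressed(canary.fraud_rate, control.fraud_rate):
        return "ROLLBACK"
    if regressed(canary.decline_rate, control.decline_rate):
        return "ROLLBACK"
    return "RAMP"


control = Metrics(fraud_rate=0.0010, decline_rate=0.020)
over_blocking = Metrics(fraud_rate=0.0006, decline_rate=0.030)
verdict = canary_verdict(control, over_blocking)
```

The example canary cuts fraud but declines 50% more traffic, so it is rolled back; gating on both metrics is what prevents a model from "winning" by over-blocking.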
Review queue
The REVIEW action feeds a human-analyst queue with an SLA. This closes the label loop much faster than chargebacks: an analyst’s verdict becomes a high-quality label within hours, not 90 days, and feeds straight back into the fast-layer training set. Human review is not just risk mitigation — it’s your fastest clean label source.
Observability
- Online/offline parity guard — continuously verify that features served online equal features computed offline for the same event. Training/serving skew here silently degrades the model and is a classic, hard-to-find production bug.
- Idempotency & audit — log every decision with its exact feature vector, model version, and reason, both as dispute / representment evidence and to debug adversarial probing after the fact.
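A parity guard can be as simple as diffing the logged online feature vector against the offline recomputation for the same event. The function name, feature names, and float tolerance below are assumptions; exact equality is the goal, with a small tolerance only for floating-point aggregates.

```python
import math


def parity_diff(online: dict, offline: dict, rel_tol: float = 1e-9) -> dict:
    """Return {feature: (online_value, offline_value)} for every mismatch."""
    mismatches = {}
    for name in sorted(online.keys() | offline.keys()):
        a, b = online.get(name), offline.get(name)
        both_numbers = isinstance(a, (int, float)) and isinstance(b, (int, float))
        same = math.isclose(a, b, rel_tol=rel_tol) if both_numbers else a == b
        if not same:
            mismatches[name] = (a, b)
    return mismatches


# Hypothetical logged vector vs. offline recomputation for one decision.
online = {"cnt_1h": 3, "amt_sum_24h": 120.5, "country": "DE"}
offline = {"cnt_1h": 4, "amt_sum_24h": 120.5, "country": "DE", "bin": "411111"}
skew = parity_diff(online, offline)
```

Here the guard reports a counter that disagrees and a feature that exists offline but was never served; the mismatch rate per feature is the number to alert on.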
Dashboards & alerts:
- p99 latency and fallback-invocation rate (how often the rules fired instead of the model)
- approval rate, false-decline rate, fraud rate (settled + censoring-adjusted estimate)
- score-distribution drift, feature-freshness lag
Privacy / compliance. Stay inside PCI-DSS scope: use network tokenization so the hot path handles tokens, not raw PANs, and respect data residency by computing region-local features where required.
Tradeoffs, failure modes & what I'd cut
Tensions I’m holding
The right answer to the fraud-vs-decline tension is always the cost curve. The wrong answer — the one that sounds responsible and quietly destroys revenue — is “just raise the threshold.”
What I’d cut for v1
Ship the supervised GBDT + rules + calibrated two-threshold decision + deterministic rule fallback first. Defer the GNN/graph layer and the reject-inference loop to v2, once the serving SLO is proven in production. The hot-path latency budget and the fallback are non-negotiable from day one; the fancier features are precision gains you add after the gate is reliably fast.
How I’d know it’s failing
- Silent over-blocking — fraud rate looks great while revenue quietly drops. Caught by the false-decline / approval-rate dashboards, never the fraud dashboard.
- Label feedback collapse — the model stops seeing the fraud it blocks and drifts toward over-confidence. Caught by the randomized exploration holdout.
- Adversarial boundary mapping — attackers binary-search your threshold with small probes. Mitigated by randomized challenge thresholds and frequent retrains so the boundary keeps moving under their feet.
Summary
Four differentiators a Staff answer puts on the table:
- It's a hard real-time gate on the money path, not a best-effort ranker. A deterministic fallback and a p99 budget are non-negotiable, and the payment must resolve even when the model doesn't.
- The decision is calibrated expected-cost minimization with three actions (allow / review-step-up / block), where the thresholds scale with transaction amount — not a fixed 0.5 on an uncalibrated score.
- Labels are delayed, adversarial, and selection-biased. The standout move is the two-speed loop (fast adaptive + slow stable) plus a randomized exploration budget to keep an unbiased signal.
- Staff signal = naming the failure modes juniors miss: silent over-blocking destroying revenue, feedback-loop label collapse, and an adversary actively mapping your boundary.
If you remember nothing else: optimize dollars of expected cost under a 100ms deadline against an adversary who gets weeks of label delay to study you.
Rubric — Senior vs Staff