
Design a Real-Time Payment Fraud & Risk Scoring System

A hard real-time risk-scoring gate that returns allow / review / block in under 100ms on the money-critical authorization path. The hard parts are asymmetric costs (blocking a good customer vs. eating fraud loss), extreme class imbalance with delayed adversarial labels, and a low-latency online feature store feeding a co-located model with a deterministic rule fallback. This is the kind of system Stripe (Radar), Block, PayPal, and Adyen build and interview AI/ML and infra candidates on.

Level: Staff
Category: AI System Design
Interview time: 60 min


WHAT THIS QUESTION TESTS

· Can you hold a sub-100ms p99 on the money path while still fetching network-wide features?

· Do you make allow/review/block a calibrated cost decision, not an arbitrary 0.5 threshold?

· Do you have a deterministic fallback when the model times out, and does the payment still resolve?

· Do you handle delayed, adversarial, biased chargeback labels in the training loop?

★ STAFF-LEVEL SIGNALS

★ Frames the objective as expected-cost minimization with explicit FP/FN dollar costs, not AUC

★ Introduces a third action (step-up to 3DS) that shifts liability instead of a binary block

★ Names the selection-bias trap: you only learn labels for transactions you allowed

★ Separates the fast adversarial-adaptive layer from the slow stable layer and justifies why

Step 0: Frame the gate & the asymmetric objective

“This isn’t a ranking problem — it’s a real-time gate on the money path where a wrong block is an instant, visible business loss and the other side adapts to beat you.”

Most candidates reach for an offline classifier and an AUC number. That answer loses here. The system is a synchronous decision gate that sits inside the card-authorization path: every transaction waits on your verdict before the money moves, the answer is due in under 100ms, and the consequences of being wrong are dollar-denominated and asymmetric. Lead with that framing and the rest of the design follows.

Who asks this & what they probe

Three very different interviewers hide behind this prompt. Read which one you have and lead accordingly.

| Interviewer | What they own | What they probe |
| --- | --- | --- |
| SDE / Infra | The money-critical hot path | p99 budgeting, online feature-store reads, co-locating the model with serving, graceful degradation to deterministic rules when the model times out (the payment must resolve regardless) |
| MLE / Applied ML | Decision quality | Feature design (velocity, device/IP/card graph), supervised + unsupervised choice under extreme imbalance, calibrating a score into allow/review/block under asymmetric FP vs FN cost, delayed adversarial labels |
| Switcher (SDE → AI) | Both, learning the ML half | Can frame a probabilistic component whose "correctness" is a tunable threshold on a cost curve, whose ground truth arrives weeks late, and whose adversary probes the boundary; lead with serving, then reason about calibration and label delay |

Three actions, not two. The decision is not block-or-allow

The gate returns one of three verdicts:

ALLOW — let the authorization proceed.
BLOCK — decline the transaction outright.
REVIEW / STEP-UP — route the transaction to a 3D Secure (3DS) challenge or a human review queue. A successful 3DS challenge shifts chargeback liability to the issuing bank, which is the single most important lever juniors miss: it converts a binary, lossy block/allow choice into a graded response that can recover a risky-but-legitimate sale at low cost.

The costs are asymmetric and dollar-denominated. Optimize expected cost, not accuracy

Accuracy is meaningless when one class is a fraction of a percent of traffic. Worse, the two error types cost wildly different amounts, so even AUC is the wrong north star. Frame the four confusion-matrix cells in dollars:

| | Predicted allow | Predicted block |
| --- | --- | --- |
| Actually legit | Correct: full margin earned | False Positive: lost sale margin + eroded customer trust + possible permanent churn |
| Actually fraud | False Negative: txn amount + chargeback fee (~$15-25) + network fines / program thresholds | Correct: fraud avoided |

A False Negative (allowed fraud) costs the transaction amount plus a chargeback fee of roughly $15-25 plus, at scale, network monitoring-program fines if your fraud ratio crosses a threshold. A False Positive (blocked good transaction) costs the margin on a lost sale plus eroded trust plus possible permanent churn of a good customer — much harder to measure, easy to ignore, and the failure mode that quietly destroys revenue.

Scale anchors. A large network processes thousands of transactions per second at peak. Fraud is a small fraction of attempts, but losses are concentrated in a few high-value patterns. That combination — rare events, concentrated dollar loss, asymmetric costs — is exactly why the objective is expected dollar cost, not accuracy or even AUC.

Explicitly out of scope. This is the authorization-time risk gate. The weeks-later chargeback dispute / representment system is a separate pipeline — we acknowledge it exists because it is our primary label source, but we are not designing the dispute workflow here.

Step 1: Requirements, SLOs & the cost function

Functional requirements

Score every authorization attempt and return a calibrated risk score plus an action (allow / review / block).
Support per-merchant and per-segment policy overrides (different merchants have different risk appetites and margins).
Expose a human-readable reason for each decision, for the review queue and for dispute / representment evidence.

Non-functional requirements & SLOs

| Property | Target | Why |
| --- | --- | --- |
| Latency (p99) | <100ms end-to-end | Sits inside the issuer auth timeout; tail, not mean, is the SLO |
| Model inference | 1-2ms | Tiny slice; budget is dominated by feature I/O |
| Availability | Gate never blocks the pipeline | If scoring is down, a deterministic fallback decides |
| Throughput | Peak 5-10x baseline | Holiday / flash-sale spikes must not degrade p99 |
| Idempotency | Per payment-attempt id | Retries must not double-count velocity |

Two subtleties worth stating out loud. First, p99 is the SLO that matters, not the mean — a mean of 20ms with a fat tail still times out real authorizations. Second, fail-safe, not fail-open: if scoring is unavailable, we do not allow everything (that is a fraudster’s dream); we fall back to a deterministic rule engine that still makes a real decision.

The objective is expected cost

State the cost function explicitly — this is what separates a Staff answer from “pick a threshold around 0.5”:

E[cost] = C_FP · P(false positive) + C_FN · P(false negative)

The optimal decision threshold is the point where the marginal cost of blocking one more transaction equals the marginal cost of allowing it — emphatically not 0.5 on a raw model output. Because C_FN scales with the transaction amount, the threshold itself must scale with amount (more on this in Step 5).
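As a rough illustration of deriving a threshold from dollars rather than from 0.5, the sketch below sweeps candidate thresholds over a handful of scored transactions and keeps the cheapest one. Everything in it is an assumption for illustration: the function names, the sample data, the $20 chargeback fee (picked from inside the ~$15-25 range quoted earlier), and the simplification that a false decline costs 10% of the amount.

```python
from typing import List, Tuple

# Each sample: (calibrated fraud probability, is_fraud, amount in dollars).
Sample = Tuple[float, bool, float]

CHARGEBACK_FEE = 20.0   # assumed, inside the ~$15-25 range quoted in the text
MARGIN_RATE = 0.10      # assumed: a false decline loses 10% of the amount


def total_cost(samples: List[Sample], threshold: float) -> float:
    """Dollar cost of blocking every transaction scored at or above threshold."""
    cost = 0.0
    for p_fraud, is_fraud, amount in samples:
        blocked = p_fraud >= threshold
        if blocked and not is_fraud:
            cost += MARGIN_RATE * amount       # false positive: lost margin
        elif not blocked and is_fraud:
            cost += amount + CHARGEBACK_FEE    # false negative: loss + fee
    return cost


def best_threshold(samples: List[Sample], candidates: List[float]) -> float:
    """Pick the candidate threshold with the lowest total dollar cost."""
    return min(candidates, key=lambda t: total_cost(samples, t))


# Hypothetical labeled history.
samples = [
    (0.02, False, 40.0),
    (0.10, False, 900.0),
    (0.15, True, 1200.0),
    (0.60, True, 30.0),
    (0.70, False, 25.0),
]
candidates = [0.05, 0.12, 0.5, 0.9]
chosen = best_threshold(samples, candidates)
```

On this toy data the cheapest candidate is 0.12, far below 0.5, because missing the single $1,200 fraud costs far more than the small false decline it takes to catch it.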

Idempotency. Scoring is keyed to the payment-attempt id. Network retries and client re-sends are common; without an idempotency key, a retry inflates velocity counters and can flip a decision between attempts of the same logical payment. Same attempt id → same cached decision and no double-counted state.
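A minimal sketch of that idempotency rule, assuming an in-memory dictionary stands in for the shared store. The class name, the `tok_a` card token, and the toy scoring function are hypothetical. A production version would need an atomic set-if-absent in a shared store (for example Redis `SET` with the `NX` option) so that two concurrent retries cannot both score.

```python
from typing import Callable, Dict


class IdempotentScorer:
    """Caches one decision per payment-attempt id (in-memory stand-in for a KV store)."""

    def __init__(self, score_fn: Callable[[dict], str]) -> None:
        self._score_fn = score_fn
        self._decisions: Dict[str, str] = {}
        self.velocity: Dict[str, int] = {}    # attempts counted per card

    def decide(self, attempt_id: str, txn: dict) -> str:
        cached = self._decisions.get(attempt_id)
        if cached is not None:
            return cached                      # retry: same answer, no new state
        card = txn["card"]
        self.velocity[card] = self.velocity.get(card, 0) + 1
        decision = self._score_fn(txn)
        self._decisions[attempt_id] = decision
        return decision


# Toy scoring function, purely illustrative.
scorer = IdempotentScorer(lambda txn: "block" if txn["amount"] > 1000 else "allow")
first = scorer.decide("attempt-1", {"card": "tok_a", "amount": 50})
retry = scorer.decide("attempt-1", {"card": "tok_a", "amount": 50})
```

The retry returns the cached verdict and the card's velocity counter stays at one.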

Step 2: Hot-path serving architecture

The request path

payment gateway
       │ (sync gRPC, deadline = 80ms)
       ▼
┌──────────────────────────────────────────────┐
│ risk scoring service                         │
│                                              │
│ parse + idempotency check                    │
│      │                                       │
│      ▼                                       │
│ parallel feature fetch ──► online feature    │
│ (fan-out, bounded set)     store (Redis/KV)  │
│      │                                       │
│      ▼                                       │
│ assemble vector ─► model (co-located) ─►     │
│ calibrate ─► threshold ─► action             │
└───────────────┬──────────────────────────────┘
                │ decision (allow/review/block)
      ┌─────────┴─────────┐
      ▼                   ▼
back to gateway      async emit
                     {features, score, action}
                     → log stream (training)

The scoring call is a synchronous gRPC request with a hard deadline. Features are fetched in parallel (fan-out), the model runs in-process, and the full feature vector plus score plus action is emitted asynchronously to a log stream — that log is the raw material for training and for online/offline parity checks (Step 7).

Online feature store

An in-memory key-value store (Redis, or a DynamoDB-class managed KV) holds velocity counters and per-entity state. Redis GET runs at roughly p50 ~0.12ms and p99 ~0.45ms, so the cost of reading any single feature is trivial, as long as fetches are parallelized rather than issued serially. Ten serial reads at 0.45ms each add up to 4.5ms of avoidable tail; ten parallel reads take about as long as the slowest single read, on the order of 0.45ms. Counters are updated with atomic operations (INCR, hash-field increments) so the current attempt is reflected without a read-modify-write race between concurrent transactions on the same card.
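The fan-out pattern can be sketched with `asyncio.gather`. This is an illustration only: `FAKE_STORE`, the key naming scheme, and the 1ms simulated latency are assumptions, and a real service would call its KV client here instead of sleeping.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory stand-in for the online feature store.
FAKE_STORE: Dict[str, float] = {
    "card:tok_a:count_1h": 3.0,
    "device:d1:distinct_cards_1m": 14.0,
    "ip:10.0.0.1:decline_rate_24h": 0.4,
}


async def fetch_feature(key: str) -> float:
    """Simulate one KV read; missing keys default to 0.0."""
    await asyncio.sleep(0.001)    # stand-in for network latency
    return FAKE_STORE.get(key, 0.0)


async def fetch_all(keys: List[str]) -> Dict[str, float]:
    """Issue every read concurrently; wall time tracks the slowest read, not the sum."""
    values = await asyncio.gather(*(fetch_feature(k) for k in keys))
    return dict(zip(keys, values))


keys = list(FAKE_STORE) + ["merchant:m9:count_1h"]
features = asyncio.run(fetch_all(keys))
```

`asyncio.gather` returns results in the order the awaitables were passed, which is what makes the `zip` back onto `keys` safe.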

Model co-location

The model runs in the serving process (in-process library, or a sidecar on the same host) so the hot path never makes a network hop to a separate model server. A cross-host round trip to a model-serving cluster would add 1-10ms of tail and a new failure dependency for a model that itself only needs 1-2ms to run. Co-locate it.

Deadline & graceful degradation

Every request carries a hard deadline. If feature fetches or inference threaten to blow the budget, we abort the ML path and fall back to a deterministic rule engine — blocklists, hard velocity caps, amount and geo rules. The payment still resolves with a real decision; it is never left hanging on a slow model. The fallback is also the floor of safety: even when the model is healthy, the rules run as guardrails.
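One way to express the deadline-plus-fallback behavior is `asyncio.wait_for` around the ML path, with the rules evaluated up front so they serve as both guardrail and fallback. The specific rules, field names, and the 80ms default below are placeholders, not a recommended policy.

```python
import asyncio
from typing import Tuple

SEVERITY = {"allow": 0, "review": 1, "block": 2}


def rule_fallback(txn: dict) -> str:
    """Deterministic rules: blocklist, hard velocity cap, amount rule (all hypothetical)."""
    if txn["card"] in {"tok_blocked"}:
        return "block"
    if txn["attempts_1m"] > 10:
        return "block"
    if txn["amount"] > 2000:
        return "review"
    return "allow"


async def ml_decision(txn: dict, delay_s: float) -> str:
    """Stand-in for feature fetch + inference; delay_s simulates a slow dependency."""
    await asyncio.sleep(delay_s)
    return "allow"


async def decide(txn: dict, delay_s: float, deadline_s: float = 0.08) -> Tuple[str, str]:
    """Return (action, source); the rules decide alone if the ML path misses the deadline."""
    rule_action = rule_fallback(txn)    # cheap, always computed
    try:
        model_action = await asyncio.wait_for(ml_decision(txn, delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        return rule_action, "rules"
    # Healthy path: rules still act as a guardrail, so take the stricter verdict.
    stricter = max(rule_action, model_action, key=SEVERITY.__getitem__)
    return stricter, "model+rules"


healthy = asyncio.run(decide({"card": "tok_a", "attempts_1m": 1, "amount": 25.0}, delay_s=0.0))
timed_out = asyncio.run(decide({"card": "tok_a", "attempts_1m": 14, "amount": 25.0}, delay_s=0.2))
```

In the timed-out case the payment still resolves: the velocity cap fires and the rules return a block without waiting for the model.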

p99 latency budget

| Stage | Budget (p99) | Notes |
| --- | --- | --- |
| Network in + out | 10-20ms | Gateway ↔ risk service round trip |
| Parallel feature fetch | 2-5ms | Bounded critical set, fanned out |
| Feature assembly | 2-4ms | Vector build, encoding |
| Model inference | 1-2ms | Co-located, GBDT or small deep net |
| Calibration + threshold | <1ms | Lookup + comparison |
| Headroom for tail | remainder | Keeps p99 inside 100ms |

The lesson of the budget: inference is the cheapest line item. The discipline is bounding the online feature set and parallelizing its fetch, not optimizing the model’s FLOPs.

Step 3: Feature engineering & the entity graph

Velocity & entity state

The richest cheap signal is velocity — rates of activity across multiple entities at multiple time windows:

Entities: card, device, IP, email, merchant.
Windows: 1 minute / 1 hour / 24 hours / 7 days.
Aggregations: count of attempts, distinct cards seen per device, distinct merchants per card, decline rate, sum of amounts.

A device that just tried fourteen distinct cards in two minutes is a near-certain card-testing attack, and that signal is a handful of counter reads. Network-wide per-card state is the platform’s structural advantage: because we see a card across the entire network, not just one merchant, we know whether this card has been probed elsewhere in the last hour and in what pattern — something no single merchant can see.
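A minimal in-process sketch of one such velocity feature, distinct cards per device over a sliding window. The class name and the sample timestamps are hypothetical, and the sketch assumes events arrive in timestamp order. In production this state would live in the online store rather than in process memory.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class DistinctCardsPerDevice:
    """Sliding-window count of distinct cards seen on each device."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._events: Dict[str, Deque[Tuple[float, str]]] = {}

    def record(self, device: str, card: str, ts: float) -> None:
        """Append an attempt; assumes timestamps are non-decreasing per device."""
        self._events.setdefault(device, deque()).append((ts, card))

    def distinct_cards(self, device: str, now: float) -> int:
        events = self._events.get(device, deque())
        while events and events[0][0] <= now - self.window_s:
            events.popleft()    # evict events older than the window
        return len({card for _, card in events})


counter = DistinctCardsPerDevice(window_s=120.0)
for i in range(14):
    counter.record("device-1", f"card-{i}", ts=float(i * 5))    # 14 cards in 65 seconds
in_window = counter.distinct_cards("device-1", now=70.0)
later = counter.distinct_cards("device-1", now=150.0)
```

At `now=70.0` all fourteen cards fall inside the two-minute window, which is the card-testing signature described above. By `now=150.0` the oldest seven attempts have aged out.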

Categorical embeddings

High-cardinality categorical fields — issuer BIN, merchant id, country, MCC (merchant category code) — are poorly served by one-hot encoding (millions of sparse columns, no generalization). Learn dense embeddings so the model can express “this issuer+country pair behaves like that one,” generalizing across rare combinations instead of treating each as an isolated indicator.

Graph associations

Organized fraud is a ring: many accounts sharing a few devices, IPs, or cards. Catch it with graph features — neighborhood aggregates over a transaction graph (shared devices/IPs/cards across accounts), computed with GNN-style methods (GraphSAGE, R-GCN). Full graph traversal is far too slow for a sub-100ms hot path, so these neighborhood embeddings are computed offline in batch and cached, then read like any other feature at decision time.

Online-cheap vs precomputed, and point-in-time correctness

| Feature group | Entity × window | Online-cheap or precomputed | How |
| --- | --- | --- | --- |
| Velocity counters | card/device/IP/merchant × 1m–7d | Online-cheap | Atomic increments, read at decision |
| Current-attempt fields | this txn (amount, MCC, geo) | Online-cheap | Parsed from the request |
| Categorical embeddings | issuer/merchant/country/MCC | Precomputed | Trained offline, cached, refreshed on a lag |
| Graph / ring features | account neighborhoods | Precomputed | Batch GNN, cached, refreshed on a lag |

Two non-negotiables. Point-in-time correctness: training features must be reconstructed as of the decision time — never leak a counter value or a label that only existed after the transaction, or you train a model that can’t exist in production. Feature parity: the online and offline definitions of every feature must be byte-for-byte identical, or the model sees different inputs in training and serving and silently degrades.

Step 4: Modeling under extreme imbalance + novelty

Supervised core

For tabular fraud signals, gradient-boosted decision trees (XGBoost-class) are the proven workhorse: strong on heterogeneous tabular features, fast to train, fast to score, and naturally robust to scale and missing values. At very large platforms with rich signal, the heavy model has moved toward deep architectures — Stripe’s Radar, for example, moved off pure XGBoost toward a ResNeXt-inspired deep network to scan far more signals per transaction. The right default in an interview is GBDT, with deep nets named as the scale-up path.

Handling extreme imbalance

Fraud is a tiny fraction of attempts, so the modeling problem is dominated by imbalance:

| Technique | Pro | Con |
| --- | --- | --- |
| Cost-sensitive learning (weight FN ≫ FP) | Directly encodes the real objective; no synthetic data | Needs a defensible cost ratio |
| Class weights | Simple, built into most libraries | Coarser than per-example dollar cost |
| Naive oversampling / SMOTE | Lifts recall | Often hurts precision → more false declines |
| Undersampling negatives | Faster training | Throws away signal, distorts base rate |

Prefer cost-sensitive learning — weight a false negative far more heavily than a false positive in the loss — over resampling. SMOTE-style synthetic minority oversampling tends to lift recall at the cost of precision, and in this domain lost precision means false declines, the expensive, invisible error.
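A sketch of what per-example dollar weighting looks like as a loss, assuming the same illustrative costs as before ($20 fee, 10% margin). The function names are hypothetical. In practice you would usually pass weights like these to a library's per-example weight argument rather than hand-rolling the loss.

```python
import math
from typing import List


def example_weight(is_fraud: bool, amount: float, fee: float = 20.0,
                   margin_rate: float = 0.10) -> float:
    """Dollar cost of misclassifying this example (fee and margin_rate are assumed)."""
    return amount + fee if is_fraud else margin_rate * amount


def weighted_log_loss(probs: List[float], labels: List[bool],
                      amounts: List[float]) -> float:
    """Log loss where each example counts in proportion to its dollar cost."""
    total, weight_sum = 0.0, 0.0
    for p, y, amount in zip(probs, labels, amounts):
        w = example_weight(y, amount)
        p = min(max(p, 1e-12), 1 - 1e-12)    # avoid log(0)
        total += -w * math.log(p if y else 1 - p)
        weight_sum += w
    return total / weight_sum


labels = [False, False, True]
amounts = [100.0, 100.0, 100.0]
miss_fraud = weighted_log_loss([0.1, 0.1, 0.1], labels, amounts)     # fraud scored low
false_alarm = weighted_log_loss([0.9, 0.1, 0.9], labels, amounts)    # one legit scored high
```

With these weights, scoring the fraud low is penalized far more heavily than scoring one legitimate transaction high, which is the asymmetry the cost function asks for.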

Unsupervised novelty layer

The supervised model can only catch patterns it has labels for. The adversary’s whole job is to invent patterns you’ve never seen. So run an anomaly layer in parallel — an isolation forest, or an autoencoder scored on reconstruction error — that flags transactions far from the normal manifold even with no fraud label. This is the early-warning sensor for novel attacks.

Why not just one model

Combine three signals into an ensemble, each with a distinct job:

Supervised score — the calibrated probability of fraud on known patterns.
Anomaly score — novelty / never-seen-before signal.
Hard rules — fast, auditable guardrails and the timeout fallback from Step 2.

p_sup = supervised_model(x)        # raw supervised score, calibrated in Step 5
a_score = anomaly_model(x)         # reconstruction error / isolation-forest score

if rule_hit(x) or a_score > A_HIGH:
    action = escalate(x)           # review / step-up or block; overrides the model
else:
    score = calibrate(p_sup)       # calibrated fraud probability (base signal)
    action = map_to_action(score, amount, segment)

Metric discipline

Optimize PR-AUC and recall at a fixed precision, plus dollar-weighted loss — never plain accuracy or ROC-AUC. With a negative class that dominates, ROC-AUC looks great while the model is useless at the operating point you actually care about; precision-recall and dollar-weighted metrics tell the truth about the rare, expensive positives.
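Recall at a fixed precision is simple to compute directly from ranked scores. The sketch below is one possible implementation with made-up scores and labels. It evaluates only at the end of each tie group so that the returned threshold is well defined.

```python
from typing import List, Tuple


def recall_at_precision(scores: List[float], labels: List[bool],
                        min_precision: float) -> Tuple[float, float]:
    """Best recall over thresholds whose precision >= min_precision.

    Returns (recall, threshold), or (0.0, inf) if no threshold qualifies.
    """
    best = (0.0, float("inf"))
    total_pos = sum(labels)
    if total_pos == 0:
        return best
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    for i, (score, is_fraud) in enumerate(ranked):
        tp += is_fraud
        fp += not is_fraud
        # Skip until the last item of a tie group so the cut is a valid threshold.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best[0]:
            best = (recall, score)
    return best


scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20, 0.10, 0.05]
labels = [True, True, False, True, False, False, False, False]
recall, threshold = recall_at_precision(scores, labels, min_precision=0.75)
```

On this toy data, flagging everything scored 0.60 or higher catches all three frauds with one false alarm, so precision is exactly 0.75 and recall is 1.0.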

Step 5: Calibration & the allow / review / block thresholds

Calibration

A gradient-boosted model's raw output is a score, not a trustworthy probability: a 0.7 does not mean 70% fraud, and cost-sensitive weighting or resampling during training pushes the output further from the true base rate. To compute expected cost you need a real probability, so calibrate: fit Platt scaling (a logistic on the raw score) or isotonic regression on held-out data so that a calibrated 0.02 means a genuine 2% fraud probability. Only after calibration can the threshold be derived from the cost function rather than guessed.
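For intuition, isotonic calibration can be written in a few lines with the pool-adjacent-violators algorithm. This is a teaching sketch on toy held-out data, with hypothetical function names; a real system would use a library implementation and far more data, which also avoids the hard 0.0 and 1.0 outputs a tiny sample produces.

```python
from bisect import bisect_left
from typing import List, Tuple


def fit_isotonic(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators: returns (upper score of block, calibrated probability)."""
    blocks: List[List[float]] = []    # each block: [label_sum, count, upper_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1.0, score])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]


def calibrate(raw_score: float, model: List[Tuple[float, float]]) -> float:
    """Step lookup: probability of the first block whose upper bound covers raw_score."""
    uppers = [upper for upper, _ in model]
    idx = min(bisect_left(uppers, raw_score), len(model) - 1)
    return model[idx][1]


# Hypothetical held-out raw scores and fraud labels.
held_out_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
held_out_labels = [0, 0, 1, 0, 0, 1, 1, 1]
iso_model = fit_isotonic(held_out_scores, held_out_labels)
p_mid = calibrate(0.45, iso_model)
```

The raw scores 0.3 to 0.5 contain one fraud in three examples, so they are pooled into a single block and a raw 0.45 maps to a calibrated probability of one third.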

From score to action

Two thresholds carve the calibrated score line into three regions:

0 ──────────────┬──────────────────────┬────────── 1
     ALLOW    T_low      REVIEW     T_high  BLOCK
 (below T_low)      (step-up / 3DS)     (above T_high)
  low p_fraud       uncertain middle     high p_fraud

T_high and T_low come from the cost curve, not intuition. The Bayes-optimal decision: block when the expected cost of allowing exceeds the expected cost of blocking —

block when p_fraud · amount > (1 − p_fraud) · margin_lost

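Because the left side scales with amount while margin_lost typically does not grow in proportion (much of it is customer trust and churn risk, not per-sale margin), the threshold moves with the transaction value. Solving the inequality for the break-even point gives p* = margin_lost / (amount + margin_lost), so you should block a $5,000 transaction at a much lower fraud probability than a $5 one. A single global probability threshold is wrong by construction.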
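A small worked example of the block rule, assuming a flat $30 cost per false decline. That figure is purely illustrative. Rearranging the inequality gives a break-even probability of margin_lost / (amount + margin_lost). Note that if the false-decline cost were strictly proportional to the amount, this break-even would be the same for every transaction; the amount scaling comes from the part of the cost that does not grow with ticket size.

```python
def block_threshold(amount: float, margin_lost: float) -> float:
    """Break-even fraud probability where p * amount == (1 - p) * margin_lost."""
    return margin_lost / (amount + margin_lost)


def should_block(p_fraud: float, amount: float, margin_lost: float) -> bool:
    """The block rule from the text, applied to one transaction."""
    return p_fraud * amount > (1.0 - p_fraud) * margin_lost


FALSE_DECLINE_COST = 30.0    # assumed flat dollar cost of wrongly blocking a good customer

small_ticket = block_threshold(5.0, FALSE_DECLINE_COST)       # 30 / 35
large_ticket = block_threshold(5000.0, FALSE_DECLINE_COST)    # 30 / 5030
```

Under this assumption a $5 purchase is only worth blocking above roughly 86% fraud probability, while a $5,000 purchase crosses break-even at about 0.6%.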

The review band

The middle region is where step-up earns its keep. Routing the uncertain band to a 3DS challenge inconveniences a good customer minimally (one extra auth step) while, on success, shifting chargeback liability to the issuer. That turns a hard, lossy block/allow decision into a graded response — the third action from Step 0, now operationalized.

Per-segment thresholds. Different merchants, geographies, and product types have different fraud base rates and different margins, so T_low and T_high are policy-driven per segment, not one global pair of numbers.
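A sketch of how a per-segment policy table might drive the three-way decision. The segment names and threshold values are invented for illustration, and the amount-dependent adjustment discussed above is left out to keep the example short.

```python
from typing import Dict, Tuple

# Hypothetical per-segment (T_low, T_high) policy; values are illustrative only.
SEGMENT_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "default": (0.02, 0.30),
    "digital_goods": (0.01, 0.15),
    "grocery": (0.05, 0.50),
}


def map_to_action(p_fraud: float, segment: str) -> str:
    """Map a calibrated fraud probability to allow / review / block for a segment."""
    t_low, t_high = SEGMENT_THRESHOLDS.get(segment, SEGMENT_THRESHOLDS["default"])
    if p_fraud < t_low:
        return "allow"
    if p_fraud > t_high:
        return "block"
    return "review"    # uncertain band: step-up / 3DS or analyst queue


same_score = {seg: map_to_action(0.03, seg) for seg in SEGMENT_THRESHOLDS}
```

The same calibrated 3% is allowed for the low-risk segment but routed to review for the two tighter ones, which is the point of keeping thresholds as policy rather than constants.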

Watch both sides. Monitor false-decline rate and approval rate right next to fraud rate. Over-tightening to crush fraud silently destroys revenue — the failure mode juniors never instrument because the fraud dashboard looks fantastic while the business bleeds.

Step 6: Deep dive on delayed, adversarial & biased labels


This is where a Staff answer is won. Everything above assumes you have clean labels to train on. You do not. Fraud labels are delayed, selection-biased, and adversarial all at once, and handling that is the hardest and most distinctive part of the system.

Label delay & right-censoring

A fraud label arrives as a chargeback, which lands weeks to ~90+ days after the transaction. So a transaction that looks “good” today is really “no chargeback yet” — not confirmed legitimate.

t0 ─────────────────────────────────────────────────────► time
│  txn        ~days             ~weeks         ~90+ days
│  scored     dispute window    chargeback     window closes
│             opens             can post       → label final
│
│  recent data here is RIGHT-CENSORED:
│  absence of a chargeback is not yet a "good" label

Recent data is right-censored: training on the last few weeks naively under-counts fraud, because the chargebacks simply haven’t posted yet. Treat label maturity explicitly — only treat a transaction as a confirmed negative once its dispute window has closed, and model the censoring rather than pretending recent “good” labels are final.
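Label maturity can be made explicit with a three-valued label. The sketch below is an assumption-laden illustration: the 120-day window is a placeholder to be replaced with the actual network dispute window, and the function name is hypothetical.

```python
from typing import Optional

DISPUTE_WINDOW_DAYS = 120    # assumed placeholder; use the card network's real window


def training_label(age_days: int, chargeback_posted: bool) -> Optional[int]:
    """1 = fraud, 0 = matured negative, None = right-censored (not yet a label)."""
    if chargeback_posted:
        return 1
    if age_days >= DISPUTE_WINDOW_DAYS:
        return 0
    return None


# (age in days, chargeback posted) for a few hypothetical transactions.
history = [(200, False), (30, False), (45, True), (120, False)]
labeled = [training_label(age, cb) for age, cb in history]
usable = [label for label in labeled if label is not None]
```

The 30-day-old transaction with no chargeback is neither positive nor negative yet, so it is held out of the settled training set instead of being counted as good.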

Selection bias / the feedback loop

The deeper trap: you only observe outcomes for transactions you ALLOWED. A transaction you blocked never reaches the network, so it never produces a chargeback — it has no ground truth at all. Train naively and the model learns from a censored slice of its own past decisions, growing ever more confident about a boundary it can no longer see past.

Mitigations:

Reject inference — statistically infer the likely outcome of rejected transactions so they don't vanish from the training distribution.
Randomized exploration budget — deliberately allow through a small, randomized fraction of transactions you would have blocked, accepting bounded fraud loss to buy an unbiased label signal at the boundary. This is the only way to know what's happening in the region you keep blocking.
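The exploration budget can be sketched as a randomized override that records how likely each override was. Inverse-propensity weighting is not named in the text above; it is a common companion technique, shown here as an assumption about how the explored labels might be used. The 1% rate and function name are illustrative, and a real system would also cap exploration by amount.

```python
import random
from typing import Tuple

EXPLORE_RATE = 0.01    # assumed budget: 1% of would-be blocks are let through


def apply_exploration(action: str, rng: random.Random) -> Tuple[str, float]:
    """Return (final_action, training_weight).

    A would-be block is allowed with probability EXPLORE_RATE. Its eventual label
    then stands in for 1 / EXPLORE_RATE similar blocked transactions.
    Transactions that were going to be allowed or reviewed are unchanged.
    """
    if action == "block" and rng.random() < EXPLORE_RATE:
        return "allow", 1.0 / EXPLORE_RATE
    return action, 1.0


rng = random.Random(7)    # seeded only to make the example repeatable
results = [apply_exploration("block", rng) for _ in range(100_000)]
explored = sum(1 for final, _ in results if final == "allow")
```

Roughly one in a hundred would-be blocks is allowed through, and each of those carries a weight of about 100 when its label eventually matures.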

Adversarial drift

This is concept drift caused by an adversary, not by nature. Fraudsters actively probe your boundary, adapt, and re-attack. A static model decays fast because the decision boundary is a live target someone is paid to defeat. You cannot retrain quarterly and expect to keep up.

Two-speed architecture

Resolve the tension between “labels take 90 days” and “the adversary moves in hours” by splitting the system into two layers running at different clocks:

| Layer | Trained on | Cadence | Job |
| --- | --- | --- | --- |
| Slow stable | Settled, fully-labeled history | Periodic, heavy retrain | Accurate base rate on known fraud patterns |
| Fast adaptive | Recent signals, analyst labels, anomalies | Hours | React to emerging attacks before chargebacks exist |

The fast layer is rules + a frequently-retrained light model + the anomaly detector from Step 4. It reacts to a new card-testing pattern within hours — long before a single chargeback for that pattern has posted. The slow layer provides the stable, well-calibrated backbone trained only on labels that have fully matured.

Keep a clean evaluation signal

Maintain a small randomized challenge / holdout stream and watch score- and feature-distribution drift. A quiet shift in the input distribution is the early warning that fires before the fraud rate visibly moves — by the time fraud rate spikes, you’ve already lost weeks. Input drift is the smoke detector; fraud rate is the fire alarm.
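One common way to quantify score-distribution drift is the population stability index (PSI). It is not prescribed by the text above; treat it as one reasonable choice among several. The sketch assumes scores in [0, 1] and uses synthetic samples. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, which is a starting point for tuning rather than a standard.

```python
import math
from typing import List


def psi(expected: List[float], actual: List[float], bins: int = 10) -> float:
    """Population stability index between two score samples in [0, 1]."""

    def histogram(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[min(int(v * bins), bins - 1)] += 1
        # Small floor keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 1000 for i in range(1000)]               # uniform scores
same_shape = [i / 500 for i in range(500)]               # same shape, fewer samples
shifted = [0.3 + 0.7 * i / 1000 for i in range(1000)]    # mass moved upward

stable_psi = psi(baseline, same_shape)
drift_psi = psi(baseline, shifted)
```

The same-shape sample scores near zero, while the shifted sample, which has emptied the three lowest bins, scores far above any plausible alert level.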

Frequent retrain, gated rollout

“The ability to retrain on new data is what enables adaptation to new fraud,” so retrain often and roll models out rapidly. But a bad fraud model is an instant, network-wide loss, so every model ships behind shadow + canary (Step 7). Fast iteration and safe rollout are not in tension — the canary is what lets you iterate fast without betting the network on each new model.

Step 7: Reliability, deployment & human-in-the-loop

Safe model rollout

Never flip a fraud model globally in one step. Roll out in stages:

1. Shadow — the new model scores live traffic, but its decisions are not applied; compare its would-be decisions against production.

2. Canary — apply it to a small percentage of traffic with automated rollback triggered by any regression in fraud rate or decline rate.

3. Ramp — widen only after the canary holds on both fraud and false-decline metrics.

Review queue

The REVIEW action feeds a human-analyst queue with an SLA. This closes the label loop much faster than chargebacks: an analyst’s verdict becomes a high-quality label within hours, not 90 days, and feeds straight back into the fast-layer training set. Human review is not just risk mitigation — it’s your fastest clean label source.

Observability

Online/offline parity guard — continuously verify that features served online equal features computed offline for the same event. Training/serving skew here silently degrades the model and is a classic, hard-to-find production bug.
Idempotency & audit — log every decision with its exact feature vector, model version, and reason, both as dispute / representment evidence and to debug adversarial probing after the fact.
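The parity guard above reduces to comparing two feature dictionaries for the same event. A minimal sketch, with hypothetical feature names and an arbitrary tolerance:

```python
import math
from typing import Dict, List


def parity_mismatches(online: Dict[str, float], offline: Dict[str, float],
                      rel_tol: float = 1e-9) -> List[str]:
    """Names of features missing on one side or differing beyond rel_tol."""
    mismatched = []
    for name in sorted(set(online) | set(offline)):
        if name not in online or name not in offline:
            mismatched.append(name)
        elif not math.isclose(online[name], offline[name], rel_tol=rel_tol):
            mismatched.append(name)
    return mismatched


# Same logged event, as served online vs. recomputed offline (made-up values).
online_row = {"count_1h": 3.0, "decline_rate_24h": 0.4, "amount": 25.0}
offline_row = {"count_1h": 4.0, "decline_rate_24h": 0.4}
skewed = parity_mismatches(online_row, offline_row)
```

Here the guard reports `amount` (missing offline) and `count_1h` (different values), the two kinds of skew that would otherwise degrade the model without any visible error.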

Dashboards & alerts:

p99 latency and fallback-invocation rate (how often the rules fired instead of the model)
approval rate, false-decline rate, fraud rate (settled + censoring-adjusted estimate)
score-distribution drift, feature-freshness lag

Privacy / compliance. Stay inside PCI-DSS scope: use network tokenization so the hot path handles tokens, not raw PANs, and respect data residency by computing region-local features where required.

Step 8: Tradeoffs, failure modes & what I'd cut

Tensions I’m holding

| Tension | Pulls toward | Resolution |
| --- | --- | --- |
| Latency vs feature richness | Graph / network features raise precision but cost p99 | Precompute + cache; bound the online critical set |
| Fraud loss vs false declines | Every threshold move trades one for the other | Derive thresholds from the cost curve, per segment |
| Fast adaptation vs stability | Frequent retrains chase drift but risk bad models | Two-speed layers + shadow/canary gating |

The right answer to the fraud-vs-decline tension is always the cost curve. The wrong answer — the one that sounds responsible and quietly destroys revenue — is “just raise the threshold.”

What I’d cut for v1

Ship the supervised GBDT + rules + calibrated two-threshold decision + deterministic rule fallback first. Defer the GNN/graph layer and the reject-inference loop to v2, once the serving SLO is proven in production. The hot-path latency budget and the fallback are non-negotiable from day one; the fancier features are precision gains you add after the gate is reliably fast.

How I’d know it’s failing

Silent over-blocking — fraud rate looks great while revenue quietly drops. Caught by the false-decline / approval-rate dashboards, never the fraud dashboard.
Label feedback collapse — the model stops seeing the fraud it blocks and drifts toward over-confidence. Caught by the randomized exploration holdout.
Adversarial boundary mapping — attackers binary-search your threshold with small probes. Mitigated by randomized challenge thresholds and frequent retrains so the boundary keeps moving under their feet.


Summary

What a Staff answer puts on the table, in four differentiators:

It's a hard real-time gate on the money path, not a best-effort ranker. A deterministic fallback and a p99 budget are non-negotiable, and the payment must resolve even when the model doesn't.
The decision is calibrated expected-cost minimization with three actions (allow / review-step-up / block), where the thresholds scale with transaction amount — not a fixed 0.5 on an uncalibrated score.
Labels are delayed, adversarial, and selection-biased. The standout move is the two-speed loop (fast adaptive + slow stable) plus a randomized exploration budget to keep an unbiased signal.
Staff signal = naming the failure modes juniors miss: silent over-blocking destroying revenue, feedback-loop label collapse, and an adversary actively mapping your boundary.

If you remember nothing else: optimize dollars of expected cost under a 100ms deadline against an adversary who gets weeks of label delay to study you.


Rubric — Senior vs Staff

| Dimension | Senior signal | Staff signal |
| --- | --- | --- |
| Problem framing & objective | States allow/block, minimizes fraud loss, picks a reasonable threshold and AUC/recall target | Frames as expected-cost minimization with explicit FP (lost good txn + customer trust) vs FN (chargeback + fees + fines) dollar costs; defines allow/review/block and a step-up action that shifts liability |
| Real-time serving & latency | Synchronous service, model co-located, caches features in Redis, targets ~100ms | Explicit p99 budget breakdown (parallel feature fetch 2-5ms, inference 1-2ms), bounds feature count, parallelizes fetches, sets a hard deadline, and degrades to rules on timeout so the payment always resolves |
| Feature engineering | Velocity counters, device fingerprint, amount, MCC, geo, card-on-file age | Multi-entity velocity (card/device/IP/merchant) at multiple windows, network-wide per-card state, learned categorical embeddings (issuer/merchant/country), graph associations for fraud rings, and is explicit about which are cheap online vs precomputed |
| Modeling & imbalance | Gradient-boosted trees, handles imbalance via class weights or resampling, mentions calibration | Supervised GBDT/deep net for the signal + unsupervised (isolation forest / autoencoder) for novel attacks; cost-sensitive learning over resampling; calibrates scores (Platt/isotonic) so the threshold is a real probability |
| Labels & training loop | Knows chargebacks are delayed, retrains periodically on labeled history | Quantifies label delay (weeks to ~90+ days), handles right-censoring, names selection bias (labels only for allowed txns) and proposes reject inference / exploration, and dedicates a fast layer to adversarial drift |
| Adversarial & drift robustness | Mentions fraudsters adapt and models need retraining | Separates a fast-adapting layer (rules + frequently retrained model) from a slow stable layer; monitors score/feature drift; randomized challenge holdouts to keep a clean signal; treats the boundary as something an adversary probes |
| Reliability, idempotency & ops | Idempotency key on the payment, logging, monitors fraud rate | Idempotent scoring keyed to the payment attempt, shadow/canary for new models, online-offline feature parity (point-in-time correctness), human review queue with SLA, and dashboards on false-decline rate + approval rate, not just fraud rate |
