AI System DesignStaffVector SearchANN Retrieval

Design a Vector Search & Embedding Retrieval System at Scale

A Staff-level walkthrough of the vector search retrieval substrate behind Google, OpenAI, and Meta's RAG, search, and recsys stacks. It builds a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — treating recall as a first-class SLO and quantization-plus-rescore as the economic enabler — and covers filtered + hybrid retrieval, freshness, and the re-embedding migration story that separates staff-level answers.

Level: Staff
Category: AI Infrastructure · Vector Retrieval
Interview time: 60 min

100% free · No login required

WHAT THIS QUESTION TESTS

·Index family choice (HNSW vs IVF-PQ vs DiskANN) on the recall-latency curve

·Sharding + scatter-gather with partial top-k merge at billion-vector scale

·Quantization (SQ8/PQ/RaBitQ) paired with a full-precision rescore stage

·Filtered + hybrid (dense + sparse / RRF) retrieval with recall as a first-class SLO

★ STAFF-LEVEL SIGNALS

★Quotes QPS at a fixed recall@k (e.g. 8K QPS @ 0.98 recall@10), never QPS alone

★Plans sharding from capacity math: 1B fp32 HNSW = ~6–8 TB, infeasible on one node

★Handles the selectivity cliff: low-selectivity pre-filters collapse HNSW toward brute force

★Owns the re-embedding migration: dual-index, shadow-read, multi-day cutover on model upgrade

Scope & ambiguity

Let me frame this as a standalone retrieval substrate — the service that turns a query into top-k nearest neighbors over billions of embeddings — not an end-to-end RAG app. I’ll own the index, the storage hierarchy, sharded scatter-gather serving, filtered and hybrid retrieval, and the freshness/migration story. I’ll explicitly exclude LLM generation, prompt assembly, and downstream ranking business logic — those are consumers of my top-k. The single most important reframing: correctness here is probabilistic. We don’t return the nearest neighbors, we return them with a target recall@k SLO, and almost every design knob is really a recall-vs-latency-vs-dollars trade.

This is the kind of system that sits underneath search, RAG, and recsys at Google, OpenAI, Meta, and dedicated vendors like Pinecone — a shared substrate many teams query. Interviewers use it to probe whether you can run a sharded, replicated, read-heavy datastore where the answer itself is an approximation with an error budget.

Service boundary

In scope: insert/update/delete vectors, top-k ANN query, metadata filtering, hybrid dense+sparse fusion, optional rerank, freshness, and the re-embedding migration when the embedding model changes.

Out of scope: the LLM that consumes the chunks, prompt construction, the chunking policy (we accept chunks), and business ranking rules layered on top of retrieval.

Use cases

Semantic search — natural-language query over a corpus.
RAG retrieval — fetch grounding chunks for an LLM.
Recsys candidate generation — user/item embedding → candidate set for a downstream ranker.
Dedup / near-duplicate detection — find vectors within a similarity threshold.

Service contract

Given a query embedding (or text we encode server-side) plus optional metadata filters, return the top-k approximate nearest neighbors at a stated recall SLO, within a p99 latency budget, respecting tenant/ACL isolation.

Who asks this & what they probe

Role

Focus

What they probe

SDE

Distributed-systems spine

Shard router and scatter-gather, partial top-k merge with tie-breaking, partial-failure handling, segment write path with background compaction, storage hierarchy (RAM hot, SSD/object cold, raw-vector store), capacity math forcing sharding, metadata filter path and ACL isolation

MLE

Quality-vs-cost frontier

Index family and params (HNSW efSearch/M, IVF nprobe/nlist), quantization (SQ8/PQ/RaBitQ) paired with rescore, Matryoshka truncation, hybrid dense+sparse with RRF, cross-encoder rerank, offline eval harness with exact-kNN ground truth, embedding-model lifecycle and re-embedding

Switcher (SDE → AI)

Mapping to known systems

That this is a sharded, replicated, read-heavy datastore with a write/compaction path and scatter-gather — like Elasticsearch/Lucene or a partitioned KV store. New vocabulary: ANN vs exact kNN, recall@k as an SLO (probabilistic correctness — the big shift), the HNSW/IVF/PQ trio, and same-model-version coupling between ingest and query

A strong switcher answer leads with the systems design they already own and explicitly flags the two ML-specific calls — embedding-model/dimension choice and the recall/quantization operating point — as things to validate with an MLE rather than bluffing the ANN math.

Multi-tenancy

Many collections, many tenants. Tenant isolation is a correctness and security requirement, not an afterthought: filters must enforce ACLs, and a low-selectivity tenant filter must not silently degrade into a full-corpus scan (the filter cliff — Step 6).

Requirements

Functional requirements

Write path: insert, update, and delete vectors (with payload/metadata), at high ingest rates.
Query: top-k approximate nearest neighbors for a query vector or text.
Metadata filtering: structured predicates (tenant, lang, acl, timestamp, numeric ranges) combined with the vector search.
Hybrid retrieval: fuse dense (embedding) and sparse (BM25/SPLADE lexical) results.
Optional rerank: cross-encoder over a small candidate set when quality justifies the latency.
Freshness: newly written vectors become queryable within a bounded delay.

Non-functional requirements

Dimension

Target

Notes

Recall@k

0.95–0.99

First-class SLO; below ~0.90 downstream quality visibly degrades

Latency p50

5–20 ms

Single in-memory ANN hop

Latency p99

under 50–100 ms

+20–50 ms if cross-encoder rerank is on

QPS

10K–100K+

Scaled by replicas; always quoted at a fixed recall

Freshness

seconds to minutes

Use-case dependent

Durability

no silent vector loss

Raw vectors are the source of truth

Cost ceiling

$/M-vectors/mo

RAM is the dominant cost driver

Isolation

per-tenant ACL + fairness

Quotas to prevent noisy neighbors

The defining mental shift: probabilistic correctness

A conventional datastore returns the correct rows. An ANN index returns the probably-nearest rows. So:

Recall@k is an SLO with an error budget, monitored continuously against sampled exact-kNN ground truth.
A QPS number is meaningless without a fixed recall. "12K QPS/node" at recall 0.80 and at recall 0.98 are different machines. Always quote QPS-at-recall.
Most "tuning" is sliding along a recall-vs-latency-vs-memory-vs-dollars surface, not fixing bugs.

Back-of-envelope estimation

Why a single node is infeasible

Take 1B vectors at 1024 dimensions, fp32:

raw vectors = 1e9 * 1024 * 4 bytes = 4.10 TB

HNSW graph = 2M links/node * 4 B ≈ 0.13–0.5 TB (M ~ 16–64)

fp32 index = raw + graph ≈ 4.2–4.6 TB

Raw fp32 alone (~4.1 TB) already blows past any commodity node’s RAM; the graph only adds to it. Two forced conclusions, and they compound:

1. Shard the vector space across many nodes.

2. Quantize so each shard’s working set fits in RAM economically.

Quantization as the economic enabler

Scheme

Bytes/dim (1024-d)

Footprint

Compression

Recall behavior

fp32

~4 KB

Exact codes

SQ8 (int8)

~1 KB

~lossless with rescore

fractional

~64–512 B

8–64x

Lossy; needs rescore

RaBitQ / binary

~0.125

~128 B

~32x

1-bit; only for first-pass, never final ranking

Quantize the codes the index searches over, then rescore the top ~100–500 candidates with full-precision vectors. Quantization makes single-node economics work; rescore buys the recall back.

QPS and compute

In-memory ANN engines deliver roughly 8K–12K QPS/node at ~0.98 recall@10, single-threaded-per-query with a few-ms budget. Scale QPS by adding read replicas.
Each query touches O(efSearch) or O(nprobe · list_size) candidate distance computes per shard — cheap individually, but multiplied by fan-out across shards.
Cross-encoder rerank adds 20–50 ms for 50–200 candidates on a GPU — a separate budget line.

Build and backfill throughput

HNSW build for 100M vectors = hours; a full IVF rebuild at 1B = days.
Embedding backfill is bounded by GPU throughput and dollars — re-embedding a billion-doc corpus is a multi-day, real-money job (Step 7).

The build-time numbers are why incremental updates are mandatory — you cannot rebuild from scratch on every write, so you need a segment-based write path with background compaction.

Memory budget lever

Matryoshka truncation (3072 → 512 dims) cuts the RAM footprint ~6x with a modest, measurable recall cost — a multiplicative cost lever stacked on top of quantization.

API design

Keep the surface minimal and make version parity and consistency explicit.

createCollection(

name,

dim, # e.g. 1024

metric, # cosine | dot | l2

index_params { # HNSW: {M, efConstruction}

family, # IVF: {nlist}

quantization, # sq8 | pq | rabitq | none

rescore # bool: full-precision rescore stage

embedding_model_version # parity contract for this collection

)

upsert(collection, id, vector?, text?, payload) -> {status, version}

# exactly one of vector|text; text is encoded server-side

# payload: arbitrary JSON for filtering (tenant, lang, acl, ts)

delete(collection, id) -> {status} # tombstone; compacted later

search(

collection,

query_vector? | text?, # text encoded server-side, SAME model version

filter, # filter DSL, see below

params {efSearch | nprobe},# recall/latency knob

hybrid?, # {sparse: bm25|splade, fusion: rrf}

rerank?, # {model, top_n}

consistency # strong | eventual

) -> [{id, score, payload, model_version}]

Metric and normalization

Cosine is dot product on L2-normalized vectors — normalize once at ingest and at query so the engine runs the cheaper dot-product kernel. The metric is fixed per collection.

Filter DSL

filter = {

"must": [{"tenant": "acme"}, {"lang": "en"}],

"should": [...],

"must_not":[{"status": "deleted"}],

"range": {"ts": {"gte": 1700000000}}

}

Version tagging — the migration lever

Every vector carries an embedding_model_version. A query encoded by model v2 is meaningless against a corpus embedded by v1: the spaces are unrelated, and relevance silently collapses. So query encoding happens server-side behind the same embedding service used for ingest, and the collection pins a model version. This single field is what makes the re-embedding migration (Step 7) tractable.

Consistency semantics

Offer read-your-writes for interactive flows (route the writer’s reads to the replica that has the new segment, or block on segment visibility) and eventual for bulk ingest, where a few seconds of staleness is fine and cheaper.

Data model

The system is three coupled stores, all keyed by vector id. A query reads across all three.

┌─────────────────────────────────────────┐

id ──► │ 1. ANN index (graph/IVF + quantized │

│ codes) — the searchable structure │

├─────────────────────────────────────────┤

│ 2. Raw vector store (fp32/fp16) │

│ — rescore + re-embedding source │

├─────────────────────────────────────────┤

│ 3. Metadata/payload store + inverted index│

│ — filtering, ACL, tenant isolation │

└─────────────────────────────────────────┘

1. ANN index — the graph (HNSW) or coarse-quantizer + posting lists (IVF), holding the quantized codes the search traverses. This is the hot, RAM-resident structure.

2. Raw full-precision vector store — fp32/fp16 vectors. Used to rescore ANN candidates and as the source for re-embedding migrations. Can live on SSD/object storage since it’s touched only for the top ~hundreds per query.

3. Metadata/payload store + inverted index — structured fields with an inverted index for filtering and ACL enforcement.

Segment model for writes

Writes land in small, immutable segments (like Lucene). New segments are built and made queryable quickly; background compaction merges small segments into larger ones and rebuilds graph/IVF structures offline, so queries never block on builds. Deletes are tombstones (soft deletes); compaction physically removes them and reclaims graph edges.

Why tombstones matter for recall

Ignoring deletes is a silent recall bug: tombstoned vectors still occupy graph nodes and edges, so a query can spend its candidate budget traversing toward dead entries and return fewer live results than k. Compaction is a recall maintenance task, not just space reclamation.

Sharding key and replica layout

Choice

Pro

Con

Random sharding

Even load, simple

Every query fans out to all shards

Clustered (centroid)

Query touches few shards

Drift, rebalancing risk, hot shards

Default to random sharding for balanced fan-out, with read replicas per shard for QPS and failover. Every vector also carries embedding_model_version, which gates dual-index migrations.

High-level architecture

Ingest pipeline

docs → chunk → embed (batched GPU service) → normalize

→ quantize codes → write segment (ANN + raw + metadata)

→ [background] compaction merges segments, rebuilds index

Continuous micro-batching is what makes the embedding tier affordable: it lifts GPU utilization from under 20% (per-request) to over 70%, the difference between a viable and a wasteful backfill.

Query pipeline

query text ─► embed (SAME model version) ─► normalize

│

▼

Router ── scatter ──► Shard 1 ─┐

Shard 2 ─┤ per shard:

... │ filter (inverted idx)

Shard N ─┘ → ANN over codes (top-N)

│ → full-precision rescore

│◄───────── gather partial top-N ─────────┘

▼

merge top-k (heap, tie-break by id)

│

▼

hybrid fuse (RRF with sparse results)

│

▼

optional cross-encoder rerank (top 50–200)

│

▼

return top-k

Scatter-gather details (the SDE core)

Scatter the encoded query to all shards (random sharding ⇒ full fan-out).
Each shard runs filter → ANN over quantized codes → local rescore → returns its partial top-N.
Gather into a bounded merge heap; tie-break deterministically (by id) so results are stable.
Partial-failure handling: if a shard times out or dies, return from the survivors with a degraded flag and a coverage estimate rather than failing the whole query — recall degrades gracefully instead of the request erroring.

Control plane

Manages collection/shard/replica lifecycle, autoscaling on QPS and RAM headroom, rebalancing on node loss, and the blue-green / dual-index machinery for builds and migrations. It is off the query hot path.

Hot/cold tiers

Hot shards in RAM serve the latency SLO; cold or rarely-queried partitions live on SSD via DiskANN-style indexes or in object storage, paged in on demand — trading latency for dollars on the long tail.

Deep dives — where Staff is won

WHERE STAFF IS WON

Deep dive 1: Index family and the recall-latency curve

Family

Strength

Cost / caveat

HNSW

Best in-RAM recall/latency

High RAM; graph overhead

IVF-PQ

Compact, tunable

nprobe trades recall for latency

DiskANN / Vamana

1B+ on SSD

Higher latency; SSD-bound

SPFresh

Fresh + on-disk updates

Newer, more operational complexity

HNSW knobs: M (edges per node) and efConstruction set graph quality at build time; efSearch is the runtime recall-latency dial — raise it for higher recall at more distance computes. IVF knobs: nlist (number of coarse cells) at build, nprobe (cells probed per query) at runtime.

For billion-scale corpora that can’t fit RAM economically, DiskANN keeps the graph on SSD and serves at higher-but-acceptable latency. Across all families the pattern is identical: search quantized codes, then rescore the top ~100–500 with full precision to recover the recall quantization gave up.

Trap: serving the final ranking off binary/PQ codes. Quantized distances are fine for the first-pass candidate set but too lossy for ordering — they tank recall. The rescore stage exists precisely to fix this.

Deep dive 2: Quantization + rescore economics

ANN over compressed codes → top 200 candidates (cheap, RAM-resident)

fetch fp32 vectors → exact distance (SSD/RAM, ~200 reads)

re-sort → top k (recall recovered)

This two-stage shape is the economic heart of the system: PQ/RaBitQ shrink the searchable structure 8–64x so it fits in RAM, and a bounded rescore over a few hundred full-precision vectors restores recall@k to target. The rescore cost is fixed per query (hundreds of vector reads), so it doesn’t scale with corpus size.

Deep dive 3: Filtered search and the selectivity cliff

The hardest interaction in the system. Three strategies:

Strategy

When it works

Failure mode

Pre-filter (filter then ANN)

Low selectivity (few pass)

ANN graph fragments → near brute force

Post-filter (ANN then filter)

High selectivity (most pass)

May not fill k; re-query

Filtered ANN (in-traversal)

Middle ground

Engine-specific complexity

The cliff: when a filter is very selective (say 0.1% of vectors pass), pre-filtering leaves the HNSW graph so sparse that traversal collapses toward brute force, spiking latency 10–100x. Post-filtering on the same query throws away almost everything and fails to fill k. The fix is adaptive routing: estimate selectivity from the inverted index and choose pre-filter, in-traversal filtering, or brute-force-over-the-filtered-set accordingly. Tenant/ACL filters are exactly the low-selectivity case — so isolation correctness and the latency cliff are the same problem.

Deep dive 4: Incremental updates and compaction

Writes go to fresh segments queryable in seconds; background compaction merges segments and rebuilds graphs without blocking queries. Tombstone deletes are reclaimed during compaction. The tuning tension: compact too rarely and you accumulate many small segments (fan-out within a shard balloons, deleted nodes pollute recall); compact too aggressively and you burn CPU/IO rebuilding graphs. This is the freshness-vs-build-cost knob.

Trap: ignoring deletes. Tombstones that never compact degrade recall over time as queries waste their candidate budget on dead graph nodes.

Deep dive 5: Hybrid retrieval, RRF, and rerank

Dense and lexical retrieval have complementary failure modes: dense embeddings miss exact tokens, rare IDs, and out-of-distribution jargon; BM25/SPLADE miss paraphrase and semantic similarity. Run both and fuse with Reciprocal Rank Fusion:

RRF(d) = Σ_retrievers 1 / (k_rrf + rank_r(d)) # k_rrf ≈ 60

RRF fuses by rank, not raw scores, so dense cosine and BM25 magnitudes never need calibrating against each other — robust and nearly parameter-free.

When quality justifies it, a cross-encoder reranks the top 50–200 fused candidates, jointly attending to query and document for far better ordering — at +20–50 ms. Reserve it for cases where the recall/precision lift pays for the latency; it does not earn its cost on every query. Late-interaction (ColBERT-style) rerankers sit between bi-encoder speed and cross-encoder quality.

Deep dive 6: Matryoshka dimension truncation

Matryoshka-trained embeddings pack coarse-to-fine information front-loaded across dimensions, so you can truncate 3072 → 512 and still retrieve well. That’s a ~6–12x RAM cut, multiplicative with quantization. A common pattern: search on truncated dims, rescore on full dims — dimension truncation as its own first-pass/rescore split.

Deep dive 7: Measuring recall

Recall is meaningless asserted; it must be measured against exact kNN. Sample real queries, compute brute-force exact top-k offline as ground truth, and track recall@k per slice (per tenant, per language, per filter pattern). Wire this into the eval harness and the production monitor — QPS-at-recall, not QPS alone.

Trap: model/version skew. If ingest and query encoders drift even one version apart, relevance silently collapses while every systems metric looks green. Version parity is a correctness invariant, enforced by encoding queries through the same pinned model.

Multi-team rollout

Index builds and blue-green swaps

Build new index segments offline, validate recall against sampled exact-kNN ground truth, then blue-green swap the serving index atomically — old index stays warm as instant rollback. Never mutate the live serving structure in place during a rebuild.

The hard one: re-embedding migration

When the embedding model upgrades (v1 → v2), the entire corpus must be re-embedded — the spaces are incompatible, so a mixed-version index returns garbage. At billion scale this is a multi-day, real-money operation:

1. Dual-index: stand up the v2 index alongside v1; backfill by re-embedding from the raw vector store / source docs (the reason we keep full-precision vectors).

2. Shadow-read: mirror live query traffic to v2, encoding each query with v2, and compare recall/relevance offline — no user impact.

3. Cutover: once v2 meets the recall SLO, flip traffic (canary by tenant/percentage), keeping v1 warm for rollback until confident.

The embedding_model_version field on every vector and the pinned per-collection model are what make each stage safe and auditable.

Monitoring

Signal

Why

recall@k vs exact-kNN

The SLO; sampled continuously

QPS-at-recall, p50/p99

Capacity at fixed quality

Embedding drift

Catches silent relevance decay

Tombstone ratio / segment count

Compaction health → recall health

GPU / RAM utilization, headroom

Cost and capacity

Failover and fairness

Replica failover within a shard; rebalancing on node loss (with the drift caveat for clustered sharding); per-tenant quotas and fairness so one tenant’s burst or low-selectivity filter can’t starve others.

Bottlenecks & evolution

Where it breaks

Fan-out cost at many shards — full scatter-gather to hundreds of shards dominates tail latency. Mitigation: coarse-centroid query routing to probe only the shards likely to hold neighbors — at the cost of drift and rebalancing complexity, and a recall risk if routing is wrong.
RAM cost — the dominant dollar line. Push to SSD/DiskANN plus aggressive quantization and Matryoshka truncation; accept higher latency on cold partitions.
Filter cliffs — low-selectivity filters collapsing toward brute force. Mitigation: adaptive pre/post/in-traversal routing driven by selectivity estimates.
Freshness vs build cost — frequent compaction is expensive; rare compaction hurts recall. Mitigation: in-place / fresh-update indexes (SPFresh-style) to narrow the gap.
Re-embedding cost — re-encoding billions on every model bump. Mitigation: Matryoshka (cheaper to store/migrate) and distilled/smaller encoders.
Rerank latency — cross-encoders are expensive per query. Mitigation: late-interaction (ColBERT-style) and learned-sparse retrieval that move quality earlier and cheaper.

Trends to flag

RaBitQ / 1-bit quantization adoption for first-pass candidate generation (always with rescore).
GPU-resident indexes for very high QPS at fixed recall.
Disaggregated storage — compute/index separated from a cold raw-vector tier for elastic scaling and cheaper cold capacity.

✓

Summary

1. The spine is a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — a read-heavy, replicated datastore with a segment write path and background compaction, much like a search engine.

2. Recall is a first-class SLO with an error budget, measured against sampled exact-kNN — not an assumption. Always quote QPS-at-a-fixed-recall; a QPS number without a recall is meaningless.

3. Quantization + full-precision rescore is the economic enabler: compress codes 8–64x so the index fits in RAM, then rescore the top hundreds to win recall back. Capacity math (1B fp32 ≈ 4–5 TB) forces both sharding and quantization.

4. Design filtered and hybrid retrieval as first-class: handle the selectivity cliff with adaptive routing (it’s the same problem as ACL/tenant isolation), and fuse dense + sparse via RRF because their failure modes are complementary.

5. Have a real re-embedding migration story — dual-index, shadow-read, cutover — backed by an embedding_model_version field and strict ingest/query version parity, the invariant whose violation silently destroys relevance.

6. Role split: the SDE owns the systems spine (router, scatter-gather, compaction, storage hierarchy); the MLE owns the recall/quantization frontier and the embedding lifecycle. A switcher should lead with the partitioned-datastore design they already know and flag embedding-model/dimension and the recall operating point as the two ML calls to validate — not bluff the ANN math.

★

Rubric — Senior vs Staff

Dimension

Senior signal

Staff signal

Index family

Picks HNSW and tunes ef.

Chooses HNSW vs IVF-PQ vs DiskANN per shard on the recall-latency-memory curve and justifies the tradeoff.

Capacity & sharding

Knows vectors need many machines.

Derives 1B fp32 HNSW = ~6–8 TB → must shard; designs scatter-gather with partial top-k merge and partial-failure handling.

Quantization

Mentions PQ to save memory.

Layers SQ8 hot / PQ or RaBitQ binary cold, always with a full-precision rescore of the top ~100–500; never ranks off binary codes alone.

Filtering

Applies metadata filters after search.

Routes by selectivity (integrated/pre/post), handles the low-selectivity cliff, and enforces ACL/tenant isolation at retrieval time.

Recall as SLO

Targets high recall.

Quotes QPS at a fixed recall@k, monitors recall against sampled exact-kNN ground truth, and treats it with error budgets like latency.

Hybrid & rerank

Uses dense embeddings.

Fuses dense + sparse (BM25/SPLADE) via RRF for complementary failure modes; adds a cross-encoder rerank only where latency budget justifies it.

Freshness & migration

Rebuilds the index periodically.

Segment-based incremental writes + compaction; plans the re-embedding migration (dual-index, shadow-read, cutover) on model upgrade.

★ MORE WALKTHROUGHS

Want more breakdowns like this?

Join free early access for upcoming RAG, LLM eval, agents, and AI infrastructure walkthroughs.

Join Free Early Access →