← Back to all questions
AI System DesignStaffVector SearchANN Retrieval

Design a Vector Search & Embedding Retrieval System at Scale

A Staff-level walkthrough of the vector search retrieval substrate behind Google, OpenAI, and Meta's RAG, search, and recsys stacks. It builds a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — treating recall as a first-class SLO and quantization-plus-rescore as the economic enabler — and covers filtered + hybrid retrieval, freshness, and the re-embedding migration story that separates staff-level answers.

Level
Staff
Category
AI Infrastructure · Vector Retrieval
Interview time
60 min
100% free · No login required
WHAT THIS QUESTION TESTS
·Index family choice (HNSW vs IVF-PQ vs DiskANN) on the recall-latency curve
·Sharding + scatter-gather with partial top-k merge at billion-vector scale
·Quantization (SQ8/PQ/RaBitQ) paired with a full-precision rescore stage
·Filtered + hybrid (dense + sparse / RRF) retrieval with recall as a first-class SLO
★ STAFF-LEVEL SIGNALS
Quotes QPS at a fixed recall@k (e.g. 8K QPS @ 0.98 recall@10), never QPS alone
Plans sharding from capacity math: 1B fp32 HNSW = ~6–8 TB, infeasible on one node
Handles the selectivity cliff: low-selectivity pre-filters collapse HNSW toward brute force
Owns the re-embedding migration: dual-index, shadow-read, multi-day cutover on model upgrade
0

Scope & ambiguity

Let me frame this as a standalone retrieval substrate — the service that turns a query into top-k nearest neighbors over billions of embeddings — not an end-to-end RAG app. I’ll own the index, the storage hierarchy, sharded scatter-gather serving, filtered and hybrid retrieval, and the freshness/migration story. I’ll explicitly exclude LLM generation, prompt assembly, and downstream ranking business logic — those are consumers of my top-k. The single most important reframing: correctness here is probabilistic. We don’t return the nearest neighbors, we return them with a target recall@k SLO, and almost every design knob is really a recall-vs-latency-vs-dollars trade.

This is the kind of system that sits underneath search, RAG, and recsys at Google, OpenAI, Meta, and dedicated vendors like Pinecone — a shared substrate many teams query. Interviewers use it to probe whether you can run a sharded, replicated, read-heavy datastore where the answer itself is an approximation with an error budget.

Service boundary

In scope: insert/update/delete vectors, top-k ANN query, metadata filtering, hybrid dense+sparse fusion, optional rerank, freshness, and the re-embedding migration when the embedding model changes.

Out of scope: the LLM that consumes the chunks, prompt construction, the chunking policy (we accept chunks), and business ranking rules layered on top of retrieval.

Use cases

  • Semantic search — natural-language query over a corpus.
  • RAG retrieval — fetch grounding chunks for an LLM.
  • Recsys candidate generation — user/item embedding → candidate set for a downstream ranker.
  • Dedup / near-duplicate detection — find vectors within a similarity threshold.

Service contract

Given a query embedding (or text we encode server-side) plus optional metadata filters, return the top-k approximate nearest neighbors at a stated recall SLO, within a p99 latency budget, respecting tenant/ACL isolation.

Who asks this & what they probe

Role
Focus
What they probe
SDE
Distributed-systems spine
Shard router and scatter-gather, partial top-k merge with tie-breaking, partial-failure handling, segment write path with background compaction, storage hierarchy (RAM hot, SSD/object cold, raw-vector store), capacity math forcing sharding, metadata filter path and ACL isolation
MLE
Quality-vs-cost frontier
Index family and params (HNSW efSearch/M, IVF nprobe/nlist), quantization (SQ8/PQ/RaBitQ) paired with rescore, Matryoshka truncation, hybrid dense+sparse with RRF, cross-encoder rerank, offline eval harness with exact-kNN ground truth, embedding-model lifecycle and re-embedding
Switcher (SDE → AI)
Mapping to known systems
That this is a sharded, replicated, read-heavy datastore with a write/compaction path and scatter-gather — like Elasticsearch/Lucene or a partitioned KV store. New vocabulary: ANN vs exact kNN, recall@k as an SLO (probabilistic correctness — the big shift), the HNSW/IVF/PQ trio, and same-model-version coupling between ingest and query

A strong switcher answer leads with the systems design they already own and explicitly flags the two ML-specific calls — embedding-model/dimension choice and the recall/quantization operating point — as things to validate with an MLE rather than bluffing the ANN math.

Multi-tenancy

Many collections, many tenants. Tenant isolation is a correctness and security requirement, not an afterthought: filters must enforce ACLs, and a low-selectivity tenant filter must not silently degrade into a full-corpus scan (the filter cliff — Step 6).

1

Requirements

Functional requirements

  • Write path: insert, update, and delete vectors (with payload/metadata), at high ingest rates.
  • Query: top-k approximate nearest neighbors for a query vector or text.
  • Metadata filtering: structured predicates (tenant, lang, acl, timestamp, numeric ranges) combined with the vector search.
  • Hybrid retrieval: fuse dense (embedding) and sparse (BM25/SPLADE lexical) results.
  • Optional rerank: cross-encoder over a small candidate set when quality justifies the latency.
  • Freshness: newly written vectors become queryable within a bounded delay.

Non-functional requirements

Dimension
Target
Notes
Recall@k
0.95–0.99
First-class SLO; below ~0.90 downstream quality visibly degrades
Latency p50
5–20 ms
Single in-memory ANN hop
Latency p99
under 50–100 ms
+20–50 ms if cross-encoder rerank is on
QPS
10K–100K+
Scaled by replicas; always quoted at a fixed recall
Freshness
seconds to minutes
Use-case dependent
Durability
no silent vector loss
Raw vectors are the source of truth
Cost ceiling
$/M-vectors/mo
RAM is the dominant cost driver
Isolation
per-tenant ACL + fairness
Quotas to prevent noisy neighbors

The defining mental shift: probabilistic correctness

A conventional datastore returns the correct rows. An ANN index returns the probably-nearest rows. So:

  • Recall@k is an SLO with an error budget, monitored continuously against sampled exact-kNN ground truth.
  • A QPS number is meaningless without a fixed recall. "12K QPS/node" at recall 0.80 and at recall 0.98 are different machines. Always quote QPS-at-recall.
  • Most "tuning" is sliding along a recall-vs-latency-vs-memory-vs-dollars surface, not fixing bugs.
2

Back-of-envelope estimation

Why a single node is infeasible

Take 1B vectors at 1024 dimensions, fp32:

raw vectors = 1e9 * 1024 * 4 bytes = 4.10 TB
HNSW graph = 2M links/node * 4 B ≈ 0.13–0.5 TB (M ~ 16–64)
fp32 index = raw + graph ≈ 4.2–4.6 TB

Raw fp32 alone (~4.1 TB) already blows past any commodity node’s RAM; the graph only adds to it. Two forced conclusions, and they compound:

1. Shard the vector space across many nodes.

2. Quantize so each shard’s working set fits in RAM economically.

Quantization as the economic enabler

Scheme
Bytes/dim (1024-d)
Footprint
Compression
Recall behavior
fp32
4
~4 KB
1x
Exact codes
SQ8 (int8)
1
~1 KB
4x
~lossless with rescore
PQ
fractional
~64–512 B
8–64x
Lossy; needs rescore
RaBitQ / binary
~0.125
~128 B
~32x
1-bit; only for first-pass, never final ranking

Quantize the codes the index searches over, then rescore the top ~100–500 candidates with full-precision vectors. Quantization makes single-node economics work; rescore buys the recall back.

QPS and compute

  • In-memory ANN engines deliver roughly 8K–12K QPS/node at ~0.98 recall@10, single-threaded-per-query with a few-ms budget. Scale QPS by adding read replicas.
  • Each query touches O(efSearch) or O(nprobe · list_size) candidate distance computes per shard — cheap individually, but multiplied by fan-out across shards.
  • Cross-encoder rerank adds 20–50 ms for 50–200 candidates on a GPU — a separate budget line.

Build and backfill throughput

  • HNSW build for 100M vectors = hours; a full IVF rebuild at 1B = days.
  • Embedding backfill is bounded by GPU throughput and dollars — re-embedding a billion-doc corpus is a multi-day, real-money job (Step 7).

The build-time numbers are why incremental updates are mandatory — you cannot rebuild from scratch on every write, so you need a segment-based write path with background compaction.

Memory budget lever

Matryoshka truncation (3072 → 512 dims) cuts the RAM footprint ~6x with a modest, measurable recall cost — a multiplicative cost lever stacked on top of quantization.

3

API design

Keep the surface minimal and make version parity and consistency explicit.

createCollection(
name,
dim, # e.g. 1024
metric, # cosine | dot | l2
index_params { # HNSW: {M, efConstruction}
family, # IVF: {nlist}
quantization, # sq8 | pq | rabitq | none
rescore # bool: full-precision rescore stage
},
embedding_model_version # parity contract for this collection
)
 
upsert(collection, id, vector?, text?, payload) -> {status, version}
# exactly one of vector|text; text is encoded server-side
# payload: arbitrary JSON for filtering (tenant, lang, acl, ts)
 
delete(collection, id) -> {status} # tombstone; compacted later
 
search(
collection,
query_vector? | text?, # text encoded server-side, SAME model version
k,
filter, # filter DSL, see below
params {efSearch | nprobe},# recall/latency knob
hybrid?, # {sparse: bm25|splade, fusion: rrf}
rerank?, # {model, top_n}
consistency # strong | eventual
) -> [{id, score, payload, model_version}]

Metric and normalization

Cosine is dot product on L2-normalized vectors — normalize once at ingest and at query so the engine runs the cheaper dot-product kernel. The metric is fixed per collection.

Filter DSL

filter = {
"must": [{"tenant": "acme"}, {"lang": "en"}],
"should": [...],
"must_not":[{"status": "deleted"}],
"range": {"ts": {"gte": 1700000000}}
}

Version tagging — the migration lever

Every vector carries an embedding_model_version. A query encoded by model v2 is meaningless against a corpus embedded by v1: the spaces are unrelated, and relevance silently collapses. So query encoding happens server-side behind the same embedding service used for ingest, and the collection pins a model version. This single field is what makes the re-embedding migration (Step 7) tractable.

Consistency semantics

Offer read-your-writes for interactive flows (route the writer’s reads to the replica that has the new segment, or block on segment visibility) and eventual for bulk ingest, where a few seconds of staleness is fine and cheaper.

4

Data model

The system is three coupled stores, all keyed by vector id. A query reads across all three.

┌─────────────────────────────────────────┐
id ──► │ 1. ANN index (graph/IVF + quantized │
│ codes) — the searchable structure │
├─────────────────────────────────────────┤
│ 2. Raw vector store (fp32/fp16) │
│ — rescore + re-embedding source │
├─────────────────────────────────────────┤
│ 3. Metadata/payload store + inverted index│
│ — filtering, ACL, tenant isolation │
└─────────────────────────────────────────┘

1. ANN index — the graph (HNSW) or coarse-quantizer + posting lists (IVF), holding the quantized codes the search traverses. This is the hot, RAM-resident structure.

2. Raw full-precision vector store — fp32/fp16 vectors. Used to rescore ANN candidates and as the source for re-embedding migrations. Can live on SSD/object storage since it’s touched only for the top ~hundreds per query.

3. Metadata/payload store + inverted index — structured fields with an inverted index for filtering and ACL enforcement.

Segment model for writes

Writes land in small, immutable segments (like Lucene). New segments are built and made queryable quickly; background compaction merges small segments into larger ones and rebuilds graph/IVF structures offline, so queries never block on builds. Deletes are tombstones (soft deletes); compaction physically removes them and reclaims graph edges.

Why tombstones matter for recall

Ignoring deletes is a silent recall bug: tombstoned vectors still occupy graph nodes and edges, so a query can spend its candidate budget traversing toward dead entries and return fewer live results than k. Compaction is a recall maintenance task, not just space reclamation.

Sharding key and replica layout

Choice
Pro
Con
Random sharding
Even load, simple
Every query fans out to all shards
Clustered (centroid)
Query touches few shards
Drift, rebalancing risk, hot shards

Default to random sharding for balanced fan-out, with read replicas per shard for QPS and failover. Every vector also carries embedding_model_version, which gates dual-index migrations.

5

High-level architecture

Ingest pipeline

docs → chunk → embed (batched GPU service) → normalize
→ quantize codes → write segment (ANN + raw + metadata)
→ [background] compaction merges segments, rebuilds index

Continuous micro-batching is what makes the embedding tier affordable: it lifts GPU utilization from under 20% (per-request) to over 70%, the difference between a viable and a wasteful backfill.

Query pipeline

query text ─► embed (SAME model version) ─► normalize
Router ── scatter ──► Shard 1 ─┐
Shard 2 ─┤ per shard:
... │ filter (inverted idx)
Shard N ─┘ → ANN over codes (top-N)
│ → full-precision rescore
│◄───────── gather partial top-N ─────────┘
merge top-k (heap, tie-break by id)
hybrid fuse (RRF with sparse results)
optional cross-encoder rerank (top 50–200)
return top-k

Scatter-gather details (the SDE core)

  • Scatter the encoded query to all shards (random sharding ⇒ full fan-out).
  • Each shard runs filter → ANN over quantized codes → local rescore → returns its partial top-N.
  • Gather into a bounded merge heap; tie-break deterministically (by id) so results are stable.
  • Partial-failure handling: if a shard times out or dies, return from the survivors with a degraded flag and a coverage estimate rather than failing the whole query — recall degrades gracefully instead of the request erroring.

Control plane

Manages collection/shard/replica lifecycle, autoscaling on QPS and RAM headroom, rebalancing on node loss, and the blue-green / dual-index machinery for builds and migrations. It is off the query hot path.

Hot/cold tiers

Hot shards in RAM serve the latency SLO; cold or rarely-queried partitions live on SSD via DiskANN-style indexes or in object storage, paged in on demand — trading latency for dollars on the long tail.

6

Deep dives — where Staff is won

WHERE STAFF IS WON

Deep dive 1: Index family and the recall-latency curve

Family
Strength
Cost / caveat
HNSW
Best in-RAM recall/latency
High RAM; graph overhead
IVF-PQ
Compact, tunable
nprobe trades recall for latency
DiskANN / Vamana
1B+ on SSD
Higher latency; SSD-bound
SPFresh
Fresh + on-disk updates
Newer, more operational complexity

HNSW knobs: M (edges per node) and efConstruction set graph quality at build time; efSearch is the runtime recall-latency dial — raise it for higher recall at more distance computes. IVF knobs: nlist (number of coarse cells) at build, nprobe (cells probed per query) at runtime.

For billion-scale corpora that can’t fit RAM economically, DiskANN keeps the graph on SSD and serves at higher-but-acceptable latency. Across all families the pattern is identical: search quantized codes, then rescore the top ~100–500 with full precision to recover the recall quantization gave up.

Trap: serving the final ranking off binary/PQ codes. Quantized distances are fine for the first-pass candidate set but too lossy for ordering — they tank recall. The rescore stage exists precisely to fix this.

Deep dive 2: Quantization + rescore economics

ANN over compressed codes → top 200 candidates (cheap, RAM-resident)
fetch fp32 vectors → exact distance (SSD/RAM, ~200 reads)
re-sort → top k (recall recovered)

This two-stage shape is the economic heart of the system: PQ/RaBitQ shrink the searchable structure 8–64x so it fits in RAM, and a bounded rescore over a few hundred full-precision vectors restores recall@k to target. The rescore cost is fixed per query (hundreds of vector reads), so it doesn’t scale with corpus size.

Deep dive 3: Filtered search and the selectivity cliff

The hardest interaction in the system. Three strategies:

Strategy
When it works
Failure mode
Pre-filter (filter then ANN)
Low selectivity (few pass)
ANN graph fragments → near brute force
Post-filter (ANN then filter)
High selectivity (most pass)
May not fill k; re-query
Filtered ANN (in-traversal)
Middle ground
Engine-specific complexity

The cliff: when a filter is very selective (say 0.1% of vectors pass), pre-filtering leaves the HNSW graph so sparse that traversal collapses toward brute force, spiking latency 10–100x. Post-filtering on the same query throws away almost everything and fails to fill k. The fix is adaptive routing: estimate selectivity from the inverted index and choose pre-filter, in-traversal filtering, or brute-force-over-the-filtered-set accordingly. Tenant/ACL filters are exactly the low-selectivity case — so isolation correctness and the latency cliff are the same problem.

Deep dive 4: Incremental updates and compaction

Writes go to fresh segments queryable in seconds; background compaction merges segments and rebuilds graphs without blocking queries. Tombstone deletes are reclaimed during compaction. The tuning tension: compact too rarely and you accumulate many small segments (fan-out within a shard balloons, deleted nodes pollute recall); compact too aggressively and you burn CPU/IO rebuilding graphs. This is the freshness-vs-build-cost knob.

Trap: ignoring deletes. Tombstones that never compact degrade recall over time as queries waste their candidate budget on dead graph nodes.

Deep dive 5: Hybrid retrieval, RRF, and rerank

Dense and lexical retrieval have complementary failure modes: dense embeddings miss exact tokens, rare IDs, and out-of-distribution jargon; BM25/SPLADE miss paraphrase and semantic similarity. Run both and fuse with Reciprocal Rank Fusion:

RRF(d) = Σ_retrievers 1 / (k_rrf + rank_r(d)) # k_rrf ≈ 60

RRF fuses by rank, not raw scores, so dense cosine and BM25 magnitudes never need calibrating against each other — robust and nearly parameter-free.

When quality justifies it, a cross-encoder reranks the top 50–200 fused candidates, jointly attending to query and document for far better ordering — at +20–50 ms. Reserve it for cases where the recall/precision lift pays for the latency; it does not earn its cost on every query. Late-interaction (ColBERT-style) rerankers sit between bi-encoder speed and cross-encoder quality.

Deep dive 6: Matryoshka dimension truncation

Matryoshka-trained embeddings pack coarse-to-fine information front-loaded across dimensions, so you can truncate 3072 → 512 and still retrieve well. That’s a ~6–12x RAM cut, multiplicative with quantization. A common pattern: search on truncated dims, rescore on full dims — dimension truncation as its own first-pass/rescore split.

Deep dive 7: Measuring recall

Recall is meaningless asserted; it must be measured against exact kNN. Sample real queries, compute brute-force exact top-k offline as ground truth, and track recall@k per slice (per tenant, per language, per filter pattern). Wire this into the eval harness and the production monitor — QPS-at-recall, not QPS alone.

Trap: model/version skew. If ingest and query encoders drift even one version apart, relevance silently collapses while every systems metric looks green. Version parity is a correctness invariant, enforced by encoding queries through the same pinned model.
7

Multi-team rollout

Index builds and blue-green swaps

Build new index segments offline, validate recall against sampled exact-kNN ground truth, then blue-green swap the serving index atomically — old index stays warm as instant rollback. Never mutate the live serving structure in place during a rebuild.

The hard one: re-embedding migration

When the embedding model upgrades (v1 → v2), the entire corpus must be re-embedded — the spaces are incompatible, so a mixed-version index returns garbage. At billion scale this is a multi-day, real-money operation:

1. Dual-index: stand up the v2 index alongside v1; backfill by re-embedding from the raw vector store / source docs (the reason we keep full-precision vectors).

2. Shadow-read: mirror live query traffic to v2, encoding each query with v2, and compare recall/relevance offline — no user impact.

3. Cutover: once v2 meets the recall SLO, flip traffic (canary by tenant/percentage), keeping v1 warm for rollback until confident.

The embedding_model_version field on every vector and the pinned per-collection model are what make each stage safe and auditable.

Monitoring

Signal
Why
recall@k vs exact-kNN
The SLO; sampled continuously
QPS-at-recall, p50/p99
Capacity at fixed quality
Embedding drift
Catches silent relevance decay
Tombstone ratio / segment count
Compaction health → recall health
GPU / RAM utilization, headroom
Cost and capacity

Failover and fairness

Replica failover within a shard; rebalancing on node loss (with the drift caveat for clustered sharding); per-tenant quotas and fairness so one tenant’s burst or low-selectivity filter can’t starve others.

8

Bottlenecks & evolution

Where it breaks

  • Fan-out cost at many shards — full scatter-gather to hundreds of shards dominates tail latency. Mitigation: coarse-centroid query routing to probe only the shards likely to hold neighbors — at the cost of drift and rebalancing complexity, and a recall risk if routing is wrong.
  • RAM cost — the dominant dollar line. Push to SSD/DiskANN plus aggressive quantization and Matryoshka truncation; accept higher latency on cold partitions.
  • Filter cliffs — low-selectivity filters collapsing toward brute force. Mitigation: adaptive pre/post/in-traversal routing driven by selectivity estimates.
  • Freshness vs build cost — frequent compaction is expensive; rare compaction hurts recall. Mitigation: in-place / fresh-update indexes (SPFresh-style) to narrow the gap.
  • Re-embedding cost — re-encoding billions on every model bump. Mitigation: Matryoshka (cheaper to store/migrate) and distilled/smaller encoders.
  • Rerank latency — cross-encoders are expensive per query. Mitigation: late-interaction (ColBERT-style) and learned-sparse retrieval that move quality earlier and cheaper.

Trends to flag

  • RaBitQ / 1-bit quantization adoption for first-pass candidate generation (always with rescore).
  • GPU-resident indexes for very high QPS at fixed recall.
  • Disaggregated storage — compute/index separated from a cold raw-vector tier for elastic scaling and cheaper cold capacity.

Summary

1. The spine is a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — a read-heavy, replicated datastore with a segment write path and background compaction, much like a search engine.

2. Recall is a first-class SLO with an error budget, measured against sampled exact-kNN — not an assumption. Always quote QPS-at-a-fixed-recall; a QPS number without a recall is meaningless.

3. Quantization + full-precision rescore is the economic enabler: compress codes 8–64x so the index fits in RAM, then rescore the top hundreds to win recall back. Capacity math (1B fp32 ≈ 4–5 TB) forces both sharding and quantization.

4. Design filtered and hybrid retrieval as first-class: handle the selectivity cliff with adaptive routing (it’s the same problem as ACL/tenant isolation), and fuse dense + sparse via RRF because their failure modes are complementary.

5. Have a real re-embedding migration story — dual-index, shadow-read, cutover — backed by an embedding_model_version field and strict ingest/query version parity, the invariant whose violation silently destroys relevance.

6. Role split: the SDE owns the systems spine (router, scatter-gather, compaction, storage hierarchy); the MLE owns the recall/quantization frontier and the embedding lifecycle. A switcher should lead with the partitioned-datastore design they already know and flag embedding-model/dimension and the recall operating point as the two ML calls to validate — not bluff the ANN math.

Rubric — Senior vs Staff

Dimension
Senior signal
Staff signal
Index family
Picks HNSW and tunes ef.
Chooses HNSW vs IVF-PQ vs DiskANN per shard on the recall-latency-memory curve and justifies the tradeoff.
Capacity & sharding
Knows vectors need many machines.
Derives 1B fp32 HNSW = ~6–8 TB → must shard; designs scatter-gather with partial top-k merge and partial-failure handling.
Quantization
Mentions PQ to save memory.
Layers SQ8 hot / PQ or RaBitQ binary cold, always with a full-precision rescore of the top ~100–500; never ranks off binary codes alone.
Filtering
Applies metadata filters after search.
Routes by selectivity (integrated/pre/post), handles the low-selectivity cliff, and enforces ACL/tenant isolation at retrieval time.
Recall as SLO
Targets high recall.
Quotes QPS at a fixed recall@k, monitors recall against sampled exact-kNN ground truth, and treats it with error budgets like latency.
Hybrid & rerank
Uses dense embeddings.
Fuses dense + sparse (BM25/SPLADE) via RRF for complementary failure modes; adds a cross-encoder rerank only where latency budget justifies it.
Freshness & migration
Rebuilds the index periodically.
Segment-based incremental writes + compaction; plans the re-embedding migration (dual-index, shadow-read, cutover) on model upgrade.
★ MORE WALKTHROUGHS

Want more breakdowns like this?

Join free early access for upcoming RAG, LLM eval, agents, and AI infrastructure walkthroughs.

Join Free Early Access →