Design a Vector Search & Embedding Retrieval System at Scale
A Staff-level walkthrough of the vector search retrieval substrate behind Google, OpenAI, and Meta's RAG, search, and recsys stacks. It builds a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — treating recall as a first-class SLO and quantization-plus-rescore as the economic enabler — and covers filtered + hybrid retrieval, freshness, and the re-embedding migration story that separates staff-level answers.
Scope & ambiguity
Let me frame this as a standalone retrieval substrate — the service that turns a query into top-k nearest neighbors over billions of embeddings — not an end-to-end RAG app. I’ll own the index, the storage hierarchy, sharded scatter-gather serving, filtered and hybrid retrieval, and the freshness/migration story. I’ll explicitly exclude LLM generation, prompt assembly, and downstream ranking business logic — those are consumers of my top-k. The single most important reframing: correctness here is probabilistic. We don’t return the nearest neighbors, we return them with a target recall@k SLO, and almost every design knob is really a recall-vs-latency-vs-dollars trade.
This is the kind of system that sits underneath search, RAG, and recsys at Google, OpenAI, Meta, and dedicated vendors like Pinecone — a shared substrate many teams query. Interviewers use it to probe whether you can run a sharded, replicated, read-heavy datastore where the answer itself is an approximation with an error budget.
Service boundary
In scope: insert/update/delete vectors, top-k ANN query, metadata filtering, hybrid dense+sparse fusion, optional rerank, freshness, and the re-embedding migration when the embedding model changes.
Out of scope: the LLM that consumes the chunks, prompt construction, the chunking policy (we accept chunks), and business ranking rules layered on top of retrieval.
Use cases
- Semantic search — natural-language query over a corpus.
- RAG retrieval — fetch grounding chunks for an LLM.
- Recsys candidate generation — user/item embedding → candidate set for a downstream ranker.
- Dedup / near-duplicate detection — find vectors within a similarity threshold.
Service contract
Given a query embedding (or text we encode server-side) plus optional metadata filters, return the top-k approximate nearest neighbors at a stated recall SLO, within a p99 latency budget, respecting tenant/ACL isolation.
Who asks this & what they probe
A strong switcher answer leads with the systems design they already own and explicitly flags the two ML-specific calls — embedding-model/dimension choice and the recall/quantization operating point — as things to validate with an MLE rather than bluffing the ANN math.
Multi-tenancy
Many collections, many tenants. Tenant isolation is a correctness and security requirement, not an afterthought: filters must enforce ACLs, and a low-selectivity tenant filter must not silently degrade into a full-corpus scan (the filter cliff — Step 6).
Requirements
Functional requirements
- Write path: insert, update, and delete vectors (with payload/metadata), at high ingest rates.
- Query: top-k approximate nearest neighbors for a query vector or text.
- Metadata filtering: structured predicates (tenant, lang, acl, timestamp, numeric ranges) combined with the vector search.
- Hybrid retrieval: fuse dense (embedding) and sparse (BM25/SPLADE lexical) results.
- Optional rerank: cross-encoder over a small candidate set when quality justifies the latency.
- Freshness: newly written vectors become queryable within a bounded delay.
Non-functional requirements
The defining mental shift: probabilistic correctness
A conventional datastore returns the correct rows. An ANN index returns the probably-nearest rows. So:
- Recall@k is an SLO with an error budget, monitored continuously against sampled exact-kNN ground truth.
- A QPS number is meaningless without a fixed recall. "12K QPS/node" at recall 0.80 and at recall 0.98 are different machines. Always quote QPS-at-recall.
- Most "tuning" is sliding along a recall-vs-latency-vs-memory-vs-dollars surface, not fixing bugs.
Back-of-envelope estimation
Why a single node is infeasible
Take 1B vectors at 1024 dimensions, fp32:
Raw fp32 alone (~4.1 TB) already blows past any commodity node’s RAM; the graph only adds to it. Two forced conclusions, and they compound:
1. Shard the vector space across many nodes.
2. Quantize so each shard’s working set fits in RAM economically.
Quantization as the economic enabler
Quantize the codes the index searches over, then rescore the top ~100–500 candidates with full-precision vectors. Quantization makes single-node economics work; rescore buys the recall back.
QPS and compute
- In-memory ANN engines deliver roughly 8K–12K QPS/node at ~0.98 recall@10, single-threaded-per-query with a few-ms budget. Scale QPS by adding read replicas.
- Each query touches O(efSearch) or O(nprobe · list_size) candidate distance computes per shard — cheap individually, but multiplied by fan-out across shards.
- Cross-encoder rerank adds 20–50 ms for 50–200 candidates on a GPU — a separate budget line.
Build and backfill throughput
- HNSW build for 100M vectors = hours; a full IVF rebuild at 1B = days.
- Embedding backfill is bounded by GPU throughput and dollars — re-embedding a billion-doc corpus is a multi-day, real-money job (Step 7).
The build-time numbers are why incremental updates are mandatory — you cannot rebuild from scratch on every write, so you need a segment-based write path with background compaction.
Memory budget lever
Matryoshka truncation (3072 → 512 dims) cuts the RAM footprint ~6x with a modest, measurable recall cost — a multiplicative cost lever stacked on top of quantization.
API design
Keep the surface minimal and make version parity and consistency explicit.
Metric and normalization
Cosine is dot product on L2-normalized vectors — normalize once at ingest and at query so the engine runs the cheaper dot-product kernel. The metric is fixed per collection.
Filter DSL
Version tagging — the migration lever
Every vector carries an embedding_model_version. A query encoded by model v2 is meaningless against a corpus embedded by v1: the spaces are unrelated, and relevance silently collapses. So query encoding happens server-side behind the same embedding service used for ingest, and the collection pins a model version. This single field is what makes the re-embedding migration (Step 7) tractable.
Consistency semantics
Offer read-your-writes for interactive flows (route the writer’s reads to the replica that has the new segment, or block on segment visibility) and eventual for bulk ingest, where a few seconds of staleness is fine and cheaper.
Data model
The system is three coupled stores, all keyed by vector id. A query reads across all three.
1. ANN index — the graph (HNSW) or coarse-quantizer + posting lists (IVF), holding the quantized codes the search traverses. This is the hot, RAM-resident structure.
2. Raw full-precision vector store — fp32/fp16 vectors. Used to rescore ANN candidates and as the source for re-embedding migrations. Can live on SSD/object storage since it’s touched only for the top ~hundreds per query.
3. Metadata/payload store + inverted index — structured fields with an inverted index for filtering and ACL enforcement.
Segment model for writes
Writes land in small, immutable segments (like Lucene). New segments are built and made queryable quickly; background compaction merges small segments into larger ones and rebuilds graph/IVF structures offline, so queries never block on builds. Deletes are tombstones (soft deletes); compaction physically removes them and reclaims graph edges.
Why tombstones matter for recall
Ignoring deletes is a silent recall bug: tombstoned vectors still occupy graph nodes and edges, so a query can spend its candidate budget traversing toward dead entries and return fewer live results than k. Compaction is a recall maintenance task, not just space reclamation.
Sharding key and replica layout
Default to random sharding for balanced fan-out, with read replicas per shard for QPS and failover. Every vector also carries embedding_model_version, which gates dual-index migrations.
High-level architecture
Ingest pipeline
Continuous micro-batching is what makes the embedding tier affordable: it lifts GPU utilization from under 20% (per-request) to over 70%, the difference between a viable and a wasteful backfill.
Query pipeline
Scatter-gather details (the SDE core)
- Scatter the encoded query to all shards (random sharding ⇒ full fan-out).
- Each shard runs filter → ANN over quantized codes → local rescore → returns its partial top-N.
- Gather into a bounded merge heap; tie-break deterministically (by id) so results are stable.
- Partial-failure handling: if a shard times out or dies, return from the survivors with a degraded flag and a coverage estimate rather than failing the whole query — recall degrades gracefully instead of the request erroring.
Control plane
Manages collection/shard/replica lifecycle, autoscaling on QPS and RAM headroom, rebalancing on node loss, and the blue-green / dual-index machinery for builds and migrations. It is off the query hot path.
Hot/cold tiers
Hot shards in RAM serve the latency SLO; cold or rarely-queried partitions live on SSD via DiskANN-style indexes or in object storage, paged in on demand — trading latency for dollars on the long tail.
Deep dives — where Staff is won
WHERE STAFF IS WONDeep dive 1: Index family and the recall-latency curve
HNSW knobs: M (edges per node) and efConstruction set graph quality at build time; efSearch is the runtime recall-latency dial — raise it for higher recall at more distance computes. IVF knobs: nlist (number of coarse cells) at build, nprobe (cells probed per query) at runtime.
For billion-scale corpora that can’t fit RAM economically, DiskANN keeps the graph on SSD and serves at higher-but-acceptable latency. Across all families the pattern is identical: search quantized codes, then rescore the top ~100–500 with full precision to recover the recall quantization gave up.
Trap: serving the final ranking off binary/PQ codes. Quantized distances are fine for the first-pass candidate set but too lossy for ordering — they tank recall. The rescore stage exists precisely to fix this.
Deep dive 2: Quantization + rescore economics
This two-stage shape is the economic heart of the system: PQ/RaBitQ shrink the searchable structure 8–64x so it fits in RAM, and a bounded rescore over a few hundred full-precision vectors restores recall@k to target. The rescore cost is fixed per query (hundreds of vector reads), so it doesn’t scale with corpus size.
Deep dive 3: Filtered search and the selectivity cliff
The hardest interaction in the system. Three strategies:
The cliff: when a filter is very selective (say 0.1% of vectors pass), pre-filtering leaves the HNSW graph so sparse that traversal collapses toward brute force, spiking latency 10–100x. Post-filtering on the same query throws away almost everything and fails to fill k. The fix is adaptive routing: estimate selectivity from the inverted index and choose pre-filter, in-traversal filtering, or brute-force-over-the-filtered-set accordingly. Tenant/ACL filters are exactly the low-selectivity case — so isolation correctness and the latency cliff are the same problem.
Deep dive 4: Incremental updates and compaction
Writes go to fresh segments queryable in seconds; background compaction merges segments and rebuilds graphs without blocking queries. Tombstone deletes are reclaimed during compaction. The tuning tension: compact too rarely and you accumulate many small segments (fan-out within a shard balloons, deleted nodes pollute recall); compact too aggressively and you burn CPU/IO rebuilding graphs. This is the freshness-vs-build-cost knob.
Trap: ignoring deletes. Tombstones that never compact degrade recall over time as queries waste their candidate budget on dead graph nodes.
Deep dive 5: Hybrid retrieval, RRF, and rerank
Dense and lexical retrieval have complementary failure modes: dense embeddings miss exact tokens, rare IDs, and out-of-distribution jargon; BM25/SPLADE miss paraphrase and semantic similarity. Run both and fuse with Reciprocal Rank Fusion:
RRF fuses by rank, not raw scores, so dense cosine and BM25 magnitudes never need calibrating against each other — robust and nearly parameter-free.
When quality justifies it, a cross-encoder reranks the top 50–200 fused candidates, jointly attending to query and document for far better ordering — at +20–50 ms. Reserve it for cases where the recall/precision lift pays for the latency; it does not earn its cost on every query. Late-interaction (ColBERT-style) rerankers sit between bi-encoder speed and cross-encoder quality.
Deep dive 6: Matryoshka dimension truncation
Matryoshka-trained embeddings pack coarse-to-fine information front-loaded across dimensions, so you can truncate 3072 → 512 and still retrieve well. That’s a ~6–12x RAM cut, multiplicative with quantization. A common pattern: search on truncated dims, rescore on full dims — dimension truncation as its own first-pass/rescore split.
Deep dive 7: Measuring recall
Recall is meaningless asserted; it must be measured against exact kNN. Sample real queries, compute brute-force exact top-k offline as ground truth, and track recall@k per slice (per tenant, per language, per filter pattern). Wire this into the eval harness and the production monitor — QPS-at-recall, not QPS alone.
Trap: model/version skew. If ingest and query encoders drift even one version apart, relevance silently collapses while every systems metric looks green. Version parity is a correctness invariant, enforced by encoding queries through the same pinned model.
Multi-team rollout
Index builds and blue-green swaps
Build new index segments offline, validate recall against sampled exact-kNN ground truth, then blue-green swap the serving index atomically — old index stays warm as instant rollback. Never mutate the live serving structure in place during a rebuild.
The hard one: re-embedding migration
When the embedding model upgrades (v1 → v2), the entire corpus must be re-embedded — the spaces are incompatible, so a mixed-version index returns garbage. At billion scale this is a multi-day, real-money operation:
1. Dual-index: stand up the v2 index alongside v1; backfill by re-embedding from the raw vector store / source docs (the reason we keep full-precision vectors).
2. Shadow-read: mirror live query traffic to v2, encoding each query with v2, and compare recall/relevance offline — no user impact.
3. Cutover: once v2 meets the recall SLO, flip traffic (canary by tenant/percentage), keeping v1 warm for rollback until confident.
The embedding_model_version field on every vector and the pinned per-collection model are what make each stage safe and auditable.
Monitoring
Failover and fairness
Replica failover within a shard; rebalancing on node loss (with the drift caveat for clustered sharding); per-tenant quotas and fairness so one tenant’s burst or low-selectivity filter can’t starve others.
Bottlenecks & evolution
Where it breaks
- Fan-out cost at many shards — full scatter-gather to hundreds of shards dominates tail latency. Mitigation: coarse-centroid query routing to probe only the shards likely to hold neighbors — at the cost of drift and rebalancing complexity, and a recall risk if routing is wrong.
- RAM cost — the dominant dollar line. Push to SSD/DiskANN plus aggressive quantization and Matryoshka truncation; accept higher latency on cold partitions.
- Filter cliffs — low-selectivity filters collapsing toward brute force. Mitigation: adaptive pre/post/in-traversal routing driven by selectivity estimates.
- Freshness vs build cost — frequent compaction is expensive; rare compaction hurts recall. Mitigation: in-place / fresh-update indexes (SPFresh-style) to narrow the gap.
- Re-embedding cost — re-encoding billions on every model bump. Mitigation: Matryoshka (cheaper to store/migrate) and distilled/smaller encoders.
- Rerank latency — cross-encoders are expensive per query. Mitigation: late-interaction (ColBERT-style) and learned-sparse retrieval that move quality earlier and cheaper.
Trends to flag
- RaBitQ / 1-bit quantization adoption for first-pass candidate generation (always with rescore).
- GPU-resident indexes for very high QPS at fixed recall.
- Disaggregated storage — compute/index separated from a cold raw-vector tier for elastic scaling and cheaper cold capacity.
Summary
1. The spine is a sharded, quantized, incrementally-updated ANN index fronted by scatter-gather serving — a read-heavy, replicated datastore with a segment write path and background compaction, much like a search engine.
2. Recall is a first-class SLO with an error budget, measured against sampled exact-kNN — not an assumption. Always quote QPS-at-a-fixed-recall; a QPS number without a recall is meaningless.
3. Quantization + full-precision rescore is the economic enabler: compress codes 8–64x so the index fits in RAM, then rescore the top hundreds to win recall back. Capacity math (1B fp32 ≈ 4–5 TB) forces both sharding and quantization.
4. Design filtered and hybrid retrieval as first-class: handle the selectivity cliff with adaptive routing (it’s the same problem as ACL/tenant isolation), and fuse dense + sparse via RRF because their failure modes are complementary.
5. Have a real re-embedding migration story — dual-index, shadow-read, cutover — backed by an embedding_model_version field and strict ingest/query version parity, the invariant whose violation silently destroys relevance.
6. Role split: the SDE owns the systems spine (router, scatter-gather, compaction, storage hierarchy); the MLE owns the recall/quantization frontier and the embedding lifecycle. A switcher should lead with the partitioned-datastore design they already know and flag embedding-model/dimension and the recall operating point as the two ML calls to validate — not bluff the ANN math.
Rubric — Senior vs Staff
Want more breakdowns like this?
Join free early access for upcoming RAG, LLM eval, agents, and AI infrastructure walkthroughs.