Why did we open-source our inference engine? Read the post
← All Glossary Articles

What is Approximate Nearest Neighbour (ANN) Search?

Approximate Nearest Neighbour (ANN) search finds vectors that are close to a query vector in high-dimensional space — not necessarily the exact closest, but close enough — in milliseconds rather than the seconds or minutes exact search would require. It is the core retrieval algorithm inside every vector database and enables semantic search to scale to millions or billions of documents.


Why is approximate search necessary?

Exact nearest neighbour search — comparing the query vector to every stored vector — has O(n) complexity. For 10 million 768-dimensional vectors, this requires ~30 billion floating point operations per query. At millisecond latency targets, this is infeasible.

ANN algorithms trade a small amount of recall (occasionally missing a relevant result) for a massive reduction in search time — typically achieving 95–99% of exact recall at 10–100× the speed. For semantic search, this trade-off is almost always worthwhile: missing 2% of results is imperceptible to users, while 100ms vs 10 seconds of latency is not.


How do ANN algorithms work?

There are three main families of ANN algorithms:

HNSW (Hierarchical Navigable Small World)

The most widely used algorithm in production vector databases. Builds a multi-layer graph where each node (vector) connects to its nearest neighbours. Search navigates from coarse to fine layers:

  • Start at the top layer (sparse graph, long jumps)
  • Navigate to the nearest node
  • Drop to a lower layer and repeat with finer resolution
  • Return approximate nearest neighbours from the bottom layer

Pros: Excellent recall (95–99%), fast query time, supports incremental inserts. Cons: High memory usage (stores the full graph structure). Used by: Qdrant, Weaviate, Milvus, pgvector.

IVF (Inverted File Index)

Clusters vectors into nlist centroids (using k-means). At query time, search only the nprobe nearest clusters:

Full corpus → k-means clustering → nlist clusters
Query → find nprobe nearest centroids → search only those clusters

Pros: Lower memory than HNSW, good for very large corpora. Cons: Recall depends on nprobe (higher = more accurate, slower); requires batch indexing. Used by: FAISS, Milvus.

Product Quantisation (PQ)

Compresses vectors by splitting them into sub-vectors and quantising each to a codebook. Reduces memory by 8–32× at the cost of some accuracy.

Often combined with IVF (IVF-PQ) for billion-scale retrieval.


Key ANN parameters and their trade-offs

HNSW parameters:

ParameterEffectDefault recommendation
ef_constructionIndex quality (higher = better recall, slower build)128–200
mConnections per node (higher = better recall, more memory)16–32
ef (search)Search quality (higher = better recall, slower query)64–128

IVF parameters:

ParameterEffectRecommendation
nlistNumber of clusterssqrt(n)
nprobeClusters searched per query10–50 for 95%+ recall

Recall vs latency trade-off

The fundamental ANN trade-off: higher recall requires searching more of the index, which takes longer.

Typical operating points for HNSW:

ef (search)Recall@10Query latency
32~92%Very fast
64~96%Fast
128~98%Medium
256~99.5%Slower

For most production search and RAG systems, 96–98% recall is the target. Recall drops below this start to affect user-perceived quality.


ANN search in the SIE + vector DB pipeline

[SIE: BGE-M3] → 768-dim vectors
[Vector DB: HNSW index]
Query vector → ANN search → top-K candidates
[SIE: Reranker] → precise re-scoring
Final ranked results

SIE produces the vectors; the vector database handles ANN indexing and retrieval. The reranker then improves precision over the approximate results returned by ANN.


Most production queries combine vector similarity with metadata filters: “find similar documents, but only from the legal department, from the last 90 days.”

There are two approaches:

Pre-filtering — filter the metadata first, then run ANN over the subset. Fast but loses ANN performance on small subsets.

Post-filtering — run ANN over the full index, then filter results. May return fewer than k results if many are filtered out.

Hybrid filtering (Qdrant’s approach) — estimate filter selectivity and choose the best strategy dynamically. Most production vector databases now implement this.


Frequently asked questions

How much recall do you lose with ANN vs exact search? At typical production settings, HNSW achieves 95–99% of exact recall. The 1–5% miss rate is usually imperceptible in search quality, especially when combined with a reranker that corrects ordering among retrieved results.

Is FAISS a vector database? FAISS (Facebook AI Similarity Search) is an ANN library, not a full database. It provides the indexing algorithms but lacks persistence, CRUD operations, metadata filtering, and distributed deployment. Qdrant and Milvus use FAISS-style algorithms under the hood but add the full database layer.

Can ANN search handle billion-scale corpora? Yes, with IVF-PQ or hierarchical clustering approaches. Qdrant, Milvus, and Pinecone all support billion-vector deployments. At this scale, vector quantisation (reducing memory per vector) becomes essential.


Self-hosted inference for search & document processing

Cut API costs by 50x, boost quality with 85+ SOTA models, and keep your data in your own cloud.

Github 2.0K

Contact us

Tell us about your use case and we'll get back to you shortly.