
What is Multi-Vector Search?

Multi-vector search is a retrieval technique where each document is represented by multiple vectors (one per token or passage) rather than a single fixed-size vector. At query time, the query’s token vectors are compared against all document token vectors, enabling fine-grained, token-level matching that captures relevance signals single-vector retrieval can miss.


Why does multi-vector search matter?

Single-vector retrieval compresses an entire document into one vector, losing fine-grained detail in the process. A query about a specific clause in a legal contract, or a precise technical term in a research paper, may not match well against a document-level summary vector, even if the exact answer is present in the document.

Multi-vector search solves this by preserving token-level representations. The matching happens at the token level, so a specific query term can find its exact counterpart in a long document, even if the overall document is only partially relevant.
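The dilution effect can be seen in a toy example. The sketch below assumes mean pooling as the single-vector strategy (real models may pool differently) and uses made-up 3-dimensional vectors, not real model output:

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(token_vecs: List[Vector]) -> Vector:
    """Average token vectors into one document-level vector (assumed pooling strategy)."""
    n = len(token_vecs)
    return [sum(column) / n for column in zip(*token_vecs)]


# Hypothetical document with four tokens; only the third matches the query term.
doc_tokens = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
query_token = [0.0, 0.0, 1.0]

pooled_score = dot(query_token, mean_pool(doc_tokens))
token_level_score = max(dot(query_token, v) for v in doc_tokens)

print(pooled_score)       # 0.25: the matching token is averaged away
print(token_level_score)  # 1.0: the exact counterpart is still found
```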


How does multi-vector search work?

Instead of pooling token representations into one vector:

  1. Encode document → retain one vector per token: [v₁, v₂, ..., vₙ]
  2. Encode query → retain one vector per token: [q₁, q₂, ..., qₘ]
  3. Score with MaxSim → for each query token, find its maximum similarity across all document tokens, then sum:
     Score(Q, D) = Σᵢ maxⱼ (qᵢ · vⱼ)

This is the ColBERT scoring mechanism. Every query token gets matched to its best corresponding document token, and these scores are summed into a final relevance score.
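A minimal sketch of the MaxSim scoring step, in plain Python. The function names and the toy 2-dimensional vectors are illustrative assumptions; a production system would use batched matrix operations on real model output:

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, v) for v in doc_vecs) for q in query_vecs)


# Made-up token vectors for one query and two documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains an exact match for each query token
doc_b = [[0.5, 0.5], [0.75, 0.25]]            # only partial matches

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.25
```

The extra token in doc_a does not lower its score: MaxSim only looks at the best match per query token, which is why a long, partially relevant document can still rank well.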


Multi-vector vs single-vector vs sparse retrieval

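|                   | Single-vector | Multi-vector (ColBERT) | Sparse (BM25)     |
|-------------------|---------------|------------------------|-------------------|
| Vectors per doc   | 1             | N (one per token)      | Vocab-size sparse |
| Captures semantics| ✓             | ✓ (token-level)        | ✗                 |
| Handles exact terms | ✗           | ✓                      | ✓                 |
| Storage cost      | Low           | High                   | Medium            |
| Retrieval speed   | Fastest       | Slower                 | Fast              |
| Accuracy          | Good          | Highest                | Good for keywords |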

Multi-vector retrieval typically achieves the highest accuracy of the three, but at significant storage cost: a 512-token document produces 512 vectors instead of 1.


What is BGE-M3’s multi-vector capability?

BGE-M3 is notable for supporting all three retrieval modes (dense, sparse, and multi-vector) from a single model. This means you can produce ColBERT-style multi-vector representations without a separate model:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Example corpus; replace with your own documents
documents = [
    "Multi-vector search keeps one vector per token.",
    "BM25 is a sparse lexical retrieval method.",
]

# Encode with dense, sparse, and multi-vector (ColBERT-style) outputs in one call
results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["dense", "sparse", "multivector"],
)

dense_vectors = [r["dense"] for r in results]
sparse_vectors = [r["sparse"] for r in results]
colbert_vectors = [r["multivector"] for r in results]  # one [num_tokens, dim] array per doc

You can then combine all three signals for maximum retrieval accuracy, the approach used in BGE-M3’s MIRACL and BEIR benchmark results.
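One common way to combine the signals is a weighted sum of the per-mode scores. The sketch below is an illustration under assumptions: the weights, document IDs, and scores are made up, and it assumes each score has already been normalised to a comparable range:

```python
from typing import Dict, List, Tuple

# (dense, sparse, multivector) scores per candidate; hypothetical values.
Scores = Tuple[float, float, float]


def hybrid_score(scores: Scores, weights: Scores = (0.4, 0.2, 0.4)) -> float:
    """Weighted sum of dense, sparse, and multi-vector scores (weights are illustrative)."""
    return sum(w * s for w, s in zip(weights, scores))


def rank(candidates: Dict[str, Scores]) -> List[str]:
    """Return document IDs ordered from highest to lowest hybrid score."""
    return sorted(candidates, key=lambda doc_id: hybrid_score(candidates[doc_id]), reverse=True)


candidates = {
    "doc_a": (0.82, 0.10, 0.91),
    "doc_b": (0.85, 0.05, 0.70),
    "doc_c": (0.40, 0.60, 0.55),
}

ranked = rank(candidates)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

Here doc_b has the best dense score, but doc_a wins overall because its token-level match is much stronger. In practice the weights would be tuned on a validation set for your corpus.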


Multi-vector retrieval is worth the extra storage and compute when:

  • High-precision retrieval is critical: legal, medical, or compliance document search where missing a relevant clause has real consequences
  • Long documents: single vectors compress too much information out of long texts; token-level matching preserves it
  • Specific term lookup: when queries contain precise technical terms that need exact matching alongside semantic understanding
  • You’re combining with reranking: use multi-vector for first-stage retrieval to maximise recall, then a reranker for precision

For most general-purpose search, single-vector with a reranker achieves comparable quality at lower infrastructure cost.


Storage considerations for multi-vector

A 512-token document produces 512 vectors of 128 dimensions each (ColBERT uses smaller per-token dimensions). For 1 million documents:

  • Single-vector (768 dims, float32): ~3GB
  • Multi-vector ColBERT (512 tokens × 128 dims, float32): ~262GB

This is why multi-vector is used selectively, often for a high-value subset of your corpus, with single-vector covering the rest.
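The arithmetic behind these estimates can be reproduced with a short calculation. This sketch counts raw vector bytes only, assuming float32 (4 bytes per value) and decimal gigabytes; it ignores index overhead and any compression or quantisation:

```python
def index_size_bytes(num_docs: int, vectors_per_doc: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_docs * vectors_per_doc * dims * bytes_per_value


NUM_DOCS = 1_000_000

single = index_size_bytes(NUM_DOCS, vectors_per_doc=1, dims=768)
multi = index_size_bytes(NUM_DOCS, vectors_per_doc=512, dims=128)

print(f"single-vector: {single / 1e9:.2f} GB")  # single-vector: 3.07 GB
print(f"multi-vector:  {multi / 1e9:.2f} GB")   # multi-vector:  262.14 GB
print(f"ratio: {multi / single:.1f}x")          # ratio: 85.3x
```

Storing values as float16 (bytes_per_value=2) would halve both figures but leave the ratio unchanged.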

Qdrant and Weaviate both support multi-vector indexing natively.


Frequently asked questions

Is multi-vector search the same as ColBERT?
ColBERT is the most prominent multi-vector retrieval architecture. Multi-vector search is the broader category; ColBERT is one implementation using late interaction (MaxSim scoring).

Can I use multi-vector retrieval with any vector database?
Not all vector databases support multi-vector natively. Qdrant supports it via multi-vectors. Weaviate has ColBERT support. Check your vector DB’s documentation before committing to a multi-vector approach.

Does SIE support multi-vector encoding?
Yes. BGE-M3 on SIE can return ColBERT-style token vectors alongside dense and sparse representations in a single encode call.

