What is ColBERT?

ColBERT (Contextualized Late Interaction over BERT) is a neural retrieval model that encodes queries and documents into token-level vector representations and scores their relevance using a MaxSim operation that finds the best-matching document token for each query token, then sums those scores. It achieves near cross-encoder accuracy at a fraction of the latency by pre-computing document representations offline.


Why was ColBERT developed?

Before ColBERT, neural retrieval faced a fundamental trade-off:

  • Bi-encoders were fast enough for production but lost accuracy by compressing meaning into a single vector
  • Cross-encoders were accurate but required processing every (query, document) pair at query time, which is too slow for retrieval over millions of documents

ColBERT, introduced by Khattab and Zaharia at Stanford in 2020, broke this trade-off. By deferring interaction to a lightweight MaxSim operation on pre-computed token vectors, it achieves accuracy close to cross-encoders while remaining scalable.


How does ColBERT score relevance?

ColBERT uses MaxSim (Maximum Similarity) scoring:

Score(Q, D) = Σᵢ max_j (qᵢ · dⱼ)

Concretely:

  1. Query Q is encoded into m token vectors: [q₁, ..., qₘ]
  2. Document D is encoded into n token vectors: [d₁, ..., dₙ]
  3. For each query token qᵢ, find the document token dⱼ with the highest dot-product similarity (ColBERT L2-normalizes its token vectors, so this equals cosine similarity)
  4. Sum all the maximum similarities

This means every query term is matched to its best counterpart in the document. A precise technical term in the query can still find its exact match in the document, even when a single pooled document vector would be only moderately similar to the query.
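
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ColBERT implementation: the helper names (`dot`, `maxsim_score`) and the toy 2-dimensional vectors are assumptions made for the example, and a real system would run this as a batched matrix operation.

```python
from typing import Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token vectors (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains an exact match for each query token
doc_b = [[0.6, 0.6], [0.5, 0.5]]              # only loosely related to either query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

Because each query token picks its own best match, `doc_a` wins even though it also contains a token that matches neither query token well.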


ColBERT architecture details

Token dimension compression: ColBERT projects token representations from 768 dims (BERT hidden size) down to 128 dims. This reduces storage by 6× while retaining most retrieval quality.
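
As a rough sketch of that projection step: each 768-dim hidden state is multiplied by a 128×768 matrix and the result is scaled to unit length. The matrix below is random purely for illustration (in the real model it is learned), and the names `projection` and `project_token` are hypothetical.

```python
import math
import random

HIDDEN_DIM = 768   # BERT-base hidden size
COLBERT_DIM = 128  # ColBERT token embedding size

rng = random.Random(0)
# Stand-in for the learned projection matrix: random here, learned in the real model.
projection = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
              for _ in range(COLBERT_DIM)]

def project_token(hidden):
    """Map one 768-dim hidden state to a unit-length 128-dim vector."""
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in projection]
    norm = math.sqrt(sum(v * v for v in projected)) or 1.0
    return [v / norm for v in projected]

hidden_state = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
token_vec = project_token(hidden_state)

# 768 / 128 = 6, which is where the 6x storage reduction comes from.
print(len(token_vec), HIDDEN_DIM // COLBERT_DIM)
```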

Query augmentation: query tokens are padded with [MASK] tokens to a fixed length (32 by default). These masked tokens learn to represent latent query aspects not explicitly stated.
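
The padding itself is simple; a minimal sketch, assuming the query is already tokenized (the token strings and the `augment_query` helper are illustrative, not taken from any particular library):

```python
QUERY_MAXLEN = 32  # default fixed query length mentioned above
MASK = "[MASK]"

def augment_query(tokens, maxlen=QUERY_MAXLEN):
    """Truncate or pad a tokenized query to exactly `maxlen` tokens using [MASK]."""
    tokens = tokens[:maxlen]
    return tokens + [MASK] * (maxlen - len(tokens))

tokens = ["[CLS]", "indemnification", "clause", "[SEP]"]
padded = augment_query(tokens)
print(len(padded), padded.count(MASK))
```

Every padded position still passes through the encoder, so each [MASK] slot yields its own token vector that takes part in MaxSim scoring.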

Document compression: punctuation and other low-information tokens are filtered before storing, reducing document vector counts by ~10-15%.
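
A simplified sketch of that filtering step, assuming each document token comes paired with its vector. Here a token is dropped when it consists only of punctuation characters; the helper names and the placeholder vectors are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)

def keep_token(token):
    """Return False for tokens made up entirely of punctuation characters."""
    return not all(ch in PUNCTUATION for ch in token)

def filter_doc_vectors(tokens, vectors):
    """Keep only the (token, vector) pairs whose token carries content."""
    return [(t, v) for t, v in zip(tokens, vectors) if keep_token(t)]

doc_tokens = ["the", "licensee", ",", "shall", "indemnify", "."]
doc_vectors = [[float(i)] for i in range(len(doc_tokens))]  # placeholder vectors

kept = filter_doc_vectors(doc_tokens, doc_vectors)
print([t for t, _ in kept])
```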

Training objective: ColBERT is trained with pairwise softmax cross-entropy loss over (query, positive, negative) triples. ColBERT v2 uses knowledge distillation from a cross-encoder plus hard negative mining for substantially improved accuracy.
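
For a single triple, the pairwise softmax cross-entropy reduces to the negative log-probability that the positive document wins a two-way softmax over the two MaxSim scores. A small sketch under that reading (the function name is an assumption, and the scores are made-up numbers):

```python
import math

def pairwise_softmax_ce(pos_score, neg_score):
    """-log softmax([pos, neg])[0], computed with a stable log-sum-exp."""
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score

# The loss shrinks as the positive document outscores the negative one.
loss_good = pairwise_softmax_ce(pos_score=10.0, neg_score=2.0)
loss_tied = pairwise_softmax_ce(pos_score=5.0, neg_score=5.0)
loss_bad = pairwise_softmax_ce(pos_score=2.0, neg_score=10.0)
print(loss_good < loss_tied < loss_bad)
```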


ColBERT v1 vs ColBERT v2

                     ColBERT v1       ColBERT v2
Training             Pairwise loss    Distillation + hard negatives
Indexing             Flat index       PLAID (centroid compression)
Storage efficiency   Moderate         Much better
Accuracy (BEIR)      Good             Significantly better
Production-ready                      ✓ (preferred)

ColBERT v2 with the PLAID indexing engine is the current production standard.


How does BGE-M3 implement ColBERT-style retrieval?

BGE-M3 incorporates a multi-vector retrieval head that produces ColBERT-compatible token vectors alongside its dense and sparse outputs. This means you don’t need a separate ColBERT model: a single BGE-M3 inference call returns all three representations:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# One call returns dense, sparse, and multi-vector outputs for the same text.
result = client.encode(
    "BAAI/bge-m3",
    Item(text="indemnification clause in software contracts"),
    output_types=["dense", "sparse", "multivector"],
    is_query=True,
)

# Multi-vector (ColBERT-style) token embeddings: shape [num_tokens, dim]
mv = result["multivector"]
print(f"Query encoded to {mv.shape[0]} token vectors of dim {mv.shape[1]}")

ColBERT vs dense retrieval vs reranker

                         Dense retrieval   ColBERT           Cross-encoder reranker
Vectors per item         1                 N (token-level)   None (processes pairs)
Pre-computable           ✓                 ✓                 ✗
First-stage retrieval    ✓                 ✓                 ✗
Second-stage reranking                     Possible          ✓
Storage cost             Low               High              None
Accuracy                 Good              High              Highest

The recommended production pipeline for maximum accuracy: ColBERT first-stage retrieval → cross-encoder reranker → LLM generation (RAG).
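
The retrieve-then-rerank part of that pipeline can be sketched generically. Everything below is a stand-in: `retrieve_then_rerank` is a hypothetical helper, and the two word-overlap scorers only play the roles of the first-stage scorer (MaxSim over an index in practice) and the cross-encoder.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score,
                         k_retrieve=100, k_final=5):
    """Cheap first-stage scoring over the corpus, then costly reranking of the top hits."""
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:k_retrieve]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:k_final]

# Toy scorers standing in for the real models (assumptions for illustration only).
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

def overlap_ratio(query, doc):
    return overlap(query, doc) / len(doc.split())

corpus = [
    "indemnification clause in software contracts",
    "software license indemnification terms and many other unrelated contract topics",
    "recipe for sourdough bread",
]
query = "indemnification clause software"

top_docs = retrieve_then_rerank(query, corpus, overlap, overlap_ratio,
                                k_retrieve=2, k_final=1)
print(top_docs)
```

The point of the split is cost: the expensive scorer only ever sees `k_retrieve` candidates, never the whole corpus. The documents in `top_docs` would then be passed to the LLM as context.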


Practical considerations

When ColBERT is worth it:

  • High-stakes retrieval where missing a relevant document has real cost (legal, medical, compliance)
  • Long documents where single-vector compression loses too much detail
  • Queries with precise technical terms requiring exact matching

When to skip ColBERT:

  • General-purpose search where dense + reranker is sufficient
  • Storage-constrained environments (ColBERT uses 20-80× more storage than dense)
  • Real-time encoding requirements (per-token vectors are larger to transfer)
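
To get a feel for the storage multiplier, here is a back-of-the-envelope calculation. The sizes are assumptions chosen for illustration (128-dim token vectors, a 1024-dim dense vector, the same numeric precision for both, no index compression); actual ratios depend on your models and index settings.

```python
def storage_ratio(tokens_per_doc, colbert_dim=128, dense_dim=1024):
    """Uncompressed multi-vector storage relative to one dense vector per document."""
    return tokens_per_doc * colbert_dim / dense_dim

for tokens in (160, 320, 640):
    print(f"{tokens} tokens per document -> {storage_ratio(tokens):.0f}x dense storage")
```

Under these assumptions the ratio is simply the token count divided by 8, so longer documents cost proportionally more.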

Frequently asked questions

Is ColBERT a retrieval model or a reranker?

It can function as both, but its primary use is as a first-stage retrieval model that scales to millions of documents. Its accuracy is high enough that adding a reranker on top of ColBERT retrieval yields smaller gains than adding one on top of dense retrieval.

What is the PLAID indexing engine?

PLAID (Performance-optimised Late Interaction Driver) compresses ColBERT document vectors using centroid clustering, reducing storage by 3-5× and retrieval latency by 4-8×. It’s the standard indexing approach for ColBERT v2 at scale.
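
A toy sketch of the centroid idea behind this kind of compression: store the ID of the nearest centroid plus a coarsely quantized residual instead of the full vector. This is not the PLAID code; the centroids, the four quantization levels, and the function names are all assumptions made for the example.

```python
# 4 quantization levels = 2 bits per dimension (values assumed for illustration).
LEVELS = [-0.225, -0.075, 0.075, 0.225]

def compress(vec, centroids):
    """Return (nearest centroid id, per-dimension residual codes)."""
    cid = min(range(len(centroids)),
              key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, centroids[i])))
    residual = [v - c for v, c in zip(vec, centroids[cid])]
    codes = [min(range(len(LEVELS)), key=lambda k: abs(r - LEVELS[k]))
             for r in residual]
    return cid, codes

def decompress(cid, codes, centroids):
    """Approximate the original vector from its centroid and residual codes."""
    return [c + LEVELS[k] for c, k in zip(centroids[cid], codes)]

centroids = [[1.0, 0.0], [0.0, 1.0]]  # in practice learned by clustering token vectors
vec = [0.9, 0.2]

cid, codes = compress(vec, centroids)
approx = decompress(cid, codes, centroids)
print(cid, codes, approx)
```

The stored form is one small integer plus two bits per dimension, and the reconstruction stays close to the original vector.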

Does Qdrant support ColBERT retrieval?

Qdrant supports multi-vector storage and can be used to implement MaxSim-style retrieval. See the Qdrant integration guide for implementation details.

