
What is Late Interaction Retrieval?

Late interaction retrieval is a neural search architecture where the query and document are encoded independently into token-level vectors, and their similarity is computed at retrieval time using a fine-grained token matching function (typically MaxSim). It sits between bi-encoders (fast but less accurate) and cross-encoders (accurate but slow), offering a strong accuracy-speed trade-off for high-precision retrieval.


Why does late interaction matter?

There are two extremes in neural retrieval:

  • Bi-encoders compress query and document into single vectors independently, then compare with a dot product. Fast, scalable, but loses token-level detail.
  • Cross-encoders (rerankers) process the query and document together, attending to every token pair. Highly accurate but too slow for first-stage retrieval over millions of documents.

Late interaction is the middle ground: encode query and document independently (enabling pre-computation of document vectors), but score them with a richer token-level interaction function rather than a single dot product.


How does late interaction work?

The late interaction mechanism:

  1. Encode query → m token vectors: Q = [q₁, q₂, ..., qₘ]
  2. Encode document → n token vectors: D = [d₁, d₂, ..., dₙ]
  3. Score with MaxSim (ColBERT’s approach):
Score(Q, D) = Σᵢ₌₁ᵐ max_{j=1..n} (qᵢ · dⱼ)

For each query token, find the most similar document token, then sum these maximum similarities across all query tokens. Every query term is matched to its best counterpart in the document, which preserves term-level matches that a single-vector dot product averages away.
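The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than ColBERT's implementation: the 2-dimensional token vectors are made up, and a real system would run the same computation as a batched matrix operation.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token vectors (invented for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # has a match for the first token only

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because each query token contributes its own maximum, doc_a scores higher than doc_b: the second query token finds no good counterpart in doc_b.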

The key insight: document token vectors can be pre-computed and stored (enabling ANN-style retrieval), but the interaction function is richer than a single dot product.


Early interaction vs late interaction vs no interaction

Architecture | Interaction point | Accuracy | Speed | Scalable
Bi-encoder | None (single vector) | Good | Fastest | ✓
Late interaction (ColBERT) | At retrieval (token MaxSim) | High | Medium | ✓
Cross-encoder | Inside model (attention) | Highest | Slow | ✗ (rerank only)

“Late” refers to where the query-document interaction happens: after independent encoding, at retrieval time. “Early” would mean interaction inside the encoder (cross-encoder style).


ColBERT: the dominant late interaction model

ColBERT (Contextualized Late Interaction over BERT) is the most widely deployed late interaction architecture. Key design choices:

  • Per-token vectors compressed to 128 dimensions (smaller than typical 768-dim dense vectors) to manage storage
  • MaxSim scoring as described above
  • Query tokens are augmented with [MASK] tokens to expand query representation
  • Document vectors are pre-computed and stored in a specialised index (PLAID)
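The query augmentation step in the list above can be illustrated as simple padding. This is a sketch of the idea only: augment_query is a hypothetical helper, it works on string tokens rather than BERT wordpiece ids, it leaves out the special marker tokens a real tokenizer adds, and the default length of 32 should be treated as an assumption.

```python
from typing import List

MASK = "[MASK]"


def augment_query(tokens: List[str], max_len: int = 32) -> List[str]:
    """Truncate or pad a tokenised query to exactly max_len tokens using [MASK]."""
    trimmed = tokens[:max_len]
    return trimmed + [MASK] * (max_len - len(trimmed))


padded = augment_query(["breach", "of", "contract"], max_len=8)
print(padded)
```

Each [MASK] position still receives a contextualised vector from the encoder, so the padded query contributes extra terms to the MaxSim sum.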

ColBERT v2 improved training with distillation from a cross-encoder and hard negative mining, substantially closing the gap with cross-encoder accuracy.


Late interaction in practice with SIE

BGE-M3 implements late interaction via its multi-vector output mode. The per-token vectors it returns use ColBERT-style 128-dim compression:

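from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

# Placeholder corpus; substitute your own document texts
documents = [
    "The supplier shall indemnify the buyer for losses caused by a breach of contract.",
    "Either party may terminate this agreement with 30 days' written notice.",
]

# Get token-level vectors for late interaction
doc_results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["multivector"],
)
colbert_vecs = [r["multivector"] for r in doc_results]
# Each array has shape [num_tokens, 128]

# Encode the query the same way, flagged as a query
query_result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract indemnification"),
    output_types=["multivector"],
    is_query=True,
)
query_mv = query_result["multivector"]

# Score the query against the document multivectors with MaxSim
scores = maxsim(query_mv, colbert_vecs)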

At query time, the MaxSim scoring is computed in your vector database (Qdrant supports this natively) or in your retrieval layer.


Storage and infrastructure considerations

Late interaction’s main cost is storage. For 1M documents averaging 256 tokens each, at 128 dims per token (float16):

1,000,000 × 256 tokens × 128 dims × 2 bytes = ~65GB

vs single-vector at 768 dims:

1,000,000 × 768 dims × 4 bytes = ~3GB
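The two estimates above can be reproduced with a small helper, which also makes it easy to plug in your own corpus size. index_size_bytes is a hypothetical name, and the figures cover raw vector storage only, not index overhead.

```python
def index_size_bytes(num_docs: int, vectors_per_doc: int, dims: int, bytes_per_value: int) -> int:
    """Raw vector storage in bytes, ignoring any index or metadata overhead."""
    return num_docs * vectors_per_doc * dims * bytes_per_value


# Late interaction: 256 token vectors per document, 128 dims, float16 (2 bytes)
late = index_size_bytes(1_000_000, 256, 128, 2)

# Single-vector: one 768-dim float32 (4 bytes) vector per document
single = index_size_bytes(1_000_000, 1, 768, 4)

ratio = late / single
print(f"late interaction: {late / 1e9:.1f} GB")
print(f"single-vector:    {single / 1e9:.1f} GB")
print(f"ratio:            {ratio:.1f}x")
```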

Strategies to manage this:

  • Use token pruning: drop less informative tokens (punctuation, stopwords) before indexing
  • Apply to a subset of the corpus: use late interaction for high-value documents, single-vector for the rest
  • Use quantisation: store vectors in int8 instead of float16, which halves the late interaction figure above
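As a rough illustration of the quantisation strategy, the sketch below applies symmetric per-vector scalar quantisation: each float is mapped to a signed 8-bit integer, and one float scale is kept per vector for reconstruction. This is one simple scheme chosen for illustration, not necessarily what any particular vector database does, and the function names are hypothetical.

```python
from array import array
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[array, float]:
    """Map floats to integers in [-127, 127] plus a single float scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Typecode "b" stores each value as a signed 8-bit integer (1 byte).
    codes = array("b", [round(x / scale) for x in vec])
    return codes, scale


def dequantize_int8(codes: array, scale: float) -> List[float]:
    """Approximate reconstruction of the original vector."""
    return [c * scale for c in codes]


vec = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(list(codes), max_error)
```

The reconstruction error per value is bounded by half the scale, so vectors with a small dynamic range lose very little precision.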

Frequently asked questions

Is late interaction better than reranking? Late interaction is a first-stage retrieval method that scales to millions of documents. Reranking (cross-encoder) is a second-stage method that processes a small shortlist. For maximum accuracy, use late interaction for first-stage retrieval, then a cross-encoder reranker for the top 20-50 results.
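The two-stage setup described in that answer might look like the sketch below. It is an outline under stated assumptions: the first stage scans the corpus by brute force where a real system would use an index, and rerank_score is a hypothetical callable standing in for a cross-encoder that returns a relevance score for a document index.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
MultiVector = Sequence[Vector]


def maxsim(query_vecs: MultiVector, doc_vecs: MultiVector) -> float:
    """Sum over query tokens of the best dot product with any document token."""
    return sum(
        max(sum(x * y for x, y in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


def two_stage_search(
    query_vecs: MultiVector,
    corpus: Sequence[MultiVector],
    rerank_score: Callable[[int], float],
    shortlist_size: int = 50,
    top_k: int = 10,
) -> List[int]:
    """Shortlist by MaxSim, then reorder the shortlist with the reranker."""
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: maxsim(query_vecs, corpus[i]),
        reverse=True,
    )[:shortlist_size]
    return sorted(shortlist, key=rerank_score, reverse=True)[:top_k]


# Toy data: three single-token documents and made-up reranker scores.
query = [[1.0, 0.0]]
corpus = [[[1.0, 0.0]], [[0.9, 0.1]], [[0.0, 1.0]]]
fake_reranker = {0: 0.2, 1: 0.8, 2: 0.99}

result = two_stage_search(query, corpus, fake_reranker.get, shortlist_size=2, top_k=2)
print(result)
```

Document 2 has the highest reranker score but never reaches the second stage, which shows why first-stage recall sets the ceiling for the whole pipeline.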

Does late interaction work with standard vector databases? Only if the database supports multi-vector storage and MaxSim scoring, as Qdrant does. A standard single-vector ANN index cannot score multi-vector documents directly, so you would need to store each token vector as its own entry and implement MaxSim on top in your retrieval layer.
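One way to implement MaxSim on top of a single-vector index is to index every document token vector as its own row, use per-token nearest-neighbour search to collect candidate documents, and then score only those candidates exactly. The sketch below uses a brute-force scan as a stand-in for the ANN lookup; all names are hypothetical and the vectors are toy data.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_token_index(corpus: Dict[str, List[Vector]]) -> List[Tuple[str, Vector]]:
    """Flatten the corpus into (doc_id, token_vector) rows, one per token."""
    return [(doc_id, vec) for doc_id, vecs in corpus.items() for vec in vecs]


def search(
    query_vecs: List[Vector],
    corpus: Dict[str, List[Vector]],
    token_index: List[Tuple[str, Vector]],
    per_token_k: int = 2,
) -> List[Tuple[str, float]]:
    # Stage 1: for each query token, fetch its nearest token rows
    # (brute force here; an ANN index would serve this lookup in practice).
    candidates = set()
    for q in query_vecs:
        nearest = sorted(token_index, key=lambda row: dot(q, row[1]), reverse=True)
        candidates.update(doc_id for doc_id, _ in nearest[:per_token_k])
    # Stage 2: exact MaxSim over the candidate documents only.
    scored = [
        (doc_id, sum(max(dot(q, d) for d in corpus[doc_id]) for q in query_vecs))
        for doc_id in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


corpus = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.9, 0.1], [0.8, 0.2]],
    "c": [[-1.0, 0.0], [0.0, -1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
results = search(query, corpus, build_token_index(corpus))
print(results)
```

Raising per_token_k trades more candidate scoring work for better recall; documents whose tokens never appear among the nearest neighbours, like "c" here, are never scored.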

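What is PLAID? PLAID (Performance-optimised Late Interaction Driver) is the indexing and retrieval engine for ColBERTv2. It uses centroid-based pruning and compression to skip most candidate documents before full MaxSim scoring, which makes late interaction retrieval practical at large scale (the PLAID paper reports results on a corpus of about 140M passages).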

