
What is Late Interaction Retrieval?

Late interaction retrieval is a neural search architecture where the query and document are encoded independently into token-level vectors, and their similarity is computed at retrieval time using a fine-grained token matching function (typically MaxSim). It sits between bi-encoders (fast but less accurate) and cross-encoders (accurate but slow), offering a strong accuracy-speed trade-off for high-precision retrieval.

Why does late interaction matter?

There are two extremes in neural retrieval:

Bi-encoders compress query and document into single vectors independently, then compare with a dot product. Fast, scalable, but loses token-level detail.
Cross-encoders (rerankers) process the query and document together, attending to every token pair. Highly accurate but too slow for first-stage retrieval over millions of documents.

Late interaction is the middle ground: encode query and document independently (enabling pre-computation of document vectors), but score them with a richer token-level interaction function rather than a single dot product.

How does late interaction work?

The late interaction mechanism:

Encode query → m token vectors: Q = [q₁, q₂, ..., qₘ]
Encode document → n token vectors: D = [d₁, d₂, ..., dₙ]
Score with MaxSim (ColBERT’s approach):

Score(Q, D) = Σᵢ₌₁ᵐ max_{j=1..n} (qᵢ · dⱼ)

For each query token, find the most similar document token, then sum these maximum similarities across all query tokens. Every query term is matched to its best counterpart in the document, which captures token-level matches that a single-vector dot product averages away.

The key insight: document token vectors can be pre-computed and stored (enabling ANN-style retrieval), but the interaction function is richer than a single dot product.
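The MaxSim formula above can be sketched in a few lines of plain Python. This is a minimal illustration with made-up 2-dimensional token vectors, not model output; a production system would use batched matrix operations instead of nested loops.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """For each query token take its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token vectors (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Query token 1 matches best with doc token 1 (0.9),
# query token 2 matches best with doc token 2 (0.8).
score = maxsim(query, doc)
print(round(score, 6))  # 1.7
```

Note that the sketch assumes the document has at least one token vector; an empty document would need an explicit guard.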

Early interaction vs late interaction vs no interaction

Architecture	Interaction point	Accuracy	Speed	Scalable
Bi-encoder	None (single vector)	Good	Fastest	✓
Late interaction (ColBERT)	At retrieval (token MaxSim)	High	Medium	✓
Cross-encoder	Inside model (attention)	Highest	Slow	✗ (rerank only)

“Late” refers to where the query-document interaction happens: after independent encoding, at retrieval time. “Early” would mean interaction inside the encoder (cross-encoder style).
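To make the difference between "no interaction" and "late interaction" concrete, the sketch below scores the same toy documents both ways. It assumes mean pooling for the single-vector case (real bi-encoders may use a different pooling strategy) and uses hand-picked vectors chosen to show the effect.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(token_vecs):
    """Average token vectors into a single vector (assumed pooling choice)."""
    n = len(token_vecs)
    return [sum(column) / n for column in zip(*token_vecs)]


def maxsim(query_vecs, doc_vecs):
    """Late interaction score: best document match per query token, summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # one exact match plus an unrelated token
doc_b = [[0.6, 0.0], [0.6, 0.0]]  # only partial matches

# No interaction: compare pooled vectors.
single_a = dot(mean_pool(query), mean_pool(doc_a))  # 0.5
single_b = dot(mean_pool(query), mean_pool(doc_b))  # 0.6

# Late interaction: compare token vectors with MaxSim.
late_a = maxsim(query, doc_a)  # 1.0
late_b = maxsim(query, doc_b)  # 0.6
```

In this toy case pooling dilutes the exact match in doc_a with its unrelated token, so the single-vector score prefers doc_b, while MaxSim prefers doc_a.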

ColBERT: the dominant late interaction model

ColBERT (Contextualized Late Interaction over BERT) is the best-known late interaction architecture. Key design choices:

Per-token vectors compressed to 128 dimensions (smaller than typical 768-dim dense vectors) to manage storage
MaxSim scoring as described above
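Queries are padded with [MASK] tokens up to a fixed length (query augmentation), which gives the model extra positions to learn soft query-expansion terms
Document vectors are pre-computed and stored in a specialised index (PLAID, in ColBERTv2-era deployments)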

ColBERTv2 improved training with distillation from a cross-encoder and hard negative mining, substantially closing the gap with cross-encoder accuracy. It also introduced residual compression, which reduces the storage footprint of the token vectors.
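The query augmentation step listed above can be sketched as follows. This is a simplified illustration: the helper name `augment_query` is hypothetical, tokens are shown as whole words where a real tokenizer would produce wordpieces, and the exact marker placement and truncation behaviour depend on the implementation. The default length of 32 follows the setting reported for ColBERT.

```python
from typing import List


def augment_query(tokens: List[str], max_len: int = 32) -> List[str]:
    """Add ColBERT-style markers and pad the query with [MASK] up to max_len.

    Simplification: over-long queries are truncated from the right,
    which may drop the trailing [SEP].
    """
    sequence = ["[CLS]", "[Q]"] + tokens + ["[SEP]"]
    sequence = sequence[:max_len]
    padding = ["[MASK]"] * (max_len - len(sequence))
    return sequence + padding


augmented = augment_query(
    ["breach", "of", "contract", "indemnification"], max_len=12
)
print(augmented)
```

Each [MASK] position still produces a contextualised vector, so the padded positions take part in MaxSim scoring like any other query token.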

Late interaction in practice with SIE

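BGE-M3 implements late interaction via its multi-vector output mode. Its per-token vectors are scored ColBERT-style, but in the reference BGE-M3 implementation they keep the model's hidden size (1024 dims) rather than ColBERT's 128-dim projection, so check the shape your deployment returns: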

from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

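# Example corpus (placeholder strings; substitute your own documents)
documents = [
    "The supplier shall indemnify the buyer against losses arising from breach of contract.",
    "Either party may terminate this agreement with thirty days written notice.",
]

# Get token-level vectors for late interaction
doc_results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["multivector"],
)
colbert_vecs = [r["multivector"] for r in doc_results]
# Each array has shape [num_tokens, dim]; dim is set by the model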

query_result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract indemnification"),
    output_types=["multivector"],
    is_query=True,
)
query_mv = query_result["multivector"]
scores = maxsim(query_mv, colbert_vecs)

At query time, the MaxSim scoring is computed in your vector database (Qdrant supports this natively) or in your retrieval layer.

Storage and infrastructure considerations

Late interaction’s main cost is storage. For 1M documents averaging 256 tokens each, at 128 dims per token (float16):

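1,000,000 × 256 tokens × 128 dims × 2 bytes ≈ 65.5 GB

vs single-vector at 768 dims (float32):

1,000,000 × 768 dims × 4 bytes ≈ 3.1 GB

Note that the two figures use different precisions. Stored as float16, the single-vector index would be about 1.5 GB, which widens the gap further.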
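The storage arithmetic above can be reproduced with a small helper, which also makes it easy to plug in your own corpus size and token counts. The function name and parameters are illustrative, and the estimate covers raw vector data only (no index overhead or metadata).

```python
def index_size_gb(
    num_docs: int, vectors_per_doc: int, dims: int, bytes_per_value: int
) -> float:
    """Raw vector storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_docs * vectors_per_doc * dims * bytes_per_value / 1e9


# Late interaction: 256 token vectors per document, 128 dims, float16.
multi_fp16 = index_size_gb(1_000_000, 256, 128, 2)

# Single vector per document: 768 dims, float32 and float16.
single_fp32 = index_size_gb(1_000_000, 1, 768, 4)
single_fp16 = index_size_gb(1_000_000, 1, 768, 2)

print(f"multi-vector fp16:  {multi_fp16:.1f} GB")   # 65.5 GB
print(f"single-vector fp32: {single_fp32:.1f} GB")  # 3.1 GB
print(f"single-vector fp16: {single_fp16:.1f} GB")  # 1.5 GB
print(f"ratio at equal precision: {multi_fp16 / single_fp16:.1f}x")  # 42.7x
```

At equal precision the multi-vector index is roughly 43 times larger in this scenario, since 256 × 128 values per document replace 768.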

Strategies to manage this:

Use token pruning: drop vectors for less informative tokens (punctuation, stopwords)
Apply to a subset of corpus: use late interaction for high-value documents, single-vector for the rest
Use quantisation: store vectors in int8 instead of float16 or float32
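As a rough illustration of the quantisation strategy, the sketch below applies symmetric per-vector int8 quantisation. This is one simple scheme chosen for clarity; it is not the residual compression used by ColBERTv2, and vector databases may implement quantisation differently.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


vec = [0.12, -0.50, 0.33, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)

# The rounding error per value is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, max_error)
```

Each value then takes 1 byte instead of 2 (float16) or 4 (float32), plus one scale per vector, at the cost of a small reconstruction error that should be checked against retrieval quality on your own data.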

Frequently asked questions

Is late interaction better than reranking? Late interaction is a first-stage retrieval method that scales to millions of documents. Reranking (cross-encoder) is a second-stage method that processes a small shortlist. For maximum accuracy, use late interaction for first-stage retrieval, then a cross-encoder reranker for the top 20-50 results.
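The two-stage setup described in that answer can be sketched as a small pipeline. The scoring callables are hypothetical stand-ins: in practice `first_stage` would be a late interaction index lookup and `reranker` a cross-encoder, and the toy scores below are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]  # (query, doc_id) -> relevance score


def two_stage_search(
    query: str,
    doc_ids: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    shortlist_size: int = 50,
    final_k: int = 10,
) -> List[Tuple[str, float]]:
    """Shortlist with a cheap scorer, then rerank only the shortlist."""
    shortlist = sorted(
        doc_ids, key=lambda d: first_stage(query, d), reverse=True
    )[:shortlist_size]
    reranked = sorted(
        ((d, reranker(query, d)) for d in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:final_k]


# Toy scores standing in for real models (hypothetical values).
first_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
rerank_scores = {"a": 0.3, "b": 0.7, "c": 0.99}

results = two_stage_search(
    "breach of contract",
    ["a", "b", "c"],
    first_stage=lambda q, d: first_scores[d],
    reranker=lambda q, d: rerank_scores[d],
    shortlist_size=2,
    final_k=2,
)
print(results)  # [('b', 0.7), ('a', 0.3)]
```

Document "c" would have scored highest under the reranker but never reaches it, which shows why first-stage recall bounds the quality of the final ranking.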

Does late interaction work with standard vector databases? You need a vector DB that supports multi-vector storage and MaxSim scoring. Qdrant supports this. With a standard single-vector ANN index you would have to index each token vector as its own entry and implement MaxSim yourself on top of the results.
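One way to layer MaxSim on top of a single-vector index, similar in spirit to the original ColBERT pipeline, is to search per query token for candidate documents and then score only those candidates exactly. The sketch below uses a brute-force nearest-neighbour scan as a stand-in for a real ANN index; the function names, parameters, and toy vectors are all illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Best document match per query token, summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def search(query_vecs, corpus, per_token_k=2, top_k=2):
    """Token-level candidate generation followed by exact MaxSim scoring."""
    # Flatten the corpus into (doc_id, token_vector) entries, the way a
    # single-vector index would store one entry per token.
    entries = [(doc_id, vec) for doc_id, vecs in corpus.items() for vec in vecs]

    candidates = set()
    for q in query_vecs:
        # Brute-force stand-in for an ANN lookup of the nearest token vectors.
        nearest = sorted(entries, key=lambda e: dot(q, e[1]), reverse=True)
        candidates.update(doc_id for doc_id, _ in nearest[:per_token_k])

    # Exact MaxSim over the full token set of each candidate document.
    scored = [(doc_id, maxsim(query_vecs, corpus[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


corpus = {
    "doc1": [[1.0, 0.0], [0.0, 1.0]],
    "doc2": [[0.7, 0.7]],
    "doc3": [[-1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

results = search(query, corpus)
print([doc_id for doc_id, _ in results])  # ['doc1', 'doc2']
```

The candidate stage trades recall for speed: a document whose tokens never appear in any per-token result list is not scored at all, so `per_token_k` needs tuning against your own data.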

What is PLAID? PLAID (Performance-optimized Late Interaction Driver) is an indexing and retrieval engine for ColBERTv2. It builds on ColBERTv2's centroid-based residual compression and uses centroid interaction and pruning to discard weak candidates early, which makes late interaction retrieval practical over corpora of 100M+ passages.