---
title: What is Late Interaction Retrieval?
description: Late interaction retrieval is a neural search architecture where the query and document are encoded independently into token-level vectors, and their similarity is computed at retrieval time using a fine-grained token matching function (typically MaxSim). It sits between bi-encoders (fast but less accurate) and cros...
canonical_url: https://superlinked.com/glossary/what-is-late-interaction-retrieval
last_updated: 2026-06-09
---

# What is Late Interaction Retrieval?

Late interaction retrieval is a neural search architecture where the query and document are encoded independently into token-level vectors, and their similarity is computed at retrieval time using a fine-grained token matching function (typically MaxSim). It sits between bi-encoders (fast but less accurate) and cross-encoders (accurate but slow), offering a strong accuracy-speed trade-off for high-precision retrieval.

---

## Why does late interaction matter?

There are two extremes in neural retrieval:

- **Bi-encoders** compress query and document into single vectors independently, then compare them with a dot product. Fast and scalable, but token-level detail is lost in the pooled vector.
- **Cross-encoders** (rerankers) process the query and document together, attending to every token pair. Highly accurate but too slow for first-stage retrieval over millions of documents.

Late interaction is the middle ground: encode query and document independently (enabling pre-computation of document vectors), but score them with a richer token-level interaction function rather than a single dot product.

---

## How does late interaction work?

The late interaction mechanism:

1. **Encode query** → `m` token vectors: `Q = [q₁, q₂, ..., qₘ]`
2. **Encode document** → `n` token vectors: `D = [d₁, d₂, ..., dₙ]`
3. **Score with MaxSim** (ColBERT's approach):

```
Score(Q, D) = Σᵢ₌₁ᵐ max_{j=1..n} (qᵢ · dⱼ)
```

For each query token, find the most similar document token, then sum these maximum similarities across all query tokens. Every query token is matched to its best counterpart in the document, which preserves term-level evidence that a single pooled vector averages away.

The key insight: document token vectors can be pre-computed and stored (enabling ANN-style retrieval), but the interaction function is richer than a single dot product.
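
The MaxSim formula above can be sketched in plain Python. This is a minimal illustration rather than ColBERT's actual implementation: `dot` and `maxsim_score` are hypothetical helper names, and the 2-dimensional vectors are made-up values standing in for real token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best dot product against any
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dim "token vectors" (illustrative values, not real embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.9, 0.1], [0.7, 0.3]]

score_a = maxsim_score(query, doc_a)  # best matches: 0.9 and 0.8
score_b = maxsim_score(query, doc_b)  # best matches: 0.9 and 0.3
```

Here `doc_a` scores higher because it contains a close match for both query tokens, while `doc_b` only matches the first one well. Production systems compute the same quantity with batched matrix operations.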

---

## Early interaction vs late interaction vs no interaction

| Architecture | Interaction point | Accuracy | Speed | Scalable |
|---|---|---|---|---|
| Bi-encoder | None (single vector) | Good | Fastest | ✓ |
| Late interaction (ColBERT) | At retrieval (token MaxSim) | High | Medium | ✓ |
| Cross-encoder | Inside model (attention) | Highest | Slow | ✗ (rerank only) |

"Late" refers to where the query-document interaction happens: after independent encoding, at retrieval time. "Early" would mean interaction inside the encoder (cross-encoder style).

---

## ColBERT: the dominant late interaction model

ColBERT (Contextualized Late Interaction over BERT) is the most widely deployed late interaction architecture. Key design choices:

- Per-token vectors compressed to **128 dimensions** (smaller than typical 768-dim dense vectors) to manage storage
- **MaxSim scoring** as described above
- Queries are **padded with [MASK] tokens** up to a fixed length, which acts as a learned form of query expansion
- Document vectors are pre-computed and stored in a specialised index (PLAID)

ColBERTv2 improved training with distillation from a cross-encoder and hard negative mining, substantially closing the gap with cross-encoder accuracy, and added residual compression to shrink the stored token vectors.
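
The [MASK] augmentation in the list above can be sketched as fixed-length padding. This is a simplified illustration under stated assumptions: `augment_query` is a hypothetical helper, the tokens are whole words rather than real WordPiece tokens, and the special `[CLS]`, `[Q]`, and `[SEP]` markers are left out. The default of 32 follows the query length used in the ColBERT paper.

```python
def augment_query(tokens, target_len=32, mask_token="[MASK]"):
    """Truncate or pad a tokenised query to a fixed length with mask tokens."""
    tokens = tokens[:target_len]
    return tokens + [mask_token] * (target_len - len(tokens))


# Whole-word tokens for readability; a real tokenizer would emit subwords.
query_tokens = ["breach", "of", "contract", "indemnification"]
augmented = augment_query(query_tokens, target_len=8)
```

Each padded position still produces a contextualised vector when the query is encoded, so the mask slots contribute extra terms to the MaxSim sum.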

---

## Late interaction in practice with SIE

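BGE-M3 supports late interaction through its multi-vector output mode, which returns one vector per token for ColBERT-style MaxSim scoring. Note that the vector width is model-dependent: the released BGE-M3 checkpoint produces 1024-dim token vectors rather than ColBERT's 128.

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item
from sie_sdk.scoring import maxsim

client = SIEClient("http://localhost:8080")

# Example corpus (placeholder texts; replace with your own documents)
documents = [
    "The supplier shall indemnify the buyer against losses arising from breach of contract.",
    "Either party may terminate this agreement with 30 days' written notice.",
]

# Get token-level vectors for late interaction
doc_results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["multivector"],
)
colbert_vecs = [r["multivector"] for r in doc_results]
# Each array has shape [num_tokens, dim]; dim depends on the model

# Encode the query in query mode, then score it against every document
query_result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract indemnification"),
    output_types=["multivector"],
    is_query=True,
)
query_mv = query_result["multivector"]
scores = maxsim(query_mv, colbert_vecs)
```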

At query time, the MaxSim scoring is computed in your vector database (Qdrant supports this natively) or in your retrieval layer.

---

## Storage and infrastructure considerations

Late interaction's main cost is storage. For 1M documents averaging 256 tokens each, at 128 dims per token (float16):

```
1,000,000 × 256 tokens × 128 dims × 2 bytes = ~65GB
```

Compare that with a single vector per document at 768 dims (float32):

```
1,000,000 × 768 dims × 4 bytes = ~3GB
```
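
The arithmetic above generalises to other corpus sizes. The helper below is a hypothetical back-of-envelope estimator, not part of any library; it counts raw vector bytes only and ignores index overhead and compression.

```python
def index_size_bytes(num_docs, vectors_per_doc, dims, bytes_per_value):
    """Raw storage for dense vectors, excluding any index overhead."""
    return num_docs * vectors_per_doc * dims * bytes_per_value


# Late interaction: 256 token vectors per document, 128 dims, float16.
late_bytes = index_size_bytes(1_000_000, 256, 128, 2)

# Single-vector: one 768-dim float32 vector per document.
single_bytes = index_size_bytes(1_000_000, 1, 768, 4)

ratio = late_bytes / single_bytes
print(f"late interaction: {late_bytes / 1e9:.1f} GB")
print(f"single-vector:    {single_bytes / 1e9:.1f} GB")
print(f"ratio:            {ratio:.1f}x")
```

Under these assumptions the multi-vector index is roughly 21 times larger than the single-vector one.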

Strategies to manage this:
- Use **token pruning**: drop less informative tokens (punctuation, stopwords) before indexing
- Apply to a **subset of corpus**: use late interaction for high-value documents, single-vector for the rest
- Use **quantisation**: store vectors in int8 instead of float16, roughly halving the footprint in the estimate above
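
As a rough illustration of the quantisation strategy, the sketch below applies symmetric per-vector int8 quantisation in plain Python. The scheme and the helper names `quantise_int8` and `dequantise` are assumptions chosen for illustration; vector databases and ColBERT indexes use their own, usually more sophisticated, compression.

```python
def quantise_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantise(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


# Made-up 4-dim token vector standing in for a real embedding.
token_vec = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantise_int8(token_vec)
restored = dequantise(codes, scale)

# Rounding error per value is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(token_vec, restored))
```

Each value then occupies one byte instead of two (float16) or four (float32), at the cost of one stored scale per vector and a small reconstruction error.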

---

## Frequently asked questions

**Is late interaction better than reranking?**
Late interaction is a first-stage retrieval method that scales to millions of documents. Reranking (cross-encoder) is a second-stage method that processes a small shortlist. For maximum accuracy, use late interaction for first-stage retrieval, then a cross-encoder reranker for the top 20-50 results.
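
A two-stage pipeline of this kind can be sketched as follows. Everything here is hypothetical: `two_stage_search` is an illustrative helper, and the two toy scoring functions merely stand in for a late interaction scorer and a cross-encoder reranker so the example runs without any model.

```python
def two_stage_search(query, corpus, first_stage_score, rerank_score,
                     shortlist_size=50, top_k=10):
    """Rank the corpus with a cheap scorer, then rerank only the shortlist."""
    shortlist = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:shortlist_size]
    return sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:top_k]


def overlap(query, doc):
    """Toy first-stage scorer: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


def focused_overlap(query, doc):
    """Toy reranker: shared words normalised by document length."""
    return overlap(query, doc) / len(doc.split())


corpus = [
    "breach of contract and indemnification clauses",
    "contract breach",
    "employment contract template",
    "weather report for tuesday",
]
results = two_stage_search(
    "contract breach", corpus, overlap, focused_overlap,
    shortlist_size=3, top_k=2,
)
```

The expensive scorer only ever sees `shortlist_size` documents, which is what keeps the second stage affordable.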

**Does late interaction work with standard vector databases?**
Only if the database supports multi-vector storage and MaxSim scoring, as Qdrant does. A standard single-vector ANN index stores one vector per document and cannot compute MaxSim on its own, so with such a database you have to implement MaxSim yourself in the retrieval layer.

**What is PLAID?**
PLAID (Performance-optimised Late Interaction Driver) is the indexing and retrieval engine for ColBERTv2. It uses centroid-based candidate pruning on top of ColBERTv2's compressed token vectors to make late interaction retrieval practical over large corpora (the PLAID paper reports results on up to 140M passages).

---

## Related resources

- [What is ColBERT?](/glossary/what-is-colbert)
- [What is multi-vector search?](/glossary/what-is-multi-vector-search)
- [What is a reranker?](/glossary/what-is-a-reranker)
- [What is BGE-M3?](/glossary/what-is-bge-m3)
- [SIE + Qdrant integration](/docs/integrations/qdrant)
- [Browse models on SIE](/models)
