Search & Retrieval

What is a reranker?

A reranker is a model that takes a query and a set of candidate results from an initial retrieval step, and re-scores them for relevance. Unlike embedding models that encode query and documents independently, rerankers compare them jointly, producing more accurate relevance scores at the cost of higher latency.

Why does reranking matter?

First-stage retrieval (semantic or keyword search) is optimised for speed and scale. It retrieves the top-k results from millions of documents in milliseconds. But speed comes at a cost: embedding-based retrieval encodes the query and documents separately, so it can miss subtle relevance signals.

A reranker addresses this gap. It sees the query and each candidate document together, allowing it to reason about their relationship directly. This two-stage approach (retrieve broadly, then rerank precisely) is the standard pattern in production search and RAG systems.
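
To make the two-stage pattern concrete, here is a minimal, self-contained sketch. The scoring functions are toy stand-ins, not real models: `overlap_score` plays the role of the fast first stage (each side is reduced to a term set on its own), and `phrase_score` plays the role of the reranker by looking at the query and document together. All names below are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List

Scorer = Callable[[str, str], float]


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query: str, doc: str) -> float:
    """Toy first stage: fraction of query terms present in the document."""
    q_terms, d_terms = set(tokens(query)), set(tokens(doc))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def phrase_score(query: str, doc: str) -> float:
    """Toy second stage: also rewards adjacent query words that are adjacent in the doc."""
    q, d = tokens(query), tokens(doc)
    doc_pairs = set(zip(d, d[1:]))
    adjacent_hits = sum(1 for pair in zip(q, q[1:]) if pair in doc_pairs)
    return adjacent_hits + overlap_score(query, doc)


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    second_stage: Scorer,
    retrieve_k: int,
    final_k: int,
) -> List[str]:
    # Stage 1: cheap score over the whole collection, keep retrieve_k candidates.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:retrieve_k]
    # Stage 2: more expensive joint score over the candidates only.
    return sorted(candidates, key=lambda d: second_stage(query, d), reverse=True)[:final_k]


DOCS = [
    "The clause on indemnification was removed.",
    "Indemnification clause: the supplier shall indemnify the buyer.",
    "Clause numbering and formatting guide.",
    "Office lunch menu.",
]

top = retrieve_then_rerank(
    "indemnification clause", DOCS, overlap_score, phrase_score, retrieve_k=3, final_k=2
)
print(top)
```

The first stage cannot separate the first two documents, because both contain both query terms. The second stage ranks the document that contains the query as a phrase first.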

How does a reranker work?

Rerankers are typically cross-encoder models. Given a (query, document) pair, the model outputs a single relevance score. In a search pipeline, that scoring step sits between retrieval and serving:

Retrieve: use a fast vector search to get top-100 candidates
Rerank: pass each (query, candidate) pair through the cross-encoder
Return: serve the top-k reranked results

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

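# Placeholders (not part of sie_sdk): `vector_db` is your vector store client
# and `query_vector` is the embedding of the query text.
query = "indemnification clause"

# First stage: fast embedding retrieval
candidates = vector_db.search(query_vector, top_k=100)
candidate_texts = [c.text for c in candidates]

# Second stage: rerank for precision
score_result = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(id=str(i), text=t) for i, t in enumerate(candidate_texts)],
)

# Map item ids back to the original candidates and keep the top 10.
# This assumes score_result["scores"] is ordered from most to least relevant.
id_to_candidate = {str(i): c for i, c in enumerate(candidates)}
results = [id_to_candidate[e["item_id"]] for e in score_result["scores"][:10]]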

When should you use a reranker?

A reranker is worth adding when:

Precision matters more than raw speed: e.g. legal search, medical RAG, customer support
Your retrieval recall is good but top results are noisy: reranking cleans up the final ordering
You’re building a RAG pipeline: the documents fed to an LLM must be highly relevant; irrelevant context degrades answer quality
Query complexity is high: long, nuanced queries benefit most from joint query-document scoring

You can generally skip a reranker for simple lookup tasks, and it may not fit when latency budgets are very tight.

Reranker vs embedding model: key differences

| | Embedding model | Reranker |
| --- | --- | --- |
| Architecture | Bi-encoder | Cross-encoder |
| Encodes | Query and docs independently | Query + doc jointly |
| Speed | Fast (pre-compute doc vectors) | Slower (runtime per pair) |
| Accuracy | Good | Higher |
| Use in pipeline | First-stage retrieval | Second-stage reranking |
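
The speed difference follows from how much model work happens at query time. The sketch below counts forward passes under simplifying assumptions: document vectors are already indexed for the bi-encoder, and the cross-encoder scores one pair per pass. The function names and the batch size of 32 are hypothetical.

```python
import math


def query_time_model_passes(architecture: str, num_candidates: int) -> int:
    """Model forward passes needed at query time to score the candidates."""
    if architecture == "bi-encoder":
        # Document vectors are pre-computed, so only the query is encoded.
        return 1
    if architecture == "cross-encoder":
        # Each (query, document) pair needs its own forward pass.
        return num_candidates
    raise ValueError(f"unknown architecture: {architecture}")


def batches_needed(num_pairs: int, batch_size: int) -> int:
    """Number of batched model calls needed to score all pairs."""
    return math.ceil(num_pairs / batch_size)


print(query_time_model_passes("bi-encoder", 100))     # 1
print(query_time_model_passes("cross-encoder", 100))  # 100
print(batches_needed(100, 32))                        # 4
```

Batching does not reduce the number of pairs to score, but it groups them into fewer model calls, which is why reranking is usually limited to a bounded candidate set.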

Which reranker models does SIE support?

SIE supports leading open-source rerankers including:

BGE-Reranker-v2-M3: multilingual, strong general-purpose performance
BGE-Reranker-v2-gemma: higher accuracy for complex queries
Jina Reranker v2: lightweight, fast

All models run in your own AWS or GCP environment, with no data sent to external APIs. You can hot-swap models without downtime.

Frequently asked questions

Does using a reranker significantly increase latency?

Reranking 100 candidates typically adds 50-200 ms depending on model size and hardware. For most search and RAG applications this is acceptable given the accuracy gains. SIE’s GPU batching minimises this overhead.
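
Because the added latency depends on your model and hardware, it is worth measuring in your own setup. The helper below is a minimal sketch using only the standard library. `timed_ms` and `fake_rerank` are hypothetical names, and `fake_rerank` is a placeholder for a real rerank call such as the `client.score` call shown earlier.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def timed_ms(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (its result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_rerank() -> List[str]:
    # Placeholder: sleeps briefly instead of calling a real reranker.
    time.sleep(0.01)
    return ["doc-a", "doc-b"]


results, elapsed = timed_ms(fake_rerank)
print(f"rerank took {elapsed:.1f} ms for {len(results)} results")
```

A single measurement is noisy, so repeat the call and look at the median or a high percentile before comparing it against a latency budget.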

Can I use a reranker without a vector database?

Yes. You can rerank any list of documents, including keyword search results from Elasticsearch or BM25.
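
As an illustration, the snippet below uses a small pure-Python BM25 scorer as the first stage. It is a simplified sketch of one common BM25 variant with widely used defaults (`k1=1.5`, `b=0.75`), not a replacement for a real keyword engine. The function and variable names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "The indemnification clause limits supplier liability.",
    "Payment terms are net thirty days.",
    "Each clause is numbered for reference.",
]
scores = bm25_scores("indemnification clause", docs)

# Keep the best-scoring documents that matched at least one query term.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
candidate_texts = [docs[i] for i in ranked[:2] if scores[i] > 0]
print(candidate_texts)
```

The resulting `candidate_texts` list can be passed to the reranker in the same way as candidates from a vector search.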

Do I need to fine-tune a reranker for my domain?

Out-of-the-box rerankers perform well for general queries. For specialised domains (legal, medical, code), fine-tuned or LoRA-adapted rerankers can improve relevance significantly. SIE supports LoRA hot-loading for this purpose.