
What is a Reranking Pipeline?

A reranking pipeline is a two-stage retrieval architecture where a fast first-stage retriever (embedding model + vector DB) fetches a broad set of candidates, and a slower but more accurate second-stage reranker (cross-encoder) re-scores and reorders them. The result is retrieval that combines the scalability of ANN search with the precision of deep query-document interaction. This is the standard approach for production search and RAG systems requiring high accuracy.


Why use a reranking pipeline instead of just a retriever?

Embedding models are bi-encoders: they encode query and documents independently and compare vectors. This is fast and scalable, but misses fine-grained relevance signals that require seeing the query and document together.

A reranker processes the query and each document jointly, so it can demote false positives that the retriever lets through:

  • A document about “Python snakes” retrieved for a query about “Python programming”
  • A legal clause that is semantically nearby but not specifically relevant to the query’s precise requirement
  • A document that matches the topic but answers a different question
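To make the bi-encoder side concrete, here is a minimal sketch of independent encoding followed by vector comparison. The `encode` function is a toy bag-of-words stand-in, not a real embedding model, and the two documents are invented for illustration; the point is only that documents are encoded without ever seeing the query.

```python
import math
from collections import Counter

def encode(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "python snakes are found in africa and asia",
    "python programming uses indentation for blocks",
]

# Documents are encoded once, ahead of time, with no knowledge of any query.
doc_vectors = [encode(d) for d in docs]

# At query time only the query is encoded; each comparison is a cheap vector operation.
query_vector = encode("python programming")
bi_encoder_scores = [cosine(query_vector, v) for v in doc_vectors]
```

In this toy setup the snake document still receives a non-zero score from the shared word "python". A reranker that reads the query and document together is the stage expected to push such a candidate down.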

Adding a reranker to the pipeline improves precision significantly at the cost of extra latency (typically 50-200ms), which is acceptable for most production search and RAG workloads.


How does a reranking pipeline work?

User query
    ↓
[Embedding model] ← encodes query to vector
    ↓
[Vector database] ← ANN search, returns top-100 candidates
    ↓
[Reranker] ← scores each (query, candidate) pair jointly
    ↓
Top-K reranked results
    ↓
[LLM / answer generation] (for RAG)

The reranker only processes the top-N candidates from first-stage retrieval (typically 20-100), not the full corpus, making it tractable despite its higher per-pair cost.
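The control flow can be sketched independently of any particular model. In the sketch below, `retrieve_then_rerank`, `word_overlap`, and `counted_phrase_match` are hypothetical names; the two scorers are toy functions standing in for an embedding retriever and a cross-encoder, and the call counter is there only to show that the expensive scorer sees N pairs rather than the whole corpus.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 5,
) -> List[str]:
    # Stage 1: the cheap scorer runs over the full corpus; keep the top N.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: the expensive scorer runs only on those N candidates.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]

def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

calls = {"expensive": 0}

def counted_phrase_match(query: str, doc: str) -> float:
    """Toy second-stage scorer: exact phrase match, counting its invocations."""
    calls["expensive"] += 1
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "python snakes of southeast asia",
    "a guide to python programming",
    "programming in rust",
    "cooking with cast iron",
]
results = retrieve_then_rerank(
    "python programming", corpus, word_overlap, counted_phrase_match,
    n_candidates=3, top_k=1,
)
```

With four documents and `n_candidates=3`, the second-stage scorer is invoked three times. The same ratio is what keeps a real cross-encoder affordable when the corpus holds millions of documents.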


Building a reranking pipeline with SIE

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
query = "What are the indemnification obligations in a SaaS agreement?"

# Stage 1: fast embedding retrieval
# (vector_db is not defined here; it stands in for your vector database client)
query_vector = client.encode("BAAI/bge-m3", Item(text=query), is_query=True)["dense"]
candidates = vector_db.search(query_vector, top_k=50)
candidate_texts = [c.text for c in candidates]

# Stage 2: rerank with cross-encoder
score_result = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(id=str(i), text=t) for i, t in enumerate(candidate_texts)],
)
id_to_text = {str(i): t for i, t in enumerate(candidate_texts)}

# Take top 5 for LLM context (scores sorted by relevance, rank 0 = best)
top_chunks = [id_to_text[e["item_id"]] for e in score_result["scores"][:5]]

Both the embedding model and reranker run on the same SIE cluster: one deployment, two models, all within your cloud account.


First-stage retrieval size: how many candidates to fetch?

The number of candidates passed to the reranker affects both quality and latency:

Candidates (k) | Recall improvement | Reranker latency
10             | Baseline           | ~20ms
20             | +5-10%             | ~40ms
50             | +10-15%            | ~100ms
100            | +12-18%            | ~200ms

More candidates → higher recall (more relevant docs in the pool) → better reranker output. Returns diminish as the pool grows, and the gains typically plateau around 50-100 candidates. For latency-sensitive applications, 20-50 is a practical sweet spot.
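The latency figures in the table scale roughly linearly, at about 2ms per candidate. Assuming that linear relationship holds on your hardware (it is an approximation read off the table, not a measured guarantee), a candidate count can be derived from a latency budget. The helper names and the cap of 100 are illustrative.

```python
# Approximate reranker latencies from the table above: (candidates, milliseconds).
LATENCY_TABLE = [(10, 20), (20, 40), (50, 100), (100, 200)]

def ms_per_candidate(table) -> float:
    """Average per-candidate cost implied by the table rows."""
    return sum(ms / k for k, ms in table) / len(table)

def max_candidates(budget_ms: float, per_candidate_ms: float, ceiling: int = 100) -> int:
    """Largest candidate count that fits the budget, capped where recall gains plateau."""
    fit = int(budget_ms // per_candidate_ms)
    return min(ceiling, fit)

per_candidate = ms_per_candidate(LATENCY_TABLE)
candidates_for_100ms = max_candidates(100, per_candidate)
candidates_for_30ms = max_candidates(30, per_candidate)
```

Measure the per-candidate cost for your own model and GPU before relying on numbers like these; batch size and document length both shift the slope.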


Which reranker should you use?

Model                 | Size | Accuracy | Latency | Best for
BGE-reranker-base     | 278M | Good     | Fast    | High-throughput production
BGE-reranker-large    | 560M | Better   | Medium  | Balanced production
BGE-reranker-v2-m3    | 570M | High     | Medium  | Multilingual
BGE-reranker-v2-gemma | 2.5B | Highest  | Slower  | Maximum accuracy
Jina Reranker v2      | 278M | Good     | Fast    | Lightweight option

For most production RAG systems, BGE-reranker-v2-m3 provides the best accuracy-latency trade-off, especially if your content is multilingual.


Reranking pipeline vs hybrid search: which to prioritise?

These are complementary, not competing:

  • Hybrid search improves first-stage recall: more relevant documents enter the candidate pool
  • Reranking improves precision: the right documents are at the top of the final list

The optimal pipeline for high-accuracy production systems is:

Hybrid retrieval (dense + sparse) → Reranker → LLM generation

Start with a dense retrieval + reranker pipeline. Add hybrid search once you’ve validated that the retrieval quality improvement justifies the additional complexity.
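The text above does not say how the dense and sparse result lists are merged before reranking. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method this pipeline prescribes. The document ids are invented, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one candidate list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

dense_ranking = ["d3", "d1", "d2"]   # hypothetical ids from vector search
sparse_ranking = ["d1", "d4", "d3"]  # hypothetical ids from keyword/BM25 search
candidates = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF uses only rank positions, so it sidesteps the problem that dense similarity scores and sparse scores live on different scales. The fused list is then passed to the reranker exactly as a single-retriever candidate list would be.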


Measuring reranking pipeline quality

Metric                | What it measures
NDCG@K                | Ranking quality: relevant docs earn more credit the higher they are placed
MRR@K                 | How high the first relevant result appears
Precision@K           | Of top-K results, fraction that are relevant
Recall@K (pre-rerank) | Coverage before reranking: are relevant docs in the pool?

Measure recall@100 before reranking and precision@5 after. This tells you whether your first-stage retrieval is finding relevant docs (recall) and whether your reranker is surfacing them at the top (precision).
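These metrics are short enough to compute directly. The sketch below assumes binary relevance labels (a document is either relevant or not) and a single query; in practice you would average each metric over an evaluation set. The function names and the sample rankings are illustrative.

```python
import math
from typing import List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k that is relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant doc in the top k; MRR@K is the mean over queries."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

relevant = {"a", "b"}
pre_rerank = ["x", "a", "y", "b", "z"]   # first-stage order
post_rerank = ["a", "b", "x", "y", "z"]  # order after reranking

pool_recall = recall_at_k(pre_rerank, relevant, 5)
precision_before = precision_at_k(pre_rerank, relevant, 2)
precision_after = precision_at_k(post_rerank, relevant, 2)
```

Here the pool recall is already perfect, so the gain from reranking shows up purely as precision at the top of the list. If pool recall is low, no reranker can recover the missing documents.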


Frequently asked questions

Does reranking significantly increase latency?
With 50 candidates and BGE-reranker-v2-m3 on SIE’s GPU, reranking adds ~80-120ms. For most search and RAG applications this is acceptable given the precision gains. For sub-50ms latency requirements, use a smaller reranker or fewer candidates.

Can I use a reranker without a vector database?
Yes. You can pass any list of documents to the reranker: BM25 results, keyword search results, or a manually curated list. The reranker doesn’t care how the candidates were retrieved.

Should the reranker model match the embedding model?
No, they operate independently. Using BGE-M3 for embedding and BGE-reranker-v2-gemma for reranking is a valid and high-performing combination.

