What is a Reranking Pipeline?
A reranking pipeline is a two-stage retrieval architecture where a fast first-stage retriever (embedding model + vector DB) fetches a broad set of candidates, and a slower but more accurate second-stage reranker (cross-encoder) re-scores and reorders them. The result is retrieval that combines the scalability of approximate nearest neighbor (ANN) search with the precision of deep query-document interaction. This is the standard approach for production search and RAG systems requiring high accuracy.
Why use a reranking pipeline instead of just a retriever?
Embedding models are bi-encoders: they encode query and documents independently and compare vectors. This is fast and scalable, but misses fine-grained relevance signals that require seeing the query and document together.
A reranker processes them jointly and catches what the retriever misses:
- A document about “Python snakes” retrieved for a query about “Python programming”
- A legal clause that is semantically nearby but not specifically relevant to the query’s precise requirement
- A document that matches the topic but answers a different question
Adding a reranker to the pipeline improves precision significantly at the cost of extra latency (typically 50-200ms), which is acceptable for most production search and RAG workloads.
How does a reranking pipeline work?
User query
  │
  ▼
[Embedding model]   ← encodes query to vector
  │
  ▼
[Vector database]   ← ANN search, returns top-100 candidates
  │
  ▼
[Reranker]          ← scores each (query, candidate) pair jointly
  │
  ▼
Top-K reranked results
  │
  ▼
[LLM / answer generation] (for RAG)

The reranker only processes the top-N candidates from first-stage retrieval (typically 20-100), not the full corpus, making it tractable despite its higher per-pair cost.
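The flow above can be sketched independently of any particular model or database. The snippet below is a minimal illustration, not the SIE implementation: `two_stage_search`, `toy_retrieve`, `toy_score`, and `CORPUS` are hypothetical names, and the toy functions use simple word overlap as stand-ins for a real ANN search and a real cross-encoder.

```python
from typing import Callable, Sequence


def two_stage_search(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],
    score_pair: Callable[[str, str], float],
    first_stage_k: int = 50,
    final_k: int = 5,
) -> list[tuple[str, float]]:
    """Fetch first_stage_k candidates, rerank them, and return the best final_k."""
    candidates = retrieve(query, first_stage_k)            # stage 1: broad and cheap
    scored = [(doc, score_pair(query, doc)) for doc in candidates]  # stage 2: one call per pair
    scored.sort(key=lambda pair: pair[1], reverse=True)    # highest score first
    return scored[:final_k]


# Toy stand-ins (assumptions) so the sketch runs without a model or database.
CORPUS = [
    "Python snakes are non-venomous constrictors",
    "Python programming tutorial for beginners",
    "Java programming tutorial",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    # Stand-in for ANN search: keep documents sharing at least one query word.
    words = set(query.lower().split())
    return [d for d in CORPUS if words & set(d.lower().split())][:k]


def toy_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words found in the document.
    words = query.lower().split()
    doc_words = set(doc.lower().split())
    return sum(w in doc_words for w in words) / len(words)


results = two_stage_search(
    "python programming", toy_retrieve, toy_score, first_stage_k=3, final_k=2
)
for doc, score in results:
    print(f"{score:.2f}  {doc}")
```

Passing the retriever and scorer in as callables keeps the two stages swappable, so the reranker can be changed or the candidate count tuned without touching first-stage retrieval.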
Building a reranking pipeline with SIE
from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
query = "What are the indemnification obligations in a SaaS agreement?"
# Stage 1: fast embedding retrieval
# (vector_db is a placeholder for your vector database client, defined elsewhere)
query_vector = client.encode("BAAI/bge-m3", Item(text=query), is_query=True)["dense"]
candidates = vector_db.search(query_vector, top_k=50)
candidate_texts = [c.text for c in candidates]

# Stage 2: rerank with cross-encoder
score_result = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(id=str(i), text=t) for i, t in enumerate(candidate_texts)],
)
id_to_text = {str(i): t for i, t in enumerate(candidate_texts)}

# Take top 5 for LLM context (scores sorted by relevance, rank 0 = best)
top_chunks = [id_to_text[e["item_id"]] for e in score_result["scores"][:5]]

Both the embedding model and reranker run on the same SIE cluster: one deployment, two models, all within your cloud account.
First-stage retrieval size: how many candidates to fetch?
The number of candidates passed to the reranker affects both quality and latency:
| Candidates (k) | Recall improvement | Reranker latency |
|---|---|---|
| 10 | Baseline | ~20ms |
| 20 | +5-10% | ~40ms |
| 50 | +10-15% | ~100ms |
| 100 | +12-18% | ~200ms |
More candidates → higher recall (more relevant docs in the pool) → better reranker output. Returns diminish as the pool grows, and gains typically plateau around 50-100 candidates, while reranker latency keeps rising roughly in proportion to the candidate count. For latency-sensitive applications, 20-50 is a practical sweet spot.
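The latency column above works out to roughly 2 ms per candidate. As a rough sketch, assuming latency scales linearly at that rate (an approximation read off the table, not a measured guarantee), the largest candidate count that fits a latency budget can be chosen as follows. `max_candidates_for_budget` and `MS_PER_CANDIDATE` are hypothetical names introduced for illustration.

```python
from typing import Optional, Sequence

# Assumption: ~2 ms per candidate, as implied by the table (10 -> ~20ms, 100 -> ~200ms).
# Measure on your own hardware and model before relying on this figure.
MS_PER_CANDIDATE = 2.0


def max_candidates_for_budget(
    budget_ms: float,
    ms_per_candidate: float = MS_PER_CANDIDATE,
    options: Sequence[int] = (10, 20, 50, 100),
) -> Optional[int]:
    """Return the largest candidate count whose estimated rerank latency fits the budget."""
    fitting = [k for k in options if k * ms_per_candidate <= budget_ms]
    return max(fitting) if fitting else None


for budget in (10, 50, 100, 250):
    print(budget, "ms budget ->", max_candidates_for_budget(budget), "candidates")
```

A return value of `None` signals that even the smallest option exceeds the budget, which is the case where a smaller reranker is the better lever.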
Which reranker should you use?
| Model | Size | Accuracy | Latency | Best for |
|---|---|---|---|---|
| BGE-reranker-base | 278M | Good | Fast | High-throughput production |
| BGE-reranker-large | 560M | Better | Medium | Balanced production |
| BGE-reranker-v2-m3 | 570M | High | Medium | Multilingual |
| BGE-reranker-v2-gemma | 2.5B | Highest | Slower | Maximum accuracy |
| Jina Reranker v2 | 278M | Good | Fast | Lightweight option |
For most production RAG systems, BGE-reranker-v2-m3 provides the best accuracy-latency trade-off, especially if your content is multilingual.
Reranking pipeline vs hybrid search: which to prioritise?
These are complementary, not competing:
- Hybrid search improves first-stage recall: more relevant documents enter the candidate pool
- Reranking improves precision: the right documents are at the top of the final list
The optimal pipeline for high-accuracy production systems is:
Hybrid retrieval (dense + sparse) → Reranker → LLM generation

Start with a dense retrieval + reranker pipeline. Add hybrid search once you’ve validated the retrieval quality improvement justifies the additional complexity.
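When hybrid retrieval is added, the dense and sparse result lists have to be merged into one candidate pool before reranking. The text does not say which fusion method to use; one common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method this pipeline prescribes. The function name and document IDs are hypothetical, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list using RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked well in
            # several lists accumulate the highest totals.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["d1", "d2", "d3"]   # hypothetical IDs from dense (embedding) retrieval
sparse_hits = ["d3", "d1", "d4"]  # hypothetical IDs from sparse (keyword) retrieval
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that dense similarity scores and sparse scores live on different scales. The fused list then becomes the candidate set passed to the reranker.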
Measuring reranking pipeline quality
| Metric | What it measures |
|---|---|
| NDCG@K | Overall ranking quality: rewards placing relevant docs higher in the top K |
| MRR@K | How high the first relevant result appears |
| Precision@K | Of top-K results, fraction that are relevant |
| Recall@K (pre-rerank) | Coverage before reranking: are relevant docs in the pool? |
Measure recall@100 before reranking and precision@5 after. This tells you whether your first-stage retrieval is finding relevant docs (recall) and whether your reranker is surfacing them at the top (precision).
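The metrics in the table can be computed with a few short functions. This is a minimal sketch assuming binary relevance labels (each document is either relevant or not) and a single query; the function names and sample IDs are illustrative. In practice each value is averaged over a set of evaluation queries, which is where the "mean" in MRR comes from.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(d in relevant for d in ranked[:k]) / len(relevant)


def reciprocal_rank_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1 / rank of the first relevant result in the top k (0.0 if none)."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d", "e"]  # hypothetical ranking, best first
relevant = {"b", "d", "z"}          # hypothetical labels; "z" was never retrieved

print("precision@5:", precision_at_k(ranked, relevant, 5))
print("recall@5:   ", recall_at_k(ranked, relevant, 5))
print("RR@5:       ", reciprocal_rank_at_k(ranked, relevant, 5))
print("NDCG@5:     ", round(ndcg_at_k(ranked, relevant, 5), 3))
```

Running `recall_at_k` on the first-stage output and `precision_at_k` or `ndcg_at_k` on the reranked output separates retrieval misses (the document never entered the pool) from ranking misses (it entered but was not promoted).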
Frequently asked questions
Does reranking significantly increase latency?
With 50 candidates and BGE-reranker-v2-m3 on SIE’s GPU, reranking adds ~80-120ms. For most search and RAG applications this is acceptable given the precision gains. For sub-50ms latency requirements, use a smaller reranker or fewer candidates.
Can I use a reranker without a vector database?
Yes. You can pass any list of documents to the reranker: BM25 results, keyword search results, or a manually curated list. The reranker doesn’t care how the candidates were retrieved.
Should the reranker model match the embedding model?
No, they operate independently. Using BGE-M3 for embedding and BGE-reranker-v2-gemma for reranking is a valid and high-performing combination.