
How to rerank search results with SIE

Reranking improves search quality by scoring query-document pairs with cross-attention. Unlike embedding similarity, a cross-encoder sees both the query and document together in a single forward pass, which enables deeper semantic matching and more accurate relevance scoring.

SIE’s score primitive wraps this in a single API call:

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
query = Item(text="What is machine learning?")
items = [
    Item(text="Machine learning is a subset of AI that learns from data."),
    Item(text="The weather forecast predicts rain tomorrow."),
    Item(text="Deep neural networks power modern ML systems."),
]
result = client.score("BAAI/bge-reranker-v2-m3", query, items)
for entry in result["scores"]:
    print(f"Rank {entry['rank']}: {entry['score']:.3f}")

For model recommendations, see the Reranker Models page or the full model catalog.


Use reranking when:

  • First-stage retrieval returns good candidates but imperfect ordering
  • You are retrieving 50 to 100 candidates and want only the top 10
  • Query-document relevance requires deep semantic understanding

Skip reranking when:

  • You need sub-10ms latency (reranking typically adds 20 to 100ms)
  • Your retrieval quality is already high enough
  • You are processing millions of documents (rerank a subset instead)

The standard pattern is to retrieve a broad set of candidates with embeddings, then rerank only the top candidates with a cross-encoder:

# Stage 1: fast retrieval from your vector database
query_text = "What is machine learning?"
query_embedding = client.encode(
    "BAAI/bge-m3",
    Item(text=query_text),
    is_query=True,
)
# ...search your vector DB with query_embedding and collect the top 100 hits
# into top_100_docs, a list of dicts that each carry a "text" field...
# Stage 2: accurate reranking of those candidates
result = client.score(
    "BAAI/bge-reranker-v2-m3",
    query=Item(text=query_text),
    items=[Item(id=f"doc-{i}", text=doc["text"]) for i, doc in enumerate(top_100_docs)],
)
# result["scores"] is sorted by relevance, so the first 10 entries are the top 10
top_10_ids = [entry["item_id"] for entry in result["scores"][:10]]

This approach typically improves ranking quality while keeping cross-encoder cost bounded, because only the retrieved candidates are reranked rather than your entire corpus. See Reranker Models for recommended model pairings.
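
The two-stage pattern can also be wrapped in a small helper. The following is a minimal sketch, not part of sie_sdk: rerank_top_k, ScoreFn, and fake_score_fn are hypothetical names, and score_fn is assumed to be any callable that returns a dict shaped like the ScoreResult described on this page. Candidates are assumed to be dicts with "id" and "text" keys.

```python
from typing import Callable, List, Sequence

# Hypothetical type: (query_text, docs) -> {"scores": [{"item_id", "score", "rank"}, ...]}
ScoreFn = Callable[[str, Sequence[dict]], dict]


def rerank_top_k(score_fn: ScoreFn, query_text: str, candidates: Sequence[dict],
                 k: int = 10, max_candidates: int = 100) -> List[dict]:
    """Rerank at most max_candidates first-stage hits and keep the best k."""
    shortlist = list(candidates[:max_candidates])  # cap the cross-encoder workload
    if not shortlist:
        return []
    result = score_fn(query_text, shortlist)
    by_id = {doc["id"]: doc for doc in shortlist}
    # Sort by rank explicitly rather than relying on response order.
    ordered = sorted(result["scores"], key=lambda e: e["rank"])
    return [by_id[e["item_id"]] for e in ordered[:k]]


def fake_score_fn(query_text, docs):
    """Stand-in scorer for local testing: counts words shared with the query."""
    q = set(query_text.lower().split())
    scored = [(len(q & set(d["text"].lower().split())), d["id"]) for d in docs]
    scored.sort(key=lambda t: -t[0])
    return {"scores": [{"item_id": i, "score": float(s), "rank": r}
                       for r, (s, i) in enumerate(scored)]}


# With SIE, score_fn could wrap the call shown above, for example:
# score_fn = lambda q, docs: client.score(
#     "BAAI/bge-reranker-v2-m3",
#     query=Item(text=q),
#     items=[Item(id=d["id"], text=d["text"]) for d in docs],
# )
candidates = [
    {"id": "a", "text": "weather rain tomorrow"},
    {"id": "b", "text": "machine learning from data"},
    {"id": "c", "text": "what is machine learning"},
]
top = rerank_top_k(fake_score_fn, "what is machine learning", candidates, k=2)
```

Passing the scorer in as a callable keeps the helper testable without a running server; the max_candidates cap is where you would trade recall against reranking latency.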


The ScoreResult contains:

| Field | Type | Description |
| --- | --- | --- |
| model | str | Model used for scoring |
| query_id | str or None | Query ID if provided |
| scores | list[ScoreEntry] | Scored and ranked results, sorted by relevance |

Each ScoreEntry contains:

| Field | Type | Description |
| --- | --- | --- |
| item_id | str or None | Document ID from input, or auto-generated as item-N |
| score | float | Relevance score (higher means more relevant) |
| rank | int | Rank position (0 is most relevant) |
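
For type checking in your own code, the two shapes above can be mirrored as TypedDicts. This is a local sketch based only on the fields listed here; the SDK may ship its own type definitions, and best_entry and the sample values are illustrative.

```python
from typing import List, Optional, TypedDict


class ScoreEntry(TypedDict):
    item_id: Optional[str]   # ID from the input item, or auto-generated "item-N"
    score: float             # higher means more relevant
    rank: int                # 0 is the most relevant


class ScoreResult(TypedDict):
    model: str
    query_id: Optional[str]
    scores: List[ScoreEntry]


def best_entry(result: ScoreResult) -> Optional[ScoreEntry]:
    """Return the entry with rank 0, or None if nothing was scored."""
    return next((e for e in result["scores"] if e["rank"] == 0), None)


# Illustrative values only, not real model output.
example: ScoreResult = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query_id": None,
    "scores": [
        {"item_id": "item-0", "score": 0.91, "rank": 0},
        {"item_id": "item-1", "score": 0.02, "rank": 1},
    ],
}
```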

Assign IDs to your items so you can map reranked results back to your original records:

query = Item(id="q1", text="What is Python?")
items = [
    Item(id="doc-1", text="Python is a programming language."),
    Item(id="doc-2", text="Snakes are reptiles."),
    Item(id="doc-3", text="Python was created by Guido van Rossum."),
]
result = client.score("BAAI/bge-reranker-v2-m3", query, items)
for entry in result["scores"]:
    print(f"{entry['item_id']}: rank={entry['rank']}, score={entry['score']:.3f}")
# doc-1: rank=0, score=0.891
# doc-3: rank=1, score=0.756
# doc-2: rank=2, score=0.012
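
Once items carry IDs, joining scores back onto your own records is a dictionary lookup. A minimal sketch, assuming your records live in a dict keyed by the same IDs you passed to Item; attach_scores is a hypothetical helper and the sample result simply copies the illustrative output above.

```python
def attach_scores(records_by_id, score_result):
    """Return original records in reranked order, each with score and rank attached."""
    ordered = sorted(score_result["scores"], key=lambda e: e["rank"])
    merged = []
    for entry in ordered:
        record = records_by_id.get(entry["item_id"])
        if record is None:
            continue  # ID not in our store (for example an auto-generated item-N)
        merged.append({**record, "score": entry["score"], "rank": entry["rank"]})
    return merged


records = {
    "doc-1": {"title": "Python language"},
    "doc-2": {"title": "Reptiles"},
    "doc-3": {"title": "Python history"},
}
sample_result = {
    "scores": [
        {"item_id": "doc-1", "score": 0.891, "rank": 0},
        {"item_id": "doc-3", "score": 0.756, "rank": 1},
        {"item_id": "doc-2", "score": 0.012, "rank": 2},
    ]
}
reranked = attach_scores(records, sample_result)
```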

The server returns msgpack by default. To get a JSON response, send an Accept: application/json header:

curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"text": "Machine learning uses algorithms to learn from data."},
      {"text": "The weather is sunny today."}
    ]
  }'

See the HTTP API Reference.
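
The same request can be built from Python with only the standard library. This sketch mirrors the curl call above (same path, headers, and body); build_score_request is a hypothetical helper, and the response handling in the trailing comment assumes the server answers with the JSON form of the ScoreResult.

```python
import json
import urllib.request


def build_score_request(base_url, model, query_text, doc_texts):
    """Build (but do not send) a JSON POST to /v1/score/<model>."""
    payload = {
        "query": {"text": query_text},
        "items": [{"text": t} for t in doc_texts],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/score/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_score_request(
    "http://localhost:8080",
    "BAAI/bge-reranker-v2-m3",
    "What is machine learning?",
    ["Machine learning uses algorithms to learn from data.",
     "The weather is sunny today."],
)
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```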


Batch size matters. A cross-encoder scores every query-document pair separately, so the work grows linearly with the number of candidates: 100 documents means 100 pairs to run through the model. Keep candidate sets reasonable (50 to 200).

Latency vs quality. Smaller reranker models are faster but less accurate. Larger models like BAAI/bge-reranker-v2-m3 give better quality at higher latency. See Reranker Models for a comparison.

GPU batching. The SIE server batches concurrent requests automatically, so GPU utilisation improves under load.
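
One way to give the server concurrent requests to batch is to issue several score calls from a thread pool. This is a sketch under assumptions: score_many and stub_score_fn are hypothetical, and whether a single SIEClient can be shared safely across threads is not stated here, so create one client per worker if in doubt.

```python
from concurrent.futures import ThreadPoolExecutor


def score_many(score_fn, queries, max_workers=8):
    """Run score_fn(query_text, docs) for each (query_text, docs) pair concurrently.

    Results come back in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_fn, q, docs) for q, docs in queries]
        return [f.result() for f in futures]


def stub_score_fn(query_text, docs):
    """Stand-in for a real scoring call; echoes the query so ordering can be checked."""
    return {"query_id": query_text, "scores": []}


results = score_many(stub_score_fn, [("q1", []), ("q2", []), ("q3", [])])
```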


What is the difference between reranking and embedding similarity?
Embedding similarity compares a query vector to document vectors that were computed independently of the query, which is fast but less precise. Reranking (cross-encoding) processes the query and document together in one pass, allowing the model to attend to both simultaneously. This is slower but significantly more accurate.
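
To make the contrast concrete, here is a toy sketch of the embedding-similarity side. The 2-D vectors stand in for real embeddings and rank_by_embedding is a hypothetical helper. The point is that document vectors exist before the query arrives, whereas a cross-encoder needs each (query text, document text) pair at scoring time and can precompute nothing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_embedding(query_vec, doc_vecs):
    """Rank document IDs by similarity to the query vector, best first."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors, computed "ahead of time" without seeing the query.
doc_vecs = {"doc-1": [1.0, 0.0], "doc-2": [0.0, 1.0], "doc-3": [1.0, 1.0]}
ranking = rank_by_embedding([1.0, 0.2], doc_vecs)
```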

When should I use ColBERT reranking instead of a cross-encoder?
ColBERT (multi-vector reranking) is faster than cross-encoders because it pre-computes document representations. Use it when you need better-than-dense quality without the full latency of a cross-encoder. See Multi-Vector Reranking.
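
The scoring step that makes this possible is usually called MaxSim (late interaction): each query token vector is matched against its best document token vector, and those maxima are summed. The sketch below shows that operator on toy vectors; it illustrates the general ColBERT idea and is not SIE's implementation.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token (max dot product), then sum over query tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token vectors; real models emit one vector per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.0]]               # covers only the first
score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

Because the document token vectors do not depend on the query, they can be stored at index time, leaving only the cheap max-and-sum step for query time.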

Which reranker model should I use?
For English, mixedbread-ai/mxbai-rerank-large-v2 is a strong default. For multilingual reranking, use BAAI/bge-reranker-v2-m3. See the Reranker Models guide and model catalog.

Does SIE reranking work with LangChain or LlamaIndex?
Yes. SIE reranking is available through the sie-langchain, sie-llamaindex, and sie-haystack integration packages. See Integrations for setup instructions.
