
Overview

Reranking improves search quality by scoring query-document pairs with cross-attention. Cross-encoders see both query and document together, enabling deeper semantic matching than embedding similarity alone.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

query = Item(text="What is machine learning?")
items = [
    Item(text="Machine learning is a subset of AI that learns from data."),
    Item(text="The weather forecast predicts rain tomorrow."),
    Item(text="Deep neural networks power modern ML systems."),
]

result = client.score("BAAI/bge-reranker-v2-m3", query, items)

for entry in result["scores"]:
    print(f"Rank {entry['rank']}: {entry['score']:.3f}")

Use reranking when:

  • First-stage retrieval returns good candidates but imperfect ordering
  • You retrieve 50-100 candidates and want the top 10
  • Query-document relevance requires deep understanding

Skip reranking when:

  • You need sub-10ms latency (reranking adds 20-100ms)
  • Your retrieval is already high quality
  • You’re processing millions of documents (rerank a subset instead)

The standard pattern: retrieve many candidates with embeddings, rerank the top-k with a cross-encoder.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Stage 1: Retrieve candidates with embeddings
query_text = "What is machine learning?"
query_embedding = client.encode(
    "BAAI/bge-m3",
    Item(text=query_text),
    is_query=True,
)

# ... search your vector database, get top 100 candidates ...

# Stage 2: Rerank top candidates with IDs for tracking
query = Item(text=query_text)
candidates = [
    Item(id=f"doc-{i}", text=doc["text"])
    for i, doc in enumerate(top_100_docs)
]
result = client.score("BAAI/bge-reranker-v2-m3", query, candidates)

# Get top 10 by item_id after reranking
top_10_ids = [entry["item_id"] for entry in result["scores"][:10]]
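To fetch the winning documents, map the reranked IDs back to your retrieval results. This follow-up sketch uses only the names defined above and assumes top_100_docs is indexed in the same order the doc-{i} IDs were assigned:

# Map reranked IDs back to the original retrieval results
id_to_doc = {f"doc-{i}": doc for i, doc in enumerate(top_100_docs)}
top_10_docs = [id_to_doc[item_id] for item_id in top_10_ids]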

The ScoreResult contains:

Field      Type               Description
model      str                Model used for scoring
query_id   str | None         Query ID if provided
scores     list[ScoreEntry]   Scored and ranked results

Each ScoreEntry contains:

Field     Type         Description
item_id   str | None   Document ID (from input or auto-generated as item-N)
score     float        Relevance score (higher = more relevant)
rank      int          Rank position (0 = most relevant)

result = client.score("BAAI/bge-reranker-v2-m3", query, items)

# Scores are pre-sorted by relevance (highest first)
for entry in result["scores"]:
    print(f"Rank {entry['rank']}: {entry['item_id']} score={entry['score']:.3f}")

Track items through reranking with IDs:

query = Item(id="q1", text="What is Python?")
items = [
    Item(id="doc-1", text="Python is a programming language."),
    Item(id="doc-2", text="Snakes are reptiles."),
    Item(id="doc-3", text="Python was created by Guido van Rossum."),
]

result = client.score("BAAI/bge-reranker-v2-m3", query, items)

for entry in result["scores"]:
    print(f"{entry['item_id']}: rank={entry['rank']}, score={entry['score']:.3f}")

# doc-1: rank=0, score=0.891
# doc-3: rank=1, score=0.756
# doc-2: rank=2, score=0.012

The server defaults to msgpack. For JSON responses:

curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"text": "Machine learning uses algorithms to learn from data."},
      {"text": "The weather is sunny today."}
    ]
  }'

Response:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "scores": [
    {"item_id": "item-0", "score": 0.891, "rank": 0},
    {"item_id": "item-1", "score": 0.023, "rank": 1}
  ]
}

Batch size matters. Cross-encoders run one forward pass per query-document pair, so 100 documents means 100 forward passes. Keep candidate sets reasonable (50-200).
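If you do need to score a larger candidate list, one sketch is to split it into chunks and merge by raw score. This assumes scores from separate calls are comparable, which holds when every chunk uses the same model and query since each pair is scored independently.

CHUNK_SIZE = 50  # assumption: tune to your latency budget

all_entries = []
for start in range(0, len(candidates), CHUNK_SIZE):
    chunk = candidates[start:start + CHUNK_SIZE]
    result = client.score("BAAI/bge-reranker-v2-m3", query, chunk)
    all_entries.extend(result["scores"])

# Merge: re-sort globally by score, highest first
all_entries.sort(key=lambda e: e["score"], reverse=True)
top_10 = all_entries[:10]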

Latency vs quality. Smaller models (MiniLM) are faster but less accurate. Larger models (BGE-reranker-v2-m3) give better quality at higher latency.
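Switching the tradeoff is just a matter of passing a different model ID; the MiniLM model name below is illustrative and depends on which rerankers your server has loaded.

# Faster, less accurate (illustrative model name)
fast = client.score("cross-encoder/ms-marco-MiniLM-L-6-v2", query, items)

# Slower, higher quality
best = client.score("BAAI/bge-reranker-v2-m3", query, items)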

GPU utilization. Reranking benefits from batching. The server batches concurrent requests automatically.
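To take advantage of that, issue requests concurrently rather than strictly one after another. This sketch assumes SIEClient can be shared across threads; if not, create one client per worker.

from concurrent.futures import ThreadPoolExecutor

queries = [
    Item(text="What is machine learning?"),
    Item(text="How do neural networks train?"),
    Item(text="What is a cross-encoder?"),
]

# Concurrent requests arrive close together, so the server can batch them
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(client.score, "BAAI/bge-reranker-v2-m3", q, items)
        for q in queries
    ]
    results = [f.result() for f in futures]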
