Self-hosted search inference with SIE

Every search and RAG system runs the same two model calls on a loop: embed the text, then rerank the candidates. At low volume you rent both from an API and never think about it. At production volume that loop becomes your largest inference line item, and every query you send out is a query you no longer control.

SIE (the Superlinked Inference Engine) is an open-source, Apache 2.0 inference server that runs the whole retrieval stack on your own infrastructure. Dense and sparse embeddings, multi-vector ColBERT, multimodal search, and cross-encoder reranking all come from one cluster you install in your cloud, behind two primitives: encode and score. This page ties together every search capability, the models behind each one, the migration paths off rented inference, and the source.

In short:

  • Cut per-query inference cost. Move high-volume embedding and reranking off metered APIs and onto GPUs you already pay for. Superlinked’s published comparison puts reranking at roughly $8.50 per 1B tokens self-hosted, versus $87 on Cohere Rerank, and embeddings at about $0.50 instead of $20 on a managed API.
  • Keep queries and corpora inside your cloud. Search runs over your most sensitive data, and with SIE none of it is sent to a third-party endpoint.
  • Replace N single-model containers with one cluster. Dense, sparse, multi-vector, and rerank models share a single deployment instead of one container per model.

Star SIE on GitHub · Read the encode docs · Browse the model catalog

The problem: rented inference for search does not scale cleanly

Search inference is the highest-frequency model traffic most products run. When every embed and every rerank is a metered API call, three problems compound as traffic grows.

Cost. Embedding and reranking are repetitive, predictable, and enormous in volume, which is the worst possible shape for per-token pricing. Reranking is especially punishing because a cross-encoder runs one forward pass per candidate, so reranking 100 results means paying for 100 query-document pairs of tokens on every query. The managed-API premium on that work is large and it scales linearly with usage.

Control. Search runs over your private corpus and your users’ raw queries, which is exactly the data customers and regulators care most about. Sending it to a third-party endpoint gives up control of where that data lives and who processes it.

Portability. A retrieval stack pinned to one provider’s API does not move with you into a customer’s cloud or an air-gapped environment, which is where enterprise deployments increasingly need to run.

Self-hosting the small, task-specific models that do this work wins all three back, and the models are good enough that there is little quality reason left to rent. The catalog covers 35 encode and 14 score models, all on open checkpoints.

The cost angle: stop paying per token for embed-and-rerank

The two calls at the heart of search are the two best candidates to bring in-house, because they run constantly and do not need a frontier model.

SIE is built to push those calls through your GPUs at high utilization rather than across a metered boundary:

  • One cluster, not N containers. A single SIE instance serves dense, sparse, multi-vector, and rerank models together, loading each on demand and evicting idle ones with LRU. The usual alternative, one Text Embeddings Inference container per model, means N deployments, N health checks, and N autoscalers for the same coverage.
  • Full batches per GPU pass. SIE pulls concurrent requests from one shared work queue so mixed sizes pack into full batches. In Superlinked’s benchmarks this reaches about 89% GPU efficiency versus roughly 51% for the route-then-batch pattern, around 1.8x the throughput per GPU at the same latency.
  • Storage is a cost lever too. Multi-vector retrieval trades storage for quality, and quantization plus MUVERA give you ways to claw most of that back, so higher-quality retrieval does not force a storage bill to match.
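The load-on-demand and LRU eviction behaviour in the first bullet can be pictured with a small cache. This is a minimal sketch of the general technique, not SIE's implementation: the `LRUModelCache` class, its `loader` callable, and the capacity of two are all assumptions made for illustration.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy LRU cache: keeps at most `capacity` models loaded at once."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical callable: model_id -> loaded model
        self._models = OrderedDict()  # ordered from least to most recently used
        self.evicted = []             # ids evicted so far, kept for inspection

    def get(self, model_id):
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: make room by evicting the least recently used model.
        if len(self._models) >= self.capacity:
            old_id, _ = self._models.popitem(last=False)
            self.evicted.append(old_id)
        self._models[model_id] = self.loader(model_id)
        return self._models[model_id]

    def resident(self):
        return list(self._models)


# Stand-in loader that returns a string instead of real weights.
cache = LRUModelCache(capacity=2, loader=lambda model_id: f"<weights for {model_id}>")
cache.get("Qwen/Qwen3-Embedding-0.6B")
cache.get("naver/splade-v3")
cache.get("Qwen/Qwen3-Embedding-0.6B")            # refreshes the dense model
cache.get("mixedbread-ai/mxbai-rerank-large-v2")  # evicts naver/splade-v3
print(cache.resident())
```

With room for two models, the sparse model is the one evicted, because the dense model was touched more recently.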

The published cost comparison makes the per-token gap concrete: embeddings at about $0.50 per 1B tokens on your own cloud versus $20 on a managed API, and reranking at about $8.50 per 1B tokens versus $87 on Cohere Rerank or $43 on Vertex AI Ranking. The exact figures matter less than the shape: the self-hosted line is flat against hardware while the rented line tracks your traffic.
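To see how those per-token prices turn into a monthly bill, here is a rough back-of-the-envelope calculation. The prices are the ones quoted above. The workload (10M queries a month, 100 candidates per query, 16-token queries, 240-token chunks) is a hypothetical chosen only for illustration, and the sketch assumes the self-hosted figure holds at your utilization.

```python
# Published per-1B-token prices quoted in this post (USD).
PRICE_PER_1B = {
    "rerank_self_hosted": 8.50,
    "rerank_cohere": 87.00,
    "embed_self_hosted": 0.50,
    "embed_managed": 20.00,
}


def rerank_tokens_per_query(candidates, query_tokens, doc_tokens):
    # A cross-encoder reads the query and one document together, once per candidate.
    return candidates * (query_tokens + doc_tokens)


def monthly_cost(tokens_per_month, price_key):
    return tokens_per_month / 1e9 * PRICE_PER_1B[price_key]


# Hypothetical workload, not a measured one.
per_query = rerank_tokens_per_query(candidates=100, query_tokens=16, doc_tokens=240)
monthly_tokens = per_query * 10_000_000

rented = monthly_cost(monthly_tokens, "rerank_cohere")
self_hosted = monthly_cost(monthly_tokens, "rerank_self_hosted")
print(f"{monthly_tokens / 1e9:.0f}B tokens: ${rented:,.0f} rented vs ${self_hosted:,.0f} self-hosted")
```

Under these assumptions the workload is 256B rerank tokens a month, which comes to $22,272 at the managed price and $2,176 at the self-hosted price.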

The security and control angle: queries and corpora stay in your cloud

Search is where a product’s private data and its users’ intent meet, so it is the part of the stack where data residency matters most. With SIE, embedding and reranking run on the host that runs the cluster, and neither the corpus nor the live query is sent to an outside provider.

That control extends to operations without weakening the boundary:

  • Per-request model choice. Route one language or domain to one model and another to a different model on each request. With single-model servers you build a gateway in front; with SIE the cluster is the gateway.
  • Add and change models without sending anything out. New encoders and rerankers are hot-loaded through the Config API or a GitOps workflow, with the gateway waiting for worker acknowledgement before routing traffic.
  • Run air-gapped. Model-weight snapshots let the whole retrieval stack run from mirrored registries with no public network access.
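Per-request model choice can be as small as a lookup in application code. The sketch below assumes a routing policy that is not from SIE itself: the `ROUTES` table, the `pick_model` helper, and the choice of which language maps to which model are hypothetical. The model identifiers are taken from the catalog on this page.

```python
DEFAULT_MODEL = "intfloat/multilingual-e5-large"

# Hypothetical routing table: (language, domain) -> model identifier.
ROUTES = {
    ("en", "general"): "Qwen/Qwen3-Embedding-0.6B",
    ("en", "support"): "google/embeddinggemma-300m",
}


def pick_model(language, domain="general", routes=ROUTES, default=DEFAULT_MODEL):
    """Return the model id for this request, falling back to a multilingual default."""
    key = (language.lower(), domain.lower())
    return routes.get(key, default)


print(pick_model("en"))             # routed by the table
print(pick_model("de", "support"))  # no route, falls back to the default
```

The returned identifier would then be passed as the first argument to `client.encode(...)`, in the same way the model name is passed in the examples on this page.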

The same cluster installs in your SaaS cloud or inside a customer’s cloud, so the data story is identical wherever you deploy.

The canonical recipe: encode, then score

Most production search is two-stage retrieval. You retrieve a broad candidate set with embeddings, then rerank the top candidates with a cross-encoder. Both stages are one SIE call.

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
# Stage 1: embed the query, search your vector DB for ~100 candidates
q = client.encode("Qwen/Qwen3-Embedding-0.6B", Item(text="how do refunds work?"), is_query=True)
# ... retrieve top_100 from your vector database using q["dense"] ...
# Stage 2: rerank those candidates with a cross-encoder
result = client.score(
    "mixedbread-ai/mxbai-rerank-large-v2",
    Item(text="how do refunds work?"),
    [Item(id=f"doc-{i}", text=d["text"]) for i, d in enumerate(top_100)],
)
top_10 = [entry["item_id"] for entry in result["scores"][:10]]

This lifts precision without reranking the whole corpus. Everything below is a variation on these two primitives.
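The recipe leaves the candidate-retrieval step to your vector database. For a small corpus or a quick local test, that step could be approximated with a brute-force cosine search, as in the sketch below. The `top_k_by_cosine` helper and the three-dimensional toy vectors are illustrative assumptions standing in for real embeddings and a real index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors; in practice these would be the stored dense embeddings.
doc_vecs = {
    "doc-0": [1.0, 0.0, 0.0],
    "doc-1": [0.8, 0.6, 0.0],
    "doc-2": [0.0, 0.0, 1.0],
}
candidates = top_k_by_cosine([1.0, 0.1, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in candidates])
```

A linear scan like this costs one comparison per document, which is why production systems hand the step to an approximate index and keep the cross-encoder for the short list.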

Capabilities: the retrieval stack, one SDK call each

Every block is the same client.encode(...) or client.score(...) call. Swap the model identifier to swap the architecture; SIE hot-loads the weights on first use.

1. Dense embeddings

Fixed-dimension vectors that capture meaning, for semantic search, RAG, and recommendations. Encode the query side asymmetrically when the model supports it.

Models: Qwen/Qwen3-Embedding-0.6B, intfloat/multilingual-e5-large, google/embeddinggemma-300m, BAAI/bge-m3.

docs = [Item(id=f"doc-{i}", text=t) for i, t in enumerate(corpus)]
vectors = client.encode("Qwen/Qwen3-Embedding-0.6B", docs) # store in your vector DB
q = client.encode("Qwen/Qwen3-Embedding-0.6B", Item(text="how do refunds work?"), is_query=True)

Docs: Encode overview · Models: catalog

2. Sparse and hybrid search

Sparse vectors assign weights to vocabulary tokens, so exact term matching (product codes, proper nouns, acronyms) works alongside semantic search. Run a dense encoder and a sparse model, then combine scores in your vector database. BAAI/bge-m3 can also emit dense and sparse in one call when you want a single-model path.

Models: naver/splade-v3, Qwen/Qwen3-Embedding-0.6B, BAAI/bge-m3, opensearch-project/opensearch-neural-sparse-*.

# Dense + sparse as two calls, then combine in your vector DB
dense = client.encode(
    "Qwen/Qwen3-Embedding-0.6B",
    Item(text="ACME-1000 refund policy"),
    is_query=True,
)
sparse = client.encode(
    "naver/splade-v3",
    Item(text="ACME-1000 refund policy"),
    is_query=True,
)
# Most databases support: final_score = alpha * dense_score + (1 - alpha) * sparse_score

Sparse retrieval is supported natively by Elasticsearch, OpenSearch, Qdrant, Weaviate, Milvus, and Pinecone.
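If your database does not fuse the two result lists for you, the weighted formula in the code comment can be applied client-side. The sketch below is one possible approach, not a description of any particular database: it assumes min-max normalization per query (dense and sparse scores usually live on different scales), treats a document missing from one list as scoring zero there, and uses made-up scores.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """final_score = alpha * dense_score + (1 - alpha) * sparse_score."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Made-up scores: cosine similarities on the dense side, term weights on the sparse side.
dense = {"doc-a": 0.82, "doc-b": 0.79, "doc-c": 0.40}
sparse = {"doc-b": 12.0, "doc-c": 3.0, "doc-d": 9.0}
fused = hybrid_scores(dense, sparse, alpha=0.6)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Here doc-b ranks first because it scores well on both sides, which is the behaviour hybrid search is meant to reward. The right `alpha` depends on your data and is worth tuning against an evaluation set.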

Docs: Sparse & hybrid search · Models: catalog

3. Multi-vector and ColBERT

Per-token embeddings with late interaction (MaxSim) scoring capture fine-grained term matching that single-vector dense embeddings miss. The SDK ships a maxsim helper, and ColBERT models expand short queries with MASK tokens automatically.

Models: jinaai/jina-colbert-v2, lightonai/GTE-ModernColBERT-v1, answerdotai/answerai-colbert-small-v1, mixedbread-ai/mxbai-colbert-large-v1.

from sie_sdk.scoring import maxsim
q = client.encode("jinaai/jina-colbert-v2", Item(text="what is late interaction?"),
                  output_types=["multivector"], is_query=True)
docs = client.encode("jinaai/jina-colbert-v2", candidate_items, output_types=["multivector"])
scores = maxsim(q["multivector"], [d["multivector"] for d in docs])

If you would rather use a standard HNSW index, the muvera profile converts multi-vector output to a fixed-dimension dense vector for ColBERT-quality retrieval on databases without multi-vector support, at a documented 5 to 10% quality trade-off.
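For readers new to late interaction, the MaxSim operation itself is short to state: each query token takes its best match among the document's tokens, and those best matches are summed. The standalone function below illustrates that definition in plain Python. It is not the SDK's `maxsim` helper, and the two-dimensional token vectors are toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the highest dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings: two query tokens, documents of different lengths.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.7, 0.3]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)
```

doc_a wins because it has a strong match for each query token separately, which a single pooled vector would tend to blur.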

Docs: Multi-vector & ColBERT · Models: catalog

4. Multimodal embeddings

Encode images and text into a shared space for cross-modal search, so a text query can retrieve images and vice versa. Same encode call, an image-capable model.

Models: google/siglip-so400m-patch14-384, laion/CLIP-ViT-H-14-laion2B-s32B-b79K, openai/clip-vit-large-patch14.

text_vec = client.encode("google/siglip-so400m-patch14-384", Item(text="a red leather handbag"))
img_vec = client.encode("google/siglip-so400m-patch14-384",
                        Item(images=[{"data": img_bytes, "format": "jpeg"}]))
# text_vec and img_vec live in the same space; compare directly

Docs: Multimodal · Models: catalog

5. Quantization

Cut the storage and memory cost of your index by quantizing embeddings, so higher-quality vectors do not force a proportional storage bill. This is the lever that keeps multi-vector and large-dimension models affordable at corpus scale.
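As a rough illustration of the storage lever, the sketch below applies symmetric int8 scalar quantization to one vector and compares index sizes. It shows the general technique only. Which quantization schemes SIE exposes is covered in the Quantization docs, and the function names, the sample vector, and the corpus size here are assumptions.

```python
def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: returns (codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def index_bytes(num_vectors, dims, bytes_per_dim):
    return num_vectors * dims * bytes_per_dim


vec = [0.12, -0.50, 0.33, 0.05]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

# Hypothetical corpus: 1M vectors of 1024 dimensions.
float32_size = index_bytes(1_000_000, 1024, 4)
int8_size = index_bytes(1_000_000, 1024, 1)
print(codes, float32_size // int8_size)
```

For this hypothetical corpus the index shrinks from about 4.1 GB to about 1.0 GB, and the per-dimension rounding error stays within half a quantization step.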

Docs: Quantization

6. Reranking

Cross-encoders see the query and document together in one forward pass, which is more accurate than comparing embeddings independently. This is stage two of the canonical recipe, and the highest-leverage quality win in most pipelines.

Models: mixedbread-ai/mxbai-rerank-large-v2, jinaai/jina-reranker-v2-base-multilingual, BAAI/bge-reranker-v2-m3, plus ColBERT multi-vector reranking.

query = Item(text="how do refunds work?")
result = client.score("mixedbread-ai/mxbai-rerank-large-v2", query, candidate_chunks)
top_ids = [entry["item_id"] for entry in result["scores"][:10]]

Docs: Score overview · Reranker models · Models: catalog

Search models, all in one cluster

Filter the full catalog by the encode and score tasks, or by output type. Every model below is served by the same SIE instance and loaded on demand.

| Model | Stage | Output | Best for |
| --- | --- | --- | --- |
| Qwen/Qwen3-Embedding-0.6B | Encode | Dense | High throughput, small footprint |
| intfloat/multilingual-e5-large | Encode | Dense | Multilingual dense retrieval |
| google/embeddinggemma-300m | Encode | Dense | Fast, lightweight general-purpose |
| naver/splade-v3 | Encode | Sparse | Purpose-built sparse retrieval (SPLADE) |
| BAAI/bge-m3 | Encode | Dense / Sparse / Multi-Vec | Dense, sparse, and multi-vector in one call |
| jinaai/jina-colbert-v2 | Encode | Multi-Vec | Long-context ColBERT late interaction (8192) |
| lightonai/GTE-ModernColBERT-v1 | Encode | Multi-Vec | ModernBERT late interaction, long context |
| answerdotai/answerai-colbert-small-v1 | Encode | Multi-Vec | Smallest, fastest ColBERT |
| google/siglip-so400m-patch14-384 | Encode | Dense (multimodal) | Text and image in a shared space |
| mixedbread-ai/mxbai-rerank-large-v2 | Score | Score | English cross-encoder reranking |
| jinaai/jina-reranker-v2-base-multilingual | Score | Score | Multilingual reranking |
| Alibaba-NLP/gte-reranker-modernbert-base | Score | Score | Low-latency ModernBERT reranker |
| BAAI/bge-reranker-v2-m3 | Score | Score | Multilingual cross-encoder reranking |

See Choosing a Model for selection guidance and Evals for how SIE measures retrieval quality. The retrieval benchmark example compares full pipelines head to head.

Drop into your existing stack

SIE is the inference layer, not the database or the framework, so it sits behind the tools you already use.

Vector databases: Chroma, Qdrant, Weaviate, and LanceDB. Their founders describe pairing the database with SIE so indexing, scoring, filtering, and ranking models all run in one self-hosted cluster.

Frameworks: LangChain, LlamaIndex, Haystack, and DSPy, each with a retriever and reranker component backed by SIE.

There is also an always-on OpenAI-compatible /v1/embeddings endpoint, so existing embedding clients point at SIE with a URL change.
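Because the endpoint is described as OpenAI-compatible, a client can talk to it with a plain HTTP POST. The sketch below only builds the request, using the standard OpenAI embeddings body (`model` and `input`). The base URL reuses the local address from the examples on this page. Whether your deployment also needs an auth header is not covered here, so none is set.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build (but do not send) a POST to an OpenAI-compatible /v1/embeddings endpoint."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embeddings_request(
    "http://localhost:8080",
    "Qwen/Qwen3-Embedding-0.6B",
    ["how do refunds work?"],
)
print(req.get_method(), req.full_url)

# To send it against a running server (response shape assumed to follow the OpenAI format):
# with urllib.request.urlopen(req) as resp:
#     vectors = [row["embedding"] for row in json.load(resp)["data"]]
```

Existing OpenAI client libraries should need only their base URL changed, which is the migration path the paragraph above describes.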

Docs: Integrations overview · VDB comparison

Migrating off rented search inference

The headline migration is N single-model containers to one cluster. If you run several Text Embeddings Inference containers, SIE serves the same checkpoints from one process, selects the model per request, and exposes typed dense, sparse, and multivector outputs in one call instead of separate endpoints. Staying on the same checkpoint means no re-embedding: the cosine similarity between vectors from TEI’s backend and SIE’s PyTorch backend sits at or above 0.999, so the drift is well below any retrieval-quality threshold.
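Before reusing an existing index after a migration, it is reasonable to check this on a sample of your own texts. The sketch below assumes you have embedded the same texts with the old and the new backend and compares them pairwise. The `min_cosine` helper and the sample vectors are hypothetical, and the 0.999 threshold is the figure quoted above.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_cosine(old_vectors, new_vectors):
    """Worst-case cosine similarity between paired embeddings of the same texts."""
    if len(old_vectors) != len(new_vectors):
        raise ValueError("expected one new vector per old vector")
    return min(cosine(o, n) for o, n in zip(old_vectors, new_vectors))


# Made-up embeddings of the same two texts from the old and the new backend.
old = [[0.10, 0.20, 0.30], [0.50, 0.10, 0.00]]
new = [[0.1001, 0.1999, 0.3002], [0.4998, 0.1003, 0.0001]]

worst = min_cosine(old, new)
safe_to_reuse_index = worst >= 0.999
print(safe_to_reuse_index)
```

Taking the minimum rather than the mean is a deliberately conservative choice: a single badly drifted vector is enough to fail the check.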

The same before-and-after pattern is documented for every common source: TEI, Cohere, OpenAI, Infinity, and Fastembed.

When to keep what you have: for two or three pinned models at high QPS, single-model containers behind an ingress are simpler. SIE earns its place once you have several models in active use, a long tail of sometimes-used rerankers or language variants, or mixed modalities in one request path.

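Runnable examples

The examples cover product search, a retrieval benchmark, MTEB model search, and taxonomy classification.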

Deployment: the same layer in any cloud

SIE installs as a Kubernetes inference cluster inside the environment your application already runs in. The same Docker image runs on a laptop and in production.

# Pull and serve on your own box
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default
# Or deploy the cluster to your VPC
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster
# Use it from anywhere with the SDK
pip install sie-sdk

SIE ships with a Helm chart for EKS and GKE, Terraform modules for AWS, GCP, and Azure, and model-weight snapshots for air-gapped environments.

Docs: Deployment · Air-gapped

Search resource hub

Docs: Encode (dense) · Sparse & hybrid · Multi-vector & ColBERT · Multimodal · Quantization · Score (reranking) · Reranker models

Examples: Product search · Retrieval benchmark · MTEB model search · Taxonomy classification · All examples

Integrations: Chroma · Qdrant · Weaviate · LanceDB · LangChain · LlamaIndex

Migrate: TEI · Cohere · OpenAI · Infinity · Fastembed

Model catalog: Encode models · Score models · Full catalog

GitHub & SDK: superlinked/sie · Quickstart · Python SDK reference

FAQ

How does self-hosting search inference cut cost? Embedding and reranking are high-frequency, repetitive calls, which is the worst fit for per-token pricing. Moving them to GPUs you control turns a usage-linked bill into a fixed hardware cost. SIE serves many models from one cluster, keeps frequent ones resident, evicts idle ones with LRU, and batches concurrent requests for high GPU efficiency. Superlinked’s published comparison puts self-hosted reranking near $8.50 per 1B tokens against $87 on Cohere Rerank, and embeddings near $0.50 against $20 on a managed API.

Can I run hybrid search (dense plus sparse) self-hosted? Yes. Encode with a dense model such as Qwen/Qwen3-Embedding-0.6B and a sparse model such as naver/splade-v3, then combine them in your vector database with a weighted score. BAAI/bge-m3 can also return dense and sparse vectors in a single encode call, and OpenSearch neural sparse models are available as well. Sparse retrieval is supported natively by Elasticsearch, OpenSearch, Qdrant, Weaviate, Milvus, and Pinecone.

Is SIE a self-hosted alternative to TEI or the Cohere and OpenAI APIs? Yes. SIE replaces N single-model TEI containers with one cluster that selects the model per request and exposes typed dense, sparse, and multivector outputs. Cohere reranking and OpenAI embeddings have documented migration paths, and an OpenAI-compatible /v1/embeddings endpoint means existing clients move with a URL change. Staying on the same checkpoint needs no re-embedding.

Does SIE support ColBERT and late-interaction retrieval? Yes. Multi-vector models such as jina-colbert-v2 and GTE-ModernColBERT produce per-token embeddings, and the SDK includes a maxsim helper for late-interaction scoring. The muvera profile converts multi-vector output to a fixed-dimension dense vector so you can get ColBERT-quality retrieval on a standard HNSW index, at a documented 5 to 10% quality trade-off.

Do my queries or corpus leave my infrastructure? No. Embedding and reranking run on the host that runs the cluster, with nothing sent to a third-party endpoint. SIE installs in your SaaS cloud or a customer’s cloud, supports air-gapped deployment from mirrored registries, and lets you add or change models through the Config API without external calls.

The takeaway

Run the search models behind your product on your own terms. Get started on GitHub or read the docs.
