
TEI → SIE

Text Embeddings Inference (TEI) is Hugging Face’s single-model embedding/reranking server. The migration to SIE is the headline use case for platform engineers: N TEI containers → 1 SIE cluster.

  • Multi-model in one process. SIE serves dense, sparse, multivector, rerank, and vision models from one cluster. TEI is one model per container, so N models means N deployments, N health checks, N autoscalers.
  • Query-time model selection. Choose the model on every request. Adding a new model in TEI means redeploying a container; in SIE it means hot-loading via the config API.
  • Typed sparse + multivector outputs. TEI exposes /embed_sparse and /embed_all as separate endpoints; outputs aren’t typed. SIE has typed dense, sparse, and multivector outputs in one call.
  • Apple Silicon. SIE runs with --device mps for local development (see the sketch below). TEI’s CPU image works on macOS but is significantly slower.
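
For local development on a Mac, the same serve entry point shown later on this page can target Metal. A minimal sketch; the exact placement of the --device flag relative to mise run serve may differ in your SIE version:

Terminal window
# Serve an embedding model on Apple Silicon (Metal) for local development
mise run serve -- -m BAAI/bge-small-en-v1.5 --device mps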

“Why not just put N TEI containers behind one ingress?”


Fair question. For two or three stable models, that’s a perfectly good answer and you should not migrate. SIE earns its keep when you have:

  • Several models in active use, where the per-container fixed overhead (RAM, sidecar tax, scrape targets, alert routes) starts to dominate.
  • A long tail of “sometimes” models, like domain rerankers, language variants, experimental checkpoints. SIE’s LRU lets you list them all in one bundle and load on demand. N TEIs means N standing pods or N scale-to-zero cold starts.
  • Mixed modalities in the same request path (dense plus rerank, or dense plus sparse plus multivector). One round-trip to SIE replaces two or three to different TEI services with different DNS, different timeouts, different OTel spans.
  • Per-request model choice. Route English to one model and multilingual to another. With TEI you build a router; with SIE the cluster is the router (see the sketch below).

If your shape is “two pinned models, both at high QPS, both pegging their GPUs”, N TEI behind ingress is simpler and you should keep it.
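
As a concrete illustration of per-request model choice, here is a sketch of the routing pattern using the SDK calls shown later on this page; the multilingual checkpoint is a stand-in for whatever your pipeline actually uses:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def embed(text: str, language: str):
    # Pick the model per request instead of per deployment.
    # The multilingual model ID below is a placeholder, not an SIE default.
    model = (
        "BAAI/bge-small-en-v1.5"
        if language == "en"
        else "intfloat/multilingual-e5-small"
    )
    # Same cluster, same call; only the model argument changes.
    return client.encode(model, [Item(text=text)])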

  • Model checkpoints. Every BERT, Sentence-Transformers, or cross-encoder model that TEI serves works in SIE from the same checkpoint, in the same vector space.
Terminal window
# One container per model
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5
docker run -d -p 8089:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-reranker-v2-m3

import httpx

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = httpx.post("http://localhost:8088/embed",
                   json={"inputs": texts}).json()

# Rerank (different container)
rerank = httpx.post("http://localhost:8089/rerank",
                    json={"query": query, "texts": docs}).json()
Terminal window
# One cluster, both models
mise run serve -- -m BAAI/bge-small-en-v1.5,BAAI/bge-reranker-v2-m3

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)

# Rerank: same cluster, different model
rerank = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(text=d) for d in docs],
)
| TEI | SIE equivalent |
| --- | --- |
| --model-id BAAI/bge-small-en-v1.5 | bundle config + mise run serve |
| One container per model | One cluster, model selected per request |
| POST /embed | client.encode(model, items) |
| POST /rerank | client.score(model, query, items) |
| POST /embed_sparse | client.encode(..., output_types=["sparse"]) |
| POST /embed_all (multivector) | client.encode(..., output_types=["multivector"]) |
| --auto-truncate / --max-batch-tokens | Per-model in SIE bundle config |
| /v1/embeddings (OpenAI-compatible, optional) | /v1/embeddings on SIE (always on) |
| --dtype float16 / bfloat16 | Per-model in adapter config |
| /health and /metrics | Same paths on SIE; pre-built Grafana dashboards available |
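
Because the OpenAI-compatible /v1/embeddings route is always on, existing OpenAI-client code can point straight at the cluster. A sketch, assuming the model field takes the same model IDs used elsewhere on this page and that no API key is enforced locally:

from openai import OpenAI

# Point the stock OpenAI client at SIE's OpenAI-compatible route.
# The api_key value is a placeholder; your auth setup may differ.
oai = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = oai.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["The mitochondrion is the powerhouse of the cell."],
)
vector = resp.data[0].embedding  # plain list[float]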

SIE’s typed outputs replace TEI’s separate endpoints:

item = Item(text="What is the powerhouse of the cell?")

# Sparse (SPLADE)
sparse = client.encode("naver/splade-v3", item, output_types=["sparse"])
# sparse["sparse"] is a SparseVector with .indices and .values

# Multivector (ColBERT)
mv = client.encode("jinaai/jina-colbert-v2", item, output_types=["multivector"])
# mv["multivector"] is np.ndarray of shape [n_tokens, dim]

Not when you stay on the same checkpoint. The ~1e-3 cosine drift between TEI’s backend (Candle / CTranslate2 / ONNX, depending on flags) and SIE’s PyTorch backend is well below any retrieval-quality threshold.

Terminal window
# Bring up TEI on a known checkpoint.
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id sentence-transformers/all-MiniLM-L6-v2

# Bring up SIE with the same checkpoint and a reranker.
mise run serve -- \
  -m sentence-transformers/all-MiniLM-L6-v2,BAAI/bge-reranker-v2-m3

Run the ‘before’ and ‘after’ snippets from this page against both. On the same checkpoint, expect cosine at or above 0.999.
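
A sketch of that check, assuming the SIE SDK exposes the dense vector under a "dense" key per result (mirroring the sparse/multivector examples above); adjust the indexing to your SDK version:

import httpx
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

text = "The mitochondrion is the powerhouse of the cell."

# TEI: /embed returns one vector per input.
tei_vec = np.array(httpx.post("http://localhost:8088/embed",
                              json={"inputs": [text]}).json()[0])

# SIE: same checkpoint via the SDK. The ["dense"] indexing is an assumption
# based on the typed-output examples above.
client = SIEClient("http://localhost:8080")
result = client.encode("sentence-transformers/all-MiniLM-L6-v2", Item(text=text))
sie_vec = np.array(result["dense"])

cosine = float(np.dot(tei_vec, sie_vec)
               / (np.linalg.norm(tei_vec) * np.linalg.norm(sie_vec)))
print(f"cosine = {cosine:.6f}")  # expect >= 0.999 on the same checkpoint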
