# TEI → SIE
Text Embeddings Inference (TEI) is Hugging Face's single-model embedding/reranking server. The migration to SIE is the headline use case for platform engineers: N TEI containers → 1 SIE cluster.
## Why migrate

- Multi-model in one process. SIE serves dense, sparse, multivector, rerank, and vision models from one cluster. TEI is one model per container, so N models means N deployments, N health checks, N autoscalers.
- Query-time model selection. Choose the model on every request. Adding a new model in TEI means redeploying a container; in SIE it means hot-loading via the config API.
- Typed sparse + multivector outputs. TEI exposes `/embed_sparse` and `/embed_all` as separate endpoints; outputs aren't typed. SIE has typed `dense`, `sparse`, and `multivector` outputs in one call (see the sketch after this list).
- Apple Silicon. SIE runs on `--device mps` for local development. TEI's CPU image works on macOS but is significantly slower.
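
As a sketch of those typed outputs, the call below asks for dense and sparse vectors in a single `encode` call. The model id `BAAI/bge-m3` and the multi-value `output_types` argument are illustrative assumptions; which output types a given checkpoint can emit is model-dependent.

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# One request, two typed outputs. The model id and the multi-value
# output_types argument are assumptions for illustration.
out = client.encode(
    "BAAI/bge-m3",
    [Item(text="The mitochondrion is the powerhouse of the cell.")],
    output_types=["dense", "sparse"],
)
# out["dense"]  -> dense vector(s)
# out["sparse"] -> SparseVector with .indices and .values
```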
## "Why not just put N TEI containers behind one ingress?"

Fair question. For two or three stable models, that's a perfectly good answer and you should not migrate. SIE earns its keep when you have:
- Several models in active use, where the per-container fixed overhead (RAM, sidecar tax, scrape targets, alert routes) starts to dominate.
- A long tail of “sometimes” models, like domain rerankers, language variants, experimental checkpoints. SIE’s LRU lets you list them all in one bundle and load on demand. N TEIs means N standing pods or N scale-to-zero cold starts.
- Mixed modalities in the same request path (dense plus rerank, or dense plus sparse plus multivector). One round-trip to SIE replaces two or three to different TEI services with different DNS, different timeouts, different OTel spans.
- Per-request model choice. Route English to one model and multilingual to another. With TEI you build a router. With SIE the cluster is the router.
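
A minimal sketch of that per-request choice, assuming both model ids below are listed in the bundle. The multilingual checkpoint (`BAAI/bge-m3`) and the ASCII-based routing heuristic are illustrative, not part of SIE:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def embed(text: str):
    # Crude routing heuristic, for illustration only: non-ASCII text goes to a
    # multilingual checkpoint, everything else to the English model.
    model = "BAAI/bge-small-en-v1.5" if text.isascii() else "BAAI/bge-m3"
    # Same cluster either way; the model is chosen per request.
    return client.encode(model, [Item(text=text)])
```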
If your shape is “two pinned models, both at high QPS, both pegging their GPUs”, N TEI behind ingress is simpler and you should keep it.
## What stays the same

- Model checkpoints. Every BERT / Sentence-Transformers / cross-encoder model TEI serves works in SIE on the same checkpoint, in the same vector space.
### Before

```bash
# One container per model
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5

docker run -d -p 8089:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-reranker-v2-m3
```

```python
import httpx

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = httpx.post("http://localhost:8088/embed", json={"inputs": texts}).json()

# Rerank (different container)
rerank = httpx.post("http://localhost:8089/rerank", json={"query": query, "texts": docs}).json()
```

### After

```bash
# One cluster, both models
mise run serve -- -m BAAI/bge-small-en-v1.5,BAAI/bge-reranker-v2-m3
```

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)

# Rerank: same cluster, different model
rerank = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(text=d) for d in docs],
)
```
## Mapping

| TEI | SIE equivalent |
|---|---|
| `--model-id BAAI/bge-small-en-v1.5` | bundle config + `mise run serve` |
| One container per model | One cluster, model selected per request |
| `POST /embed` | `client.encode(model, items)` |
| `POST /rerank` | `client.score(model, query, items)` |
| `POST /embed_sparse` | `client.encode(..., output_types=["sparse"])` |
| `POST /embed_all` (multivector) | `client.encode(..., output_types=["multivector"])` |
| `--auto-truncate` / `--max-batch-tokens` | Per-model in SIE bundle config |
| `/v1/embeddings` (OpenAI-compatible, optional) | `/v1/embeddings` on SIE (always-on; example below) |
| `--dtype float16` / `bfloat16` | Per-model in adapter config |
| `/health` and `/metrics` | Same paths on SIE; pre-built Grafana dashboards available |
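
For the OpenAI-compatible row, here is a hedged sketch of calling SIE's `/v1/embeddings` directly with `httpx`. The path comes from the table above; the request and response field names are assumed to follow the standard OpenAI embeddings schema (`model` + `input` in, `data[i].embedding` out).

```python
import httpx

# OpenAI-style embeddings request against SIE's always-on endpoint.
# Field names assume the standard OpenAI embeddings schema.
resp = httpx.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "BAAI/bge-small-en-v1.5",
        "input": ["The mitochondrion is the powerhouse of the cell."],
    },
).json()

vector = resp["data"][0]["embedding"]  # list[float]
```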
## Sparse and multivector

SIE's typed outputs replace TEI's separate endpoints:

```python
# Sparse (SPLADE)
sparse = client.encode("naver/splade-v3", item, output_types=["sparse"])
# sparse["sparse"] is a SparseVector with .indices and .values

# Multivector (ColBERT)
mv = client.encode("jinaai/jina-colbert-v2", item, output_types=["multivector"])
# mv["multivector"] is an np.ndarray of shape [n_tokens, dim]
```
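
If a downstream index wants plain token-weight pairs, the typed result can be flattened using nothing beyond the `.indices` and `.values` attributes noted above. Continuing from the sparse snippet:

```python
# Turn the typed SparseVector into {token_id: weight} for an inverted index.
sparse_vec = sparse["sparse"]
weights = dict(zip(sparse_vec.indices, sparse_vec.values))
```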
## Re-embed required?

No, as long as you stay on the same checkpoint. The ~1e-3 cosine drift between TEI's backend (Candle / CTranslate2 / ONNX, depending on flags) and SIE's PyTorch backend is well below any retrieval-quality threshold.
## Run it yourself

```bash
# Bring up TEI on a known checkpoint.
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id sentence-transformers/all-MiniLM-L6-v2

# Bring up SIE with the same checkpoint and a reranker.
mise run serve -- \
  -m sentence-transformers/all-MiniLM-L6-v2,BAAI/bge-reranker-v2-m3
```

Run the "before" and "after" snippets from this page against both. On the same checkpoint, expect cosine at or above 0.999.
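
A minimal comparison script under the same assumptions as the snippets above: TEI on port 8088, SIE on 8080, and the dense result exposed under a `"dense"` key, mirroring how the sparse and multivector examples index their results (that key is an assumption, not confirmed API).

```python
import httpx
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

text = "The mitochondrion is the powerhouse of the cell."

# TEI: raw HTTP, one container per model. /embed returns a list of embeddings.
tei = np.array(
    httpx.post("http://localhost:8088/embed", json={"inputs": [text]}).json()[0]
)

# SIE: same checkpoint through the SDK. The "dense" key is an assumption,
# following the sparse/multivector indexing style shown earlier on this page.
client = SIEClient("http://localhost:8080")
sie = np.asarray(
    client.encode("sentence-transformers/all-MiniLM-L6-v2", Item(text=text))["dense"]
)

cosine = float(tei @ sie / (np.linalg.norm(tei) * np.linalg.norm(sie)))
print(f"cosine: {cosine:.6f}")  # expect >= 0.999 on the same checkpoint
```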