
TEI → SIE

Text Embeddings Inference (TEI) is Hugging Face’s single-model embedding/reranking server. The migration to SIE is the headline use case for platform engineers: N TEI containers → 1 SIE cluster.

  • Multi-model in one process. SIE serves dense, sparse, multivector, rerank, and vision models from one cluster. TEI is one model per container, so N models means N deployments, N health checks, N autoscalers.
  • Query-time model selection. Choose the model on every request. Adding a new model in TEI means redeploying a container; in SIE it means hot-loading via the config API.
  • Typed sparse + multivector outputs. TEI exposes /embed_sparse and /embed_all as separate endpoints; outputs aren’t typed. SIE has typed dense, sparse, and multivector outputs in one call.
  • Apple Silicon. SIE runs with --device mps for local development (see the sketch below). TEI’s CPU image works on macOS but is significantly slower.
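
For local development on a Mac, the same serve entry point shown later on this page can target Metal. A minimal sketch; the exact placement of the --device flag relative to mise run serve may differ in your SIE version:

Terminal window
# Serve an embedding model on Apple Silicon (Metal) for local development
mise run serve -- -m BAAI/bge-small-en-v1.5 --device mps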

“Why not just put N TEI containers behind one ingress?”


Fair question. For two or three stable models, that’s a perfectly good answer and you should not migrate. SIE earns its keep when you have:

  • Several models in active use, where the per-container fixed overhead (RAM, sidecar tax, scrape targets, alert routes) starts to dominate.
  • A long tail of “sometimes” models, like domain rerankers, language variants, experimental checkpoints. SIE’s LRU lets you list them all in one bundle and load on demand. N TEIs means N standing pods or N scale-to-zero cold starts.
  • Mixed modalities in the same request path (dense plus rerank, or dense plus sparse plus multivector). One round-trip to SIE replaces two or three to different TEI services with different DNS, different timeouts, different OTel spans.
  • Per-request model choice. Route English to one model and multilingual to another. With TEI you build a router; with SIE the cluster is the router (see the sketch below).

If your shape is “two pinned models, both at high QPS, both pegging their GPUs”, N TEI behind ingress is simpler and you should keep it.
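
As a concrete illustration of per-request model choice, here is a sketch of the routing pattern using the SDK calls shown later on this page; the multilingual checkpoint is a stand-in for whatever your pipeline actually uses:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def embed(text: str, language: str):
    # Pick the model per request instead of per deployment.
    # The multilingual model ID below is a placeholder, not an SIE default.
    model = (
        "BAAI/bge-small-en-v1.5"
        if language == "en"
        else "intfloat/multilingual-e5-small"
    )
    # Same cluster, same call; only the model argument changes.
    return client.encode(model, [Item(text=text)])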

  • Model checkpoints. Every BERT, Sentence-Transformers, or cross-encoder model that TEI serves works in SIE from the same checkpoint, in the same vector space.
Terminal window
# One container per model
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5
docker run -d -p 8089:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-reranker-v2-m3

import httpx

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = httpx.post("http://localhost:8088/embed",
                   json={"inputs": texts}).json()

# Rerank (different container)
rerank = httpx.post("http://localhost:8089/rerank",
                    json={"query": query, "texts": docs}).json()
Terminal window
# One cluster, both models
mise run serve -- -m BAAI/bge-small-en-v1.5,BAAI/bge-reranker-v2-m3

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)

# Rerank: same cluster, different model
rerank = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(text=d) for d in docs],
)
| TEI | SIE equivalent |
| --- | --- |
| --model-id BAAI/bge-small-en-v1.5 | bundle config + mise run serve |
| One container per model | One cluster, model selected per request |
| POST /embed | client.encode(model, items) |
| POST /rerank | client.score(model, query, items) |
| POST /embed_sparse | client.encode(..., output_types=["sparse"]) |
| POST /embed_all (multivector) | client.encode(..., output_types=["multivector"]) |
| --auto-truncate / --max-batch-tokens | Per-model in SIE bundle config |
| /v1/embeddings (OpenAI-compatible, optional) | /v1/embeddings on SIE (always on) |
| --dtype float16 / bfloat16 | Per-model in adapter config |
| /health and /metrics | Same paths on SIE; pre-built Grafana dashboards available |
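
Because the OpenAI-compatible /v1/embeddings route is always on, existing OpenAI-client code can point straight at the cluster. A sketch, assuming the model field takes the same model IDs used elsewhere on this page and that no API key is enforced locally:

from openai import OpenAI

# Point the stock OpenAI client at SIE's OpenAI-compatible route.
# The api_key value is a placeholder; your auth setup may differ.
oai = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = oai.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["The mitochondrion is the powerhouse of the cell."],
)
vector = resp.data[0].embedding  # plain list[float]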

SIE’s typed outputs replace TEI’s separate endpoints:

item = Item(text="What is the powerhouse of the cell?")

# Sparse (SPLADE)
sparse = client.encode("naver/splade-v3", item, output_types=["sparse"])
# sparse["sparse"] is a SparseVector with .indices and .values

# Multivector (ColBERT)
mv = client.encode("jinaai/jina-colbert-v2", item, output_types=["multivector"])
# mv["multivector"] is np.ndarray of shape [n_tokens, dim]

Not when you stay on the same checkpoint. The ~1e-3 cosine drift between TEI’s backend (Candle / CTranslate2 / ONNX, depending on flags) and SIE’s PyTorch backend is well below any retrieval-quality threshold.

Terminal window
# Bring up TEI on a known checkpoint.
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id sentence-transformers/all-MiniLM-L6-v2

# Bring up SIE with the same checkpoint and a reranker.
mise run serve -- \
  -m sentence-transformers/all-MiniLM-L6-v2,BAAI/bge-reranker-v2-m3

Run the ‘before’ and ‘after’ snippets from this page against both. On the same checkpoint, expect cosine at or above 0.999.
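
A sketch of that check, assuming the SIE SDK exposes the dense vector under a "dense" key per result (mirroring the sparse/multivector examples above); adjust the indexing to your SDK version:

import httpx
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

text = "The mitochondrion is the powerhouse of the cell."

# TEI: /embed returns one vector per input.
tei_vec = np.array(httpx.post("http://localhost:8088/embed",
                              json={"inputs": [text]}).json()[0])

# SIE: same checkpoint via the SDK. The ["dense"] indexing is an assumption
# based on the typed-output examples above.
client = SIEClient("http://localhost:8080")
result = client.encode("sentence-transformers/all-MiniLM-L6-v2", Item(text=text))
sie_vec = np.array(result["dense"])

cosine = float(np.dot(tei_vec, sie_vec)
               / (np.linalg.norm(tei_vec) * np.linalg.norm(sie_vec)))
print(f"cosine = {cosine:.6f}")  # expect >= 0.999 on the same checkpoint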
