---
title: TEI → SIE
description: Replace N single-model HuggingFace TEI containers with one SIE cluster. Same checkpoints, one process for dense, sparse, multivector, and rerank.
canonical_url: https://superlinked.com/docs/migrate/tei
last_updated: 2026-05-08
---

[Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)
is HuggingFace's single-model embedding/reranking server. The migration
to SIE is the headline use case for platform engineers:
**N TEI containers → 1 SIE cluster.**

## Why migrate

- **Multi-model in one process.** SIE serves dense, sparse, multivector,
  rerank, and vision models from one cluster. TEI is one model per
  container, so N models means N deployments, N health checks, N
  autoscalers.
- **Query-time model selection.** Choose the model on every request.
  Adding a new model in TEI means redeploying a container; in SIE it
  means hot-loading via the config API (sketched after this list).
- **Typed sparse + multivector outputs.** TEI exposes `/embed_sparse`
  and `/embed_all` as separate endpoints; outputs aren't typed. SIE
  has typed `dense`, `sparse`, and `multivector` outputs in one call.
- **Apple Silicon.** SIE runs on `--device mps` for local development.
  TEI's CPU image works on macOS but is significantly slower.
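
To make the hot-loading point concrete, here is a sketch only: the
route name and payload below are **hypothetical** stand-ins, not SIE's
documented config API; check the config API reference for the real
ones. The point is the shape of the operation:

```python
import httpx

# HYPOTHETICAL route and payload, for illustration only. One POST to
# the running cluster registers a new checkpoint; no redeploy.
httpx.post(
    "http://localhost:8080/config/models",
    json={"model_id": "BAAI/bge-m3"},
)

# New requests can select the model immediately: no new container,
# no new health check, no new autoscaler.
```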

### "Why not just put N TEI containers behind one ingress?"

Fair question. For two or three stable models, that's a perfectly good
answer and you should not migrate. SIE earns its keep when you have:

- **Several models in active use**, where the per-container fixed
  overhead (RAM, sidecar tax, scrape targets, alert routes) starts to
  dominate.
- **A long tail of "sometimes" models**, such as domain rerankers,
  language variants, and experimental checkpoints. SIE's LRU model
  cache lets you list them all in one bundle and load them on demand.
  N TEIs means N standing pods or N scale-to-zero cold starts.
- **Mixed modalities in the same request path** (dense plus rerank,
  or dense plus sparse plus multivector). One round-trip to SIE
  replaces two or three to different TEI services with different DNS,
  different timeouts, different OTel spans.
- **Per-request model choice.** Route English to one model and
  multilingual to another. With TEI you build a router. With SIE the
  cluster *is* the router, as the sketch below shows.
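
A minimal sketch of that point, using only the `SIEClient.encode` call
shown later on this page. The `lang` argument is a placeholder for
whatever language detection you already run, and `BAAI/bge-m3` stands
in for your multilingual checkpoint; both models must be in the served
bundle:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# The "router" is one expression: pick the model per request and pass
# it as an argument. No second service, no second DNS name.
def embed(text: str, lang: str):
    model = "BAAI/bge-small-en-v1.5" if lang == "en" else "BAAI/bge-m3"
    return client.encode(model, [Item(text=text)])

embed("the powerhouse of the cell", lang="en")
embed("la centrale énergétique de la cellule", lang="fr")
```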

If your shape is "two pinned models, both at high QPS, both pegging
their GPUs", N TEI behind ingress is simpler and you should keep it.

## What stays the same

- **Model checkpoints.** Every BERT / Sentence-Transformers / cross-encoder
  model TEI serves works in SIE on the same checkpoint, in the same
  vector space.

## Before

```bash
# One container per model
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5

docker run -d -p 8089:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-reranker-v2-m3
```

```python
import httpx

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = httpx.post("http://localhost:8088/embed",
                   json={"inputs": texts}).json()

# Rerank (different container)
rerank = httpx.post("http://localhost:8089/rerank",
                    json={"query": query, "texts": docs}).json()
```

## After

```bash
# One cluster, both models
mise run serve -- -m BAAI/bge-small-en-v1.5,BAAI/bge-reranker-v2-m3
```

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

texts = ["The mitochondrion is the powerhouse of the cell."]
query = "What is the powerhouse of the cell?"
docs = [
    "Mitochondria are the powerhouse of the cell.",
    "The Eiffel Tower is in Paris.",
]

# Embed
embed = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)

# Rerank: same cluster, different model
rerank = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(text=d) for d in docs],
)
```

## Mapping

| TEI                                            | SIE equivalent                                            |
|------------------------------------------------|-----------------------------------------------------------|
| `--model-id BAAI/bge-small-en-v1.5`            | bundle config + `mise run serve`                          |
| One container per model                        | One cluster, model selected per request                   |
| `POST /embed`                                  | `client.encode(model, items)`                             |
| `POST /rerank`                                 | `client.score(model, query, items)`                       |
| `POST /embed_sparse`                           | `client.encode(..., output_types=["sparse"])`             |
| `POST /embed_all` (multivector)                | `client.encode(..., output_types=["multivector"])`        |
| `--auto-truncate` / `--max-batch-tokens`       | Per-model in SIE bundle config                            |
| `/v1/embeddings` (OpenAI-compatible, optional) | `/v1/embeddings` on SIE (always-on)                       |
| `--dtype float16` / `bfloat16`                 | Per-model in adapter config                               |
| `/health` and `/metrics`                       | Same paths on SIE; pre-built Grafana dashboards available |
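
One row worth spelling out: if you already reach TEI through the
OpenAI SDK, the same client should work against SIE by swapping the
base URL. A sketch, assuming SIE's `/v1/embeddings` accepts the
standard OpenAI payload; the `api_key` value is a placeholder, and SIE
auth (if any) is not covered here:

```python
from openai import OpenAI

# The same code that talked to TEI's OpenAI-compatible route; only
# base_url changes. api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["The mitochondrion is the powerhouse of the cell."],
)
print(len(resp.data[0].embedding))  # vector dimension, 384 for bge-small
```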

## Sparse and multivector

SIE's typed outputs replace TEI's separate endpoints:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
item = Item(text="Mitochondria are the powerhouse of the cell.")

# Sparse (SPLADE)
sparse = client.encode("naver/splade-v3", item, output_types=["sparse"])
# sparse["sparse"] is a SparseVector with .indices and .values

# Multivector (ColBERT)
mv = client.encode("jinaai/jina-colbert-v2", item, output_types=["multivector"])
# mv["multivector"] is an np.ndarray of shape [n_tokens, dim]
```

## Re-embed required?

**No**, as long as you stay on the same checkpoint. The ~1e-3 cosine
drift between TEI's backend (Candle or ONNX, depending on build flags)
and SIE's PyTorch backend is numerical noise, well below any threshold
that affects retrieval quality.

## Run it yourself

```bash
# Bring up TEI on a known checkpoint.
docker run -d -p 8088:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id sentence-transformers/all-MiniLM-L6-v2

# Bring up SIE with the same checkpoint and a reranker.
mise run serve -- \
  -m sentence-transformers/all-MiniLM-L6-v2,BAAI/bge-reranker-v2-m3
```

Run the 'before' snippet from this page against TEI and the 'after'
snippet against SIE, swapping in the checkpoint above. On the same
checkpoint, expect cosine similarity at or above 0.999 between the two
embeddings.
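
A quick way to check, assuming the containers above are still running
(TEI on 8088, SIE on 8080). The `[0]` indexing on the SIE result
assumes `encode` returns one dense vector per input item; adjust to
the shape your SDK version actually returns:

```python
import httpx
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

text = "The mitochondrion is the powerhouse of the cell."

# TEI: /embed returns a list of float vectors, one per input.
tei = np.asarray(
    httpx.post("http://localhost:8088/embed", json={"inputs": [text]}).json()[0]
)

# SIE: same checkpoint through the SDK. ASSUMPTION: one vector per item.
sie = np.asarray(
    SIEClient("http://localhost:8080").encode(
        "sentence-transformers/all-MiniLM-L6-v2", [Item(text=text)]
    )[0]
)

cos = float(tei @ sie / (np.linalg.norm(tei) * np.linalg.norm(sie)))
print(f"cosine similarity: {cos:.6f}")  # expect >= 0.999
```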
