
Infinity → SIE

Infinity ships an embedding, reranking, and CLIP server with an OpenAI-compatible API. SIE covers the same surface plus sparse / multivector models, multi-model serving, and managed deployment tooling.

  • One cluster for N models. Infinity is single-model per container. SIE serves all configured models from one cluster with LRU eviction.
  • Typed multi-modality outputs. Infinity is centered on dense embeddings plus cross-encoder rerank. SIE returns typed dense, sparse, and multivector outputs from a single encode call, useful when an upstream retriever wants more than one signal per document (see the sketch after this list).
  • Managed deployment. SIE ships a Helm chart, KEDA autoscaler config, Grafana dashboards, and a sie-admin CLI.
  • OpenAI-compatible endpoint. Existing Infinity clients (typically the OpenAI SDK pointed at Infinity) can swap base URLs and keep working.
  • Model checkpoints. Same checkpoint, same vector space. Most Infinity-supported encoders work in SIE without re-engineering.
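A minimal sketch of what a typed multi-output encode could look like. Only SIEClient, encode, and Item come from the snippets on this page; the model choice (BAAI/bge-m3, a multi-output encoder) and the dense / sparse / multivector attribute names are illustrative assumptions, not confirmed SDK fields.

from sie_sdk import SIEClient
from sie_sdk.types import Item

sie = SIEClient("http://localhost:8080")
# One encode call, several typed signals per document.
result = sie.encode("BAAI/bge-m3", Item(text="hybrid retrieval example"))
print(result.dense)        # dense vector: list[float]        (attribute name assumed)
print(result.sparse)       # token -> weight mapping          (attribute name assumed)
print(result.multivector)  # per-token vectors, ColBERT-style (attribute name assumed)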
Before (Infinity):

# Pin a tag in production; :latest shown for brevity.
# See https://hub.docker.com/r/michaelfeil/infinity/tags
docker run --rm -p 7997:7997 michaelfeil/infinity:latest \
  v2 --model-id BAAI/bge-small-en-v1.5

from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:7997")
resp = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["..."],
)
After (SIE):

mise run serve -- -m BAAI/bge-small-en-v1.5

from openai import OpenAI

# Drop-in: keep the OpenAI SDK, change base_url.
client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# …or use the native SDK for sparse/multivector/rerank.
from sie_sdk import SIEClient
from sie_sdk.types import Item

sie = SIEClient("http://localhost:8080")
result = sie.encode("BAAI/bge-small-en-v1.5", Item(text="..."))
| Infinity | SIE equivalent |
| --- | --- |
| --model-id BAAI/bge-small-en-v1.5 | bundle config + mise run serve |
| --engine torch / optimum / ctranslate2 | SIE adapter selection (auto) |
| Multiple containers for multiple models | Single SIE cluster, one Helm chart |
| /embeddings (OpenAI-compatible) | /v1/embeddings on SIE |
| /rerank (custom Infinity endpoint) | client.score(...) (Python SDK) |
| /classify | client.extract(...) with classifier model |
| Image inputs on /embeddings (CLIP) | client.encode(model, Item(image=...)) |
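As a worked example of one table row, here is a rerank call before and after. The Infinity request uses its /rerank endpoint's model / query / documents payload; the keyword arguments to score(...) are assumptions beyond the method name shown in the table, as is the reranker model choice.

import requests
from sie_sdk import SIEClient

query = "what is sparse retrieval?"
docs = [
    "Sparse retrieval scores exact token overlap.",
    "Dense retrieval compares embedding vectors.",
]

# Before: Infinity's custom /rerank endpoint.
resp = requests.post(
    "http://localhost:7997/rerank",
    json={"model": "BAAI/bge-reranker-base", "query": query, "documents": docs},
)
print(resp.json())

# After: SIE's Python SDK (argument names assumed for illustration).
sie = SIEClient("http://localhost:8080")
scores = sie.score("BAAI/bge-reranker-base", query=query, documents=docs)
print(scores)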

Do the vectors change? No. Cross-backend numerical drift between Infinity (PyTorch, CTranslate2, or ONNX, depending on flags) and SIE (PyTorch) is on the order of 1e-3 in cosine distance, i.e. cosine similarity of roughly 0.999 or better, too small to affect retrieval quality.

Run both side by side:
# Pin a tag in production; :latest shown for brevity.
docker run -d -p 7997:7997 michaelfeil/infinity:latest \
  v2 --model-id sentence-transformers/all-MiniLM-L6-v2
mise run serve -- -m sentence-transformers/all-MiniLM-L6-v2
uv add openai

Run the ‘before’ and ‘after’ snippets from this page against both servers. Expected: identical dimensionality (384) and cosine similarity at or above 0.999. A minimal comparison script follows.
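A sketch of that comparison, assuming Infinity on :7997 and SIE on :8080 as started above; both are queried through the OpenAI SDK, so no SIE-specific calls are needed.

import math
from openai import OpenAI

TEXT = "sanity check sentence"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"

infinity = OpenAI(api_key="not-needed", base_url="http://localhost:7997")
sie = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

a = infinity.embeddings.create(model=MODEL, input=[TEXT]).data[0].embedding
b = sie.embeddings.create(model=MODEL, input=[TEXT]).data[0].embedding

# Same model, same vector space: dimensions must match exactly.
assert len(a) == len(b) == 384, (len(a), len(b))

cos = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"dim={len(a)} cosine={cos:.4f}")  # expect cosine >= 0.999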
