
Infinity → SIE

Infinity ships an embedding, reranking, and CLIP server with an OpenAI-compatible API. SIE covers the same surface plus sparse / multivector models, multi-model serving, and managed deployment tooling.

  • One cluster for N models. Infinity is single-model per container. SIE serves all configured models from one cluster with LRU eviction.
  • Typed multi-modality outputs. Infinity is centered on dense embeddings plus cross-encoder rerank. SIE returns typed dense, sparse, and multivector outputs from a single encode call, useful when an upstream retriever wants more than one signal per document (see the sketch after this list).
  • Managed deployment. SIE ships a Helm chart, KEDA autoscaler config, Grafana dashboards, and a sie-admin CLI.
  • OpenAI-compatible endpoint. Existing Infinity clients (typically the OpenAI SDK pointed at Infinity) can swap base URLs and keep working.
  • Model checkpoints. Same checkpoint, same vector space. Most Infinity-supported encoders work in SIE without re-engineering.
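A minimal sketch of what a typed multi-output encode could look like. Only SIEClient, encode, and Item come from the snippets on this page; the model choice (BAAI/bge-m3, a multi-output encoder) and the dense / sparse / multivector attribute names are illustrative assumptions, not confirmed SDK fields.

from sie_sdk import SIEClient
from sie_sdk.types import Item

sie = SIEClient("http://localhost:8080")
# One encode call, several typed signals per document.
result = sie.encode("BAAI/bge-m3", Item(text="hybrid retrieval example"))
print(result.dense)        # dense vector: list[float]        (attribute name assumed)
print(result.sparse)       # token -> weight mapping          (attribute name assumed)
print(result.multivector)  # per-token vectors, ColBERT-style (attribute name assumed)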
Before (Infinity):

# Pin a tag in production; :latest shown for brevity.
# See https://hub.docker.com/r/michaelfeil/infinity/tags
docker run --rm -p 7997:7997 michaelfeil/infinity:latest \
  v2 --model-id BAAI/bge-small-en-v1.5

from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:7997")
resp = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["..."],
)
After (SIE):

mise run serve -- -m BAAI/bge-small-en-v1.5

from openai import OpenAI

# Drop-in: keep the OpenAI SDK, change base_url.
client = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

# …or use the native SDK for sparse/multivector/rerank.
from sie_sdk import SIEClient
from sie_sdk.types import Item

sie = SIEClient("http://localhost:8080")
result = sie.encode("BAAI/bge-small-en-v1.5", Item(text="..."))
| Infinity | SIE equivalent |
| --- | --- |
| --model-id BAAI/bge-small-en-v1.5 | bundle config + mise run serve |
| --engine torch / optimum / ctranslate2 | SIE adapter selection (auto) |
| Multiple containers for multiple models | Single SIE cluster, one Helm chart |
| /embeddings (OpenAI-compatible) | /v1/embeddings on SIE |
| /rerank (custom Infinity endpoint) | client.score(...) (Python SDK) |
| /classify | client.extract(...) with classifier model |
| Image inputs on /embeddings (CLIP) | client.encode(model, Item(image=...)) |
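As a worked example of one table row, here is a rerank call before and after. The Infinity request uses its /rerank endpoint's model / query / documents payload; the keyword arguments to score(...) are assumptions beyond the method name shown in the table, as is the reranker model choice.

import requests
from sie_sdk import SIEClient

query = "what is sparse retrieval?"
docs = [
    "Sparse retrieval scores exact token overlap.",
    "Dense retrieval compares embedding vectors.",
]

# Before: Infinity's custom /rerank endpoint.
resp = requests.post(
    "http://localhost:7997/rerank",
    json={"model": "BAAI/bge-reranker-base", "query": query, "documents": docs},
)
print(resp.json())

# After: SIE's Python SDK (argument names assumed for illustration).
sie = SIEClient("http://localhost:8080")
scores = sie.score("BAAI/bge-reranker-base", query=query, documents=docs)
print(scores)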

Do the vectors change? No. Cross-backend numerical drift between Infinity (PyTorch, CTranslate2, or ONNX, depending on flags) and SIE (PyTorch) is on the order of 1e-3 in cosine distance, i.e. cosine similarity of roughly 0.999 or better, too small to affect retrieval quality.

Run both side by side:
# Pin a tag in production; :latest shown for brevity.
docker run -d -p 7997:7997 michaelfeil/infinity:latest \
  v2 --model-id sentence-transformers/all-MiniLM-L6-v2
mise run serve -- -m sentence-transformers/all-MiniLM-L6-v2
uv add openai

Run the ‘before’ and ‘after’ snippets from this page against both servers. Expected: identical dimensionality (384) and cosine similarity at or above 0.999. A minimal comparison script follows.
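A sketch of that comparison, assuming Infinity on :7997 and SIE on :8080 as started above; both are queried through the OpenAI SDK, so no SIE-specific calls are needed.

import math
from openai import OpenAI

TEXT = "sanity check sentence"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"

infinity = OpenAI(api_key="not-needed", base_url="http://localhost:7997")
sie = OpenAI(api_key="not-needed", base_url="http://localhost:8080/v1")

a = infinity.embeddings.create(model=MODEL, input=[TEXT]).data[0].embedding
b = sie.embeddings.create(model=MODEL, input=[TEXT]).data[0].embedding

# Same model, same vector space: dimensions must match exactly.
assert len(a) == len(b) == 384, (len(a), len(b))

cos = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"dim={len(a)} cosine={cos:.4f}")  # expect cosine >= 0.999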
