What is SIE?

SIE (Superlinked Inference Engine) is an open-source inference server for small AI models. It runs encoders, rerankers, and entity extractors on your own infrastructure, from a laptop to a production Kubernetes cluster, without managing per-model deployments or paying per-token API costs.

SIE exposes three primitives:

Encode: converts text or images to vectors for semantic search and RAG.
Score: reranks query-document pairs for higher-precision retrieval.
Extract: pulls entities and structured data from unstructured text.
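
To make the role of Encode concrete, here is a minimal sketch of how vectors are typically used for semantic search: documents are ranked by cosine similarity to the query vector. This does not call SIE. The vectors are hand-written toy values standing in for encoder output, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (assumed values, not real encoder output).
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
ranked_ids = [doc_id for doc_id, _ in ranking]
print(ranked_ids)
```

In a retrieval pipeline, a reranking step such as Score would then reorder the top candidates from a ranking like this one.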

85+ models are supported out of the box. The server handles batching, GPU sharing, and model switching automatically. Browse the full model catalog.

SIE is built by Superlinked, the team behind the Superlinked vector compute framework. Read the launch post.

Get Started

I want to…	Go to
Get my first vectors in 2 minutes	Quickstart
Embed text or images	Encode Overview
Rerank search results	Score Overview
Extract entities from text	Extract Overview
Choose the right model	Model Selection Guide
See all 85+ models	Model Catalog
Deploy to production	Deployment Overview
Connect to LangChain or LlamaIndex	Integrations

Why Does SIE Exist?

LLM inference tools are designed for one large model spread across many GPUs. Small model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU and need fast switching between them.

What makes SIE different from other inference servers:

Compute engine abstraction. SIE wraps PyTorch, SGLang, and Flash Attention behind three uniform primitives. The server picks the best engine per model automatically.
Multi-model GPU sharing. Many models can share one GPU via LRU eviction. One SIE instance can serve any supported model at query time, loading it on demand instead of pre-loading everything.
Same code, laptop to cloud. The same Docker image runs locally and in a production Kubernetes cluster. There is no separate production mode.
Validated correctness. Every supported model has quality and latency targets checked in CI.

How Does SIE Compare to Alternatives?

Feature	SIE	TEI (Hugging Face)	OpenAI API
Self-hosted	Yes	Yes	No
Multi-model on one GPU	Yes	No (one model per server)	N/A
Encode + Score + Extract	Yes	Encode only	Encode only
85+ supported models	Yes	Varies	Limited
Open source	Yes	Yes	No
No per-token cost	Yes	Yes	No

See the SIE vs TEI vs OpenAI benchmark for full performance numbers.

Frequently Asked Questions

What is SIE used for? SIE is used to generate embeddings for semantic search and RAG pipelines, rerank search results to improve precision, and extract entities from unstructured text. All of this runs on your own infrastructure. See superlinked.com for more on what you can build.

Does SIE support GPU inference? Yes. SIE runs on CPU or GPU. For production inference at scale, a GPU is strongly recommended. See Hardware and Capacity for GPU sizing guidance.

How many models can SIE run at the same time? SIE loads models on demand and evicts the least-recently-used models when GPU memory fills up. An L4 GPU (24 GB) keeps 2 to 3 standard models hot simultaneously. All 85+ models remain available at query time regardless of VRAM, because a model that is not currently resident is loaded when a request for it arrives.
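
The load-on-demand and least-recently-used eviction behavior described above can be illustrated with a small sketch. This is not SIE's implementation. The class `LruModelCache`, the `load_fn` callback, and the model sizes are all hypothetical, chosen only to show the policy against a 24 GB budget.

```python
from collections import OrderedDict


class LruModelCache:
    """Keeps models resident within a memory budget, evicting the
    least-recently-used one when a new model does not fit."""

    def __init__(self, capacity_gb, sizes_gb, load_fn):
        self.capacity_gb = capacity_gb
        self.sizes_gb = sizes_gb      # hypothetical: model name -> size in GB
        self.load_fn = load_fn        # hypothetical: model name -> model object
        self._resident = OrderedDict()  # oldest entry first
        self.used_gb = 0

    def get(self, name):
        # Cache hit: mark the model as most recently used.
        if name in self._resident:
            self._resident.move_to_end(name)
            return self._resident[name]

        size = self.sizes_gb[name]
        if size > self.capacity_gb:
            raise MemoryError(f"{name} needs {size} GB, budget is {self.capacity_gb} GB")

        # Cache miss: evict least-recently-used models until the new one fits.
        while self.used_gb + size > self.capacity_gb:
            evicted_name, _ = self._resident.popitem(last=False)
            self.used_gb -= self.sizes_gb[evicted_name]

        model = self.load_fn(name)
        self._resident[name] = model
        self.used_gb += size
        return model

    def resident(self):
        """Names of resident models, least recently used first."""
        return list(self._resident)


# Assumed sizes for illustration only; real model footprints differ.
sizes = {"encoder": 8, "reranker": 8, "extractor": 10}
cache = LruModelCache(capacity_gb=24, sizes_gb=sizes, load_fn=lambda n: f"<{n}>")

cache.get("encoder")
cache.get("reranker")
cache.get("encoder")    # touch: reranker is now the least recently used
cache.get("extractor")  # 16 + 10 > 24, so reranker is evicted
print(cache.resident(), cache.used_gb)
```

With these assumed sizes, the request for the third model evicts the reranker rather than the encoder, because the encoder was used more recently. A later request for the reranker would simply load it again, which is why every model stays reachable even though only a few are hot at once.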

Is SIE open source? Yes. SIE is open source and available on GitHub. The core inference server is free to use. Superlinked also offers managed cloud deployment. Contact us to learn more.

How is SIE different from the Superlinked framework? The Superlinked framework is a higher-level Python SDK for building multi-attribute search and recommendation systems. SIE is the inference layer underneath it. You can use SIE standalone or as part of a full Superlinked stack.