
What is SIE?

SIE (Superlinked Inference Engine) is an open-source inference server for small AI models. It runs encoders, rerankers, and entity extractors on your own infrastructure, from a laptop to a production Kubernetes cluster, without managing per-model deployments or paying per-token API costs.

SIE exposes three primitives:

  • Encode: converts text or images to vectors for semantic search and RAG
  • Score: reranks query-document pairs for higher-precision retrieval
  • Extract: pulls entities and structured data from unstructured text
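
To make the three primitives concrete, here is a toy sketch in plain Python of how they might compose in a retrieve-then-rerank flow. The functions `encode`, `score`, `extract`, and `retrieve_and_rerank` are hypothetical stand-ins written for illustration. They are not the SIE client API, and the hashed bag-of-words encoder and token-overlap scorer only mimic the input and output shapes of real models.

```python
import math
import re
import zlib


def _tokens(text: str) -> list[str]:
    # Lowercased word tokens; a stand-in for a real tokenizer.
    return re.findall(r"\w+", text.lower())


def encode(text: str, dim: int = 64) -> list[float]:
    """Toy Encode: hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * dim
    for tok in _tokens(text):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def score(query: str, document: str) -> float:
    """Toy Score: fraction of query tokens that also occur in the document."""
    q, d = set(_tokens(query)), set(_tokens(document))
    return len(q & d) / len(q) if q else 0.0


def extract(text: str) -> list[dict]:
    """Toy Extract: treat capitalised words as entity spans."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\b[A-Z][A-Za-z0-9]+\b", text)
    ]


def retrieve_and_rerank(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Stage 1: vector similarity picks candidates. Stage 2: score() reorders them."""
    q_vec = encode(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    sims = [sum(a * b for a, b in zip(q_vec, encode(doc))) for doc in docs]
    candidates = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda i: score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked]


if __name__ == "__main__":
    docs = [
        "SIE shares one GPU across many models",
        "Rerankers score query-document pairs",
        "Bananas are rich in potassium",
    ]
    print(retrieve_and_rerank("share a GPU across models", docs))
    print(extract("Superlinked builds SIE"))
```

The two-stage shape is the point of the sketch: a cheap vector comparison narrows the candidate set, and the more precise pairwise scorer only runs on the survivors.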

85+ models are supported out of the box. The server handles batching, GPU sharing, and model switching automatically. Browse the full model catalog.

SIE is built by Superlinked, the team behind the Superlinked vector compute framework. Read the launch post.


I want to…                         | Go to
Get my first vectors in 2 minutes  | Quickstart
Embed text or images               | Encode Overview
Rerank search results              | Score Overview
Extract entities from text         | Extract Overview
Choose the right model             | Model Selection Guide
See all 85+ models                 | Model Catalog
Deploy to production               | Deployment Overview
Connect to LangChain or LlamaIndex | Integrations

LLM inference tools are designed for one large model spread across many GPUs. Small model inference is the opposite problem: you run many models (encoders, rerankers, extractors) on one GPU and need fast switching between them.

What makes SIE different from other inference servers:

  1. Compute engine abstraction. SIE wraps PyTorch, SGLang, and Flash Attention behind three uniform primitives. The server picks the best engine per model automatically.
  2. Multi-model GPU sharing. Many models share one GPU: when GPU memory fills up, the least-recently-used (LRU) model is evicted. One SIE instance serves any model at query time without pre-loading everything.
  3. Same code, laptop to cloud. The same Docker image runs locally and in a production Kubernetes cluster. There is no separate production mode.
  4. Validated correctness. Every supported model has quality and latency targets checked in CI.

Feature                  | SIE | TEI (HuggingFace)         | OpenAI API
Self-hosted              | Yes | Yes                       | No
Multi-model on one GPU   | Yes | No (one model per server) | N/A
Encode + Score + Extract | Yes | Encode only               | Encode only
85+ supported models     | Yes | Varies                    | Limited
Open source              | Yes | Yes                       | No
No per-token cost        | Yes | Yes                       | No

See the SIE vs TEI vs OpenAI benchmark for full performance numbers.


What is SIE used for? SIE is used to generate embeddings for semantic search and RAG pipelines, rerank search results to improve precision, and extract entities from unstructured text. All of this runs on your own infrastructure. See superlinked.com for more on what you can build.

Does SIE support GPU inference? Yes. SIE runs on CPU or GPU. For production inference at scale, a GPU is strongly recommended. See Hardware and Capacity for GPU sizing guidance.

How many models can SIE run at the same time? SIE loads models on demand and evicts the least-recently-used models when GPU memory fills up. An L4 GPU (24 GB) keeps 2 to 3 standard models hot simultaneously. All 85+ models remain available at query time regardless of VRAM, because a model that is not currently resident is loaded when a request for it arrives.
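
The load-on-demand and least-recently-used eviction behaviour described above can be sketched as a small cache bounded by a memory budget. This is a minimal illustration, not SIE's implementation: the class `LRUModelCache`, the model names, and the 9 GB footprints are all made up for the example, and only the 24 GB budget echoes the L4 figure.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy LRU cache keyed by model name, bounded by a memory budget in GB."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        # Maps model name -> footprint in GB; oldest use first, newest last.
        self._resident = OrderedDict()

    def request(self, name: str, size_gb: float) -> list[str]:
        """Mark `name` as used, loading it if needed. Returns evicted model names."""
        if size_gb > self.budget_gb:
            raise ValueError(f"{name} ({size_gb} GB) exceeds the {self.budget_gb} GB budget")
        evicted = []
        if name in self._resident:
            # Already hot: just refresh its recency.
            self._resident.move_to_end(name)
            return evicted
        # Evict least-recently-used models until the new one fits.
        while sum(self._resident.values()) + size_gb > self.budget_gb:
            victim, _ = self._resident.popitem(last=False)
            evicted.append(victim)
        self._resident[name] = size_gb
        return evicted

    def resident(self) -> list[str]:
        """Model names currently loaded, least recently used first."""
        return list(self._resident)


if __name__ == "__main__":
    cache = LRUModelCache(budget_gb=24)  # hypothetical budget matching an L4
    for model in ["encoder-a", "reranker-b", "extractor-c", "reranker-b", "encoder-a"]:
        evicted = cache.request(model, size_gb=9)  # made-up 9 GB footprint
        print(f"request {model}: evicted={evicted}, resident={cache.resident()}")
```

With these illustrative numbers only two models fit at once, so the third request evicts the oldest one, and a later request for the evicted model simply loads it again. That is the sense in which every model stays available regardless of VRAM.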

Is SIE open source? Yes. SIE is open source and available on GitHub. The core inference server is free to use. Superlinked also offers managed cloud deployment. Contact us to learn more.

How is SIE different from the Superlinked framework? The Superlinked framework is a higher-level Python SDK for building multi-attribute search and recommendation systems. SIE is the inference layer underneath it. You can use SIE standalone or as part of a full Superlinked stack.
