Launch February 4, 2026

Boost performance & reduce cost by self-hosting specialized AI models

By Daniel Svonava

The open-source opportunity

We’ve built TB-scale search systems for years, always on open-source models: embeddings, rerankers, classifiers, OCR. No matter the domain or language, a good OSS model either exists or is a fine-tuning run away.

🤗 now adds 100,000 new models each month:

Number of open-source models on huggingface.co

Source: Hugging Face Hub

Even the largest proprietary models now have an equally capable open-source alternative within months of launch. Yet companies collectively spend tens of billions of dollars on LLM APIs.

The multi-model problem

Real AI pipelines use many specialized models working together: dense and sparse embeddings, multi-vector representations like ColBERT, cross-encoder rerankers, classification, NER, relationship extraction, OCR, and image tagging. A single document processing pipeline might chain four of these.

But single-model inference servers weren’t built for this. Every model gets its own deployment and its own dedicated GPU pool. Five models means five pools at ~3% total utilization, because each pool is provisioned for peak load and sits idle the rest of the time.
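To make the utilization arithmetic concrete, here is a rough sketch. Every number and model name in it is hypothetical, chosen only so the dedicated-pool case lands near the ~3% figure above; none of it is a measurement. It compares five pools that are each sized for their own peak with one shared pool sized for an assumed combined peak.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    peak_gpus: float  # GPUs needed at this model's busiest moment (hypothetical)
    mean_gpus: float  # GPUs needed on average over the day (hypothetical)


def dedicated_utilization(workloads: List[Workload]) -> float:
    """Utilization when every model has its own pool sized for its own peak."""
    provisioned = sum(w.peak_gpus for w in workloads)
    used = sum(w.mean_gpus for w in workloads)
    return used / provisioned


def shared_utilization(workloads: List[Workload], combined_peak_gpus: float) -> float:
    """Utilization when all models share one pool sized for the combined peak."""
    used = sum(w.mean_gpus for w in workloads)
    return used / combined_peak_gpus


# Illustrative numbers only, not measurements.
workloads = [
    Workload("dense-embedder", peak_gpus=4, mean_gpus=0.2),
    Workload("sparse-embedder", peak_gpus=2, mean_gpus=0.1),
    Workload("reranker", peak_gpus=4, mean_gpus=0.1),
    Workload("ner", peak_gpus=2, mean_gpus=0.05),
    Workload("ocr", peak_gpus=8, mean_gpus=0.15),
]

# Assumption: the five peaks rarely coincide, so 8 GPUs cover the combined peak.
print(f"dedicated pools: {dedicated_utilization(workloads):.1%}")
print(f"shared pool:     {shared_utilization(workloads, combined_peak_gpus=8):.1%}")
```

The gain from sharing depends entirely on how far the combined peak falls below the sum of the individual peaks, which is an assumption in this sketch and has to be measured for a real workload.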

Managed inference providers are chasing general-purpose LLMs and ignoring small models. Open-source projects like Text Embeddings Inference (TEI) and vLLM still require home-grown infra around them, and support for new models is best-effort. There isn’t a good way to self-host a wide catalog of task-specific models in your own cloud.

Superlinked Inference Engine

SIE is a multi-model inference cluster for search and document processing. Instead of one service per model, it packs multiple models into each GPU and puts them behind a unified API. We released it under Apache 2.0 for AWS and GCP (Terraform and Helm included for easy setup).

SIE ships with 85+ models across encoding, scoring, and extraction. It lazy-loads them onto shared GPUs with elastic scaling and fast switching for evals and A/B tests. One cluster can handle pipeline and real-time workloads across multiple teams with familiar Kubernetes ergonomics.
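As a conceptual illustration of lazy-loading models onto a shared GPU, the sketch below keeps models resident until memory runs out and then evicts the least recently used one. This is not SIE's implementation or API: the class `LazyModelPool`, its parameters, the model names, and the memory sizes are all hypothetical, and the loader is a stand-in that returns a string instead of real weights.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, List


class LazyModelPool:
    """Hypothetical sketch: load models on first use, evict least recently used."""

    def __init__(
        self,
        capacity_gb: float,
        loader: Callable[[str], Any],
        sizes_gb: Dict[str, float],
    ) -> None:
        self._capacity_gb = capacity_gb
        self._loader = loader
        self._sizes_gb = sizes_gb
        self._resident: "OrderedDict[str, Any]" = OrderedDict()
        self._used_gb = 0.0

    def get(self, name: str) -> Any:
        # Already resident: mark as most recently used and return it.
        if name in self._resident:
            self._resident.move_to_end(name)
            return self._resident[name]

        size = self._sizes_gb[name]
        if size > self._capacity_gb:
            raise ValueError(f"{name} does not fit on this device")

        # Evict least recently used models until the new one fits.
        while self._used_gb + size > self._capacity_gb:
            evicted, _ = self._resident.popitem(last=False)
            self._used_gb -= self._sizes_gb[evicted]

        model = self._loader(name)  # the expensive step, paid only on a miss
        self._resident[name] = model
        self._used_gb += size
        return model

    def resident(self) -> List[str]:
        """Names of loaded models, least recently used first."""
        return list(self._resident)


# Stand-in loader and made-up sizes, for illustration only.
pool = LazyModelPool(
    capacity_gb=10.0,
    loader=lambda name: f"<{name} weights>",
    sizes_gb={"embedder": 4.0, "reranker": 4.0, "ocr": 6.0},
)
pool.get("embedder")
pool.get("reranker")
pool.get("embedder")    # hit: embedder becomes most recently used
pool.get("ocr")         # miss: reranker is evicted to make room
print(pool.resident())  # ['embedder', 'ocr']
```

A production system would also need to handle concurrent requests, in-flight batches on a model that is about to be evicted, and load latency, all of which this sketch ignores.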

Check the quickstart, clone one of our examples, or drop SIE into your existing app via native integrations with Chroma, LanceDB, Qdrant, Weaviate, CrewAI, DSPy, Haystack, LangChain, and LlamaIndex.

We’ll be adding hundreds more models and squeezing more performance per dollar out of your GPUs over the coming months. Let us know what you build.

Help us create a viable alternative to proprietary AI infrastructure by starring our repo, testing our product and giving us feedback.

Daniel, Ben and the Superlinked team

