
Boost performance & reduce cost by self-hosting specialized AI models


The open-source opportunity

We’ve built TB-scale search systems for years, always on open-source models: embeddings, rerankers, classifiers, OCR. No matter the domain or language, a good OSS model either exists or is a fine-tuning run away.

Hugging Face now adds 100,000 new models each month:

[Chart: number of open-source models on the Hugging Face Hub, 2022–2025, reaching 2.745M by Mar 26, 2025. Milestones marked include ColBERTv2, E5, SigLIP, BGE-M3, ColPali, GOT-OCR2, GLiNER, ReLiK, ModernColBERT, Qwen3, Jina v3, GLM-OCR, Nemotron, and Voxtral.]

Source: Hugging Face Hub

Even the largest proprietary models now have an equally capable open source alternative within months of launch. Yet most companies still spend billions on LLM APIs instead of running their own.

The multi-model problem

Real AI pipelines use many specialized models working together: dense and sparse embeddings, multi-vector representations like ColBERT, vision models like SigLIP, cross-encoder rerankers, NER, classification, OCR. A single document processing pipeline might chain four of these.
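Such a chain can be sketched in a few lines. The stage functions below are deliberately trivial stand-ins (the real pipeline would call hosted OCR, embedding, and reranker models); only the shape of the chaining is the point:

```python
# Sketch of a document pipeline chaining specialized models.
# Each function is a stand-in for a real model call (e.g. a GOT-OCR2-style
# OCR model, a dense embedder like BGE-M3, a cross-encoder reranker).

def ocr(page_bytes: bytes) -> str:
    # stand-in: a real OCR model would extract text from an image/PDF page
    return page_bytes.decode("utf-8")

def embed(text: str) -> list[float]:
    # stand-in: a real dense embedder returns a fixed-size vector
    return [float(len(w)) for w in text.split()]

def rerank(query: str, docs: list[str]) -> list[str]:
    # stand-in: a real cross-encoder scores each (query, doc) pair
    score = lambda d: len(set(query.split()) & set(d.split()))
    return sorted(docs, key=score, reverse=True)

# Chain the stages: OCR -> embed for retrieval -> rerank candidates
text = ocr(b"annual revenue grew")
vec = embed(text)
ranked = rerank("revenue", ["costs fell", "annual revenue grew"])
```

Four different model families, one logical pipeline — which is exactly why deploying each behind its own dedicated service hurts.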

But single-model inference servers weren’t built for this. Every model gets its own deployment and its own dedicated GPU pool, each provisioned for peak load and idle the rest of the time. Five models, five pools, ~3% total utilization.

Managed inference providers are chasing general-purpose LLMs and ignoring small models. Open source projects like TEI and vLLM still require home-grown infra around them, and support for new models is best-effort. There hasn’t been a good way to self-host hundreds of task-specific models in your own cloud.

Superlinked Inference Engine

SIE is a multi-model inference cluster for search and document processing. Instead of one service per model, it runs all of them behind a single API. We released it under Apache 2.0 for AWS and GCP (Terraform and Helm included for easy setup).

SIE ships with 85+ models across encoding, scoring, and extraction. It lazy-loads them onto shared GPUs with elastic scaling and fast switching for evals and A/B tests. One cluster can handle pipeline and real-time workloads across multiple teams with familiar Kubernetes ergonomics.
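As a rough illustration of the single-API idea, requests against one shared endpoint might differ only in which model they name. The field names and request shapes below are assumptions for illustration, not SIE's documented API:

```python
import json

# Hypothetical request bodies for one shared multi-model endpoint.
# Field names ("model", "input", "query", "documents") are illustrative.
def build_request(model: str, payload: dict) -> str:
    return json.dumps({"model": model, **payload})

# Same endpoint, different models: the cluster lazy-loads whichever
# model a request names onto a shared GPU.
dense = build_request("BAAI/bge-m3", {"input": ["quarterly report"]})
rerank = build_request("cross-encoder/ms-marco-MiniLM-L-6-v2",
                       {"query": "quarterly report",
                        "documents": ["doc a", "doc b"]})
```

The design point is that model selection moves from infrastructure (which service do I call?) into the request itself, so adding a model for an eval or A/B test doesn't require a new deployment.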

Check the quickstart, clone one of our examples, or drop SIE into your existing app via native integrations with Chroma, Weaviate, LangChain, LlamaIndex, DSPy, Haystack, and CrewAI.

We’ll be adding hundreds more models and squeezing more performance per dollar out of your GPUs over the coming months. Let us know what you build.

Let’s make 2026 the year of open source AI!

Daniel, Ben and the Superlinked team
