Boost performance & reduce cost by self-hosting specialized AI models

The open-source opportunity

We’ve built TB-scale search systems for years, always on open-source models: embeddings, rerankers, classifiers, OCR. No matter the domain or language, a good OSS model either exists or is a fine-tuning run away.

🤗 now adds 100,000 new models each month:

Number of open-source models on huggingface.co
[Chart: model count on a 0 to 3M axis, 2022 through 2025, reaching 2.745M at Mar 26. Labeled releases: ColBERTv2, E5, SigLIP, BGE-M3, ColPali, GOT-OCR2, GLiNER, ReLiK, ModernColBERT, Qwen3, Jina v3, GLM-OCR, Nemotron, Voxtral.]

Source: Hugging Face Hub

Even the largest proprietary models now have an equally capable open-source alternative within months of launch. Yet companies collectively spend tens of billions of dollars on LLM APIs.

The multi-model problem

Real AI pipelines use many specialized models working together: dense and sparse embeddings, multi-vector representations like ColBERT, cross-encoder rerankers, classification, NER, relationship extraction, OCR, and image tagging. A single document processing pipeline might chain four of these.

But single-model inference servers weren’t built for this. Every model gets its own deployment and its own dedicated GPU pool. Five models means five pools at ~3% total utilization, each provisioned for peak load and idle the rest of the time.
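To make the utilization arithmetic concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up assumption chosen to land near the ~3% figure above, not a measurement: five hypothetical models, each needing 15% of one GPU at peak and 3% on average. The names `ModelLoad`, `dedicated_utilization`, and `shared_utilization` are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelLoad:
    name: str
    peak_gpus: float  # fraction of one GPU needed at peak (hypothetical)
    avg_gpus: float   # fraction of one GPU busy on average (hypothetical)


def dedicated_utilization(loads):
    # One pool per model: each pool is rounded up to whole GPUs for its own peak.
    provisioned = sum(math.ceil(m.peak_gpus) for m in loads)
    return sum(m.avg_gpus for m in loads) / provisioned


def shared_utilization(loads):
    # One shared pool, sized for the worst case where every peak coincides.
    provisioned = math.ceil(sum(m.peak_gpus for m in loads))
    return sum(m.avg_gpus for m in loads) / provisioned


# Illustrative inputs only: five small models with identical load profiles.
loads = [
    ModelLoad(name, peak_gpus=0.15, avg_gpus=0.03)
    for name in ("dense", "sparse", "reranker", "ner", "ocr")
]

print(f"dedicated pools: {dedicated_utilization(loads):.0%}")
print(f"shared pool:     {shared_utilization(loads):.0%}")
```

With these assumed inputs, the dedicated layout pays for five GPUs to do about 0.15 GPUs of work, while a single shared GPU covers the same load even if all peaks coincide. Real workloads will differ; the point is only that rounding each small model up to its own GPU dominates the bill.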

Managed inference providers are chasing general-purpose LLMs and ignoring small models. Open source projects like TEI and vLLM still require home-grown infra around them, and support for new models is best-effort. There isn’t a good way to self-host a wide catalog of task-specific models in your own cloud.

Superlinked Inference Engine

SIE is a multi-model inference cluster for search and document processing. Instead of one service per model, it packs multiple models into each GPU and puts them behind a unified API. We released it under Apache 2.0 for AWS and GCP (Terraform and Helm included for easy setup).

SIE ships with 85+ models across encoding, scoring, and extraction. It lazy-loads them onto shared GPUs with elastic scaling and fast switching for evals and A/B tests. One cluster can handle pipeline and real-time workloads across multiple teams with familiar Kubernetes ergonomics.
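As a rough illustration of what lazy-loading models onto a shared GPU can look like, here is a minimal sketch. It is not SIE's implementation: the `LazyModelPool` class, the least-recently-used eviction policy, and the `capacity` count (a stand-in for GPU memory) are all assumptions made for the example, and the "models" are placeholder strings.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class LazyModelPool:
    """Hypothetical per-GPU cache: load a model on first use, evict the least recently used."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], capacity: int):
        self._loaders = loaders      # model name -> function that loads it
        self._capacity = capacity    # max resident models (stand-in for GPU memory)
        self._resident: "OrderedDict[str, object]" = OrderedDict()
        self.load_count = 0          # how many (slow) loads have happened

    def get(self, name: str) -> object:
        if name in self._resident:
            self._resident.move_to_end(name)     # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self._capacity:
            self._resident.popitem(last=False)   # evict the least recently used model
        model = self._loaders[name]()            # lazy load on first request
        self.load_count += 1
        self._resident[name] = model
        return model

    def resident(self) -> List[str]:
        return list(self._resident)


# Placeholder loaders; a real loader would move weights onto the GPU.
pool = LazyModelPool(
    loaders={n: (lambda n=n: f"<{n} weights>") for n in ("embedder", "reranker", "ocr")},
    capacity=2,
)
for request in ("embedder", "reranker", "embedder", "ocr"):
    pool.get(request)

print(pool.resident(), pool.load_count)
```

In this toy run, the second `embedder` request is served from the cache, and the `ocr` request evicts `reranker` because it was used least recently. A production system would also have to account for actual memory footprints, in-flight requests, and scaling across GPUs, none of which this sketch attempts.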

Check the quickstart, clone one of our examples, or drop SIE into your existing app via native integrations with Chroma, LanceDB, Qdrant, Weaviate, CrewAI, DSPy, Haystack, LangChain, and LlamaIndex.

We’ll be adding hundreds more models and squeezing more performance per dollar out of your GPUs over the coming months. Let us know what you build.

Help us create a viable alternative to proprietary AI infrastructure by starring our repo, testing our product and giving us feedback.

Daniel, Ben and the Superlinked team
