Boost performance & reduce cost by self-hosting specialized AI models
The open-source opportunity
We've built TB-scale search systems for years, always on open source models. Embeddings, rerankers, classifiers, OCR. No matter the domain or language, a good OSS model either exists or is a fine-tuning run away.
Hugging Face now adds 100,000 new models each month:
Source: Hugging Face Hub
Even the largest proprietary models now have an equally capable open source alternative within months of launch. Yet companies spend a total of tens of billions of dollars on LLM APIs.
The multi-model problem
Real AI pipelines use many specialized models working together: dense and sparse embeddings, multi-vector representations like ColBERT, cross-encoder rerankers, classification, NER, relationship extraction, OCR, and image tagging. A single document processing pipeline might chain four of these.
But single-model inference servers weren't built for this. Every model gets its own deployment, its own dedicated GPU pool. Five models, five pools, ~3% total utilization; each provisioned for peak load and idle the rest of the time.
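To make the arithmetic behind that idle capacity concrete, here is a minimal sketch that compares dedicated pools with one shared pool. The demand numbers and model names are invented purely for illustration and are not measurements from any real deployment; the only point is that peaks which do not coincide let a shared pool be sized well below the sum of the individual peaks.

```python
# Hypothetical hourly GPU demand (in GPUs) for three models over six hours.
# These numbers are made up for illustration only.
demand = {
    "embedder": [1, 1, 8, 1, 1, 1],
    "reranker": [1, 6, 1, 1, 1, 1],
    "ocr":      [1, 1, 1, 1, 10, 1],
}

hours = len(next(iter(demand.values())))

# GPU-hours actually used across all models.
total_used = sum(sum(series) for series in demand.values())

# Dedicated pools: every model keeps its own peak capacity for all hours.
dedicated_capacity = sum(max(series) for series in demand.values()) * hours

# Shared pool: capacity is sized for the peak of the combined demand.
combined = [sum(series[h] for series in demand.values()) for h in range(hours)]
shared_capacity = max(combined) * hours

dedicated_util = total_used / dedicated_capacity
shared_util = total_used / shared_capacity

print(f"dedicated pools: {dedicated_util:.0%} utilized")
print(f"shared pool:     {shared_util:.0%} utilized")
```

With these toy numbers the shared pool needs half the provisioned GPU-hours of the dedicated pools for the same work. Real traffic will look different, and the sketch ignores autoscaling and model-loading overhead.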
Managed inference providers are chasing general-purpose LLMs and ignoring small models. Open source projects like TEI and vLLM still require home-grown infra around them, and support for new models is best-effort. There isn't a good way to self-host a wide catalog of task-specific models in your own cloud.
Superlinked Inference Engine
SIE is a multi-model inference cluster for search and document processing. Instead of one service per model, it packs multiple models into each GPU and puts them behind a unified API. We released it under Apache 2.0 for AWS and GCP (Terraform and Helm included for easy setup).
SIE ships with 85+ models across encoding, scoring, and extraction. It lazy-loads them onto shared GPUs with elastic scaling and fast switching for evals and A/B tests. One cluster can handle pipeline and real-time workloads across multiple teams with familiar Kubernetes ergonomics.
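The general idea of lazy-loading many models onto a shared device can be sketched in a few lines. This is not SIE's implementation or API: `LazyModelPool`, its `max_loaded` limit, and the least-recently-used eviction policy are all assumptions chosen for illustration, and the "models" here are plain Python callables standing in for real weights on a GPU.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, List


class LazyModelPool:
    """Hypothetical sketch: load models on first use, evict the least recently used."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]], max_loaded: int) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._loaders = loaders          # model name -> function that loads it
        self._max_loaded = max_loaded    # stand-in for a GPU memory budget
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()
        self.load_count = 0              # how many (expensive) loads happened

    def get(self, name: str) -> Any:
        # Fast path: model is already resident, mark it as recently used.
        if name in self._loaded:
            self._loaded.move_to_end(name)
            return self._loaded[name]
        if name not in self._loaders:
            raise KeyError(f"unknown model: {name}")
        # Make room by evicting the least recently used model(s).
        while len(self._loaded) >= self._max_loaded:
            self._loaded.popitem(last=False)
        model = self._loaders[name]()
        self.load_count += 1
        self._loaded[name] = model
        return model

    def loaded(self) -> List[str]:
        """Names of resident models, least recently used first."""
        return list(self._loaded)


# Usage with placeholder "models" (simple functions instead of real weights).
pool = LazyModelPool(
    loaders={
        "embedder": lambda: (lambda text: [float(len(text))]),
        "reranker": lambda: (lambda query, doc: float(query in doc)),
        "classifier": lambda: (lambda text: "long" if len(text) > 20 else "short"),
    },
    max_loaded=2,
)
pool.get("embedder")("hello")
pool.get("reranker")("gpu", "shared gpu pool")
pool.get("embedder")("hello again")       # already resident, no reload
pool.get("classifier")("a short text")    # evicts "reranker"
print(pool.loaded())
```

A production system would budget by actual GPU memory rather than a model count and would handle concurrent requests, but the same load-on-demand and evict-when-full pattern is what lets several models share one device.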
Check the quickstart, clone one of our examples, or drop SIE into your existing app via native integrations with Chroma, LanceDB, Qdrant, Weaviate, CrewAI, DSPy, Haystack, LangChain, and LlamaIndex.
We'll be adding hundreds more models and squeezing more performance per dollar out of your GPUs over the coming months. Let us know what you build.
Help us create a viable alternative to proprietary AI infrastructure by starring our repo, testing our product and giving us feedback.
Daniel, Ben and the Superlinked team