---
title: Boost performance & reduce cost by self-hosting specialized AI models
description: Introducing SIE, a multi-model inference cluster for search and document processing workloads, released under Apache 2.0.
canonical_url: https://superlinked.com/blog/launch
last_updated: 2026-05-20
---

## The open-source opportunity

We've built TB-scale search systems for years, always on open-source models. Embeddings, rerankers, classifiers, OCR. No matter the domain or language, a good OSS model either exists or is a fine-tuning run away.

Hugging Face (🤗) now adds 100,000 new models each month:

<GrowthChart />

Even the largest proprietary models now get an equally capable open-source alternative within months of launch. Yet companies collectively spend tens of billions of dollars on LLM APIs.

## The multi-model problem

Real AI pipelines use many specialized models working together: dense and sparse embeddings, multi-vector representations like ColBERT, cross-encoder rerankers, classification, NER, relationship extraction, OCR, and image tagging. A single document processing pipeline might chain four of these.

But single-model inference servers weren't built for this. Every model gets its own deployment and its own dedicated GPU pool. Five models, five pools, ~3% total utilization: each pool is provisioned for peak load and sits idle the rest of the time.
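To make the utilization argument concrete, here is a small back-of-the-envelope sketch. The load profiles are made-up numbers chosen only to show the direction of the effect; they are not measurements, and they are not meant to reproduce the ~3% figure above. The sketch assumes each dedicated pool is sized for its own model's peak, while a shared pool only has to cover the peak of the combined load.

```python
# Hypothetical load per model over four time slots, in percent of one GPU.
# These figures are illustrative assumptions, not data from a real cluster.
loads = {
    "embedder": [10, 180, 10, 10],
    "reranker": [10, 10, 90, 10],
    "ocr":      [160, 10, 10, 10],
}


def gpus_needed(peak_percent: int) -> int:
    """Round a peak load (percent of one GPU) up to whole GPUs."""
    return -(-peak_percent // 100)  # integer ceiling division


# Dedicated pools: every model is provisioned for its own peak.
dedicated_gpus = sum(gpus_needed(max(profile)) for profile in loads.values())

# Shared pool: provision once, for the peak of the summed load.
combined = [sum(slot) for slot in zip(*loads.values())]
shared_gpus = gpus_needed(max(combined))

# Average demand across all slots, in percent of one GPU.
mean_load = sum(combined) / len(combined)

dedicated_util = mean_load / (dedicated_gpus * 100)
shared_util = mean_load / (shared_gpus * 100)

print(f"dedicated: {dedicated_gpus} GPUs, {dedicated_util:.0%} utilized")
print(f"shared:    {shared_gpus} GPUs, {shared_util:.0%} utilized")
```

The gap comes entirely from the peaks not lining up in time. If every model peaked in the same slot, the shared pool would need as many GPUs as the dedicated pools combined.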

Managed inference providers are chasing general-purpose LLMs and ignoring small models. Open-source projects like TEI and vLLM still require home-grown infra around them, and support for new models is best-effort. There isn't a good way to self-host a wide catalog of task-specific models in your own cloud.

## Superlinked Inference Engine

SIE is a multi-model inference cluster for search and document processing. Instead of one service per model, it packs multiple models into each GPU and puts them behind a unified API. We released it under Apache 2.0 for AWS and GCP (Terraform and Helm included for easy setup).

SIE ships with 85+ models across encoding, scoring, and extraction. It lazy-loads them onto shared GPUs with elastic scaling and fast switching for evals and A/B tests. One cluster can handle pipeline and real-time workloads across multiple teams with familiar Kubernetes ergonomics.
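For readers who want a mental model of lazy loading onto a shared GPU, the sketch below shows one way it could work in principle. It is not SIE's implementation or API: the `LazyModelCache` class, the `loader` callback, the per-model sizes, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable


class LazyModelCache:
    """Toy model cache: load on first use, evict least recently used when full.

    Hypothetical sketch only; memory is tracked as plain integers (MB).
    """

    def __init__(self, capacity_mb: int, sizes_mb: dict[str, int],
                 loader: Callable[[str], Any]) -> None:
        self._capacity = capacity_mb
        self._sizes = sizes_mb
        self._loader = loader
        self._resident: OrderedDict[str, Any] = OrderedDict()
        self._used = 0

    def get(self, name: str) -> Any:
        # Cache hit: mark the model as most recently used and return it.
        if name in self._resident:
            self._resident.move_to_end(name)
            return self._resident[name]

        size = self._sizes[name]
        if size > self._capacity:
            raise ValueError(f"{name} does not fit in {self._capacity} MB")

        # Cache miss: evict least recently used models until there is room.
        while self._used + size > self._capacity:
            evicted, _ = self._resident.popitem(last=False)
            self._used -= self._sizes[evicted]

        self._resident[name] = self._loader(name)
        self._used += size
        return self._resident[name]

    def resident(self) -> list[str]:
        """Names of loaded models, least recently used first."""
        return list(self._resident)


# Usage with a stand-in loader that just records which models were loaded.
load_log: list[str] = []


def fake_loader(name: str) -> str:
    load_log.append(name)
    return f"<model {name}>"


cache = LazyModelCache(
    capacity_mb=1000,
    sizes_mb={"a": 600, "b": 600, "c": 300},
    loader=fake_loader,
)
for requested in ["a", "b", "c", "b", "a"]:
    cache.get(requested)
```

In this toy run, the second request for `b` is served from memory without a reload, while the final request for `a` has to evict both `c` and `b` to make room. A production scheduler would also need to handle concurrency, load latency, and real GPU memory accounting, none of which is modeled here.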

Check the [quickstart](/docs/quickstart), clone one of our [examples](/docs/examples), or drop SIE into your existing app via native integrations with [Chroma](/docs/integrations/chroma), [LanceDB](/docs/integrations/lancedb), [Qdrant](/docs/integrations/qdrant), [Weaviate](/docs/integrations/weaviate), [CrewAI](/docs/integrations/crewai), [DSPy](/docs/integrations/dspy), [Haystack](/docs/integrations/haystack), [LangChain](/docs/integrations/langchain), and [LlamaIndex](/docs/integrations/llamaindex).

We'll be adding hundreds more models and squeezing more performance per dollar out of your GPUs over the coming months. Let us know what you build.

Help us create a viable alternative to proprietary AI infrastructure by [starring our repo](https://github.com/superlinked/sie), testing our product, and giving us feedback.

Daniel, Ben and the Superlinked team
