One GPU, Four Retrieval Modes: How to Serve Hybrid Search Without Four Separate Deployments

TL;DR: Competitive retrieval in 2026 runs four model types: dense embeddings for semantic matching, sparse vectors for keyword recall, ColBERT for token-level precision, and a cross-encoder reranker for final ordering.

The usual setup gives each its own container and GPU, which leaves most of the hardware idle. In his Berlin Buzzwords 2026 talk, Filip shows how to serve all four from one process and one GPU through a single API, using the open-source Superlinked Inference Engine (SIE). Same server, same GPU, three primitives: encode, score, and extract.

Watch Filip’s full talk, then read on for how to put four retrieval modes on one GPU:

Why does competitive search now need four model types?

Single-vector semantic search is no longer enough on its own. Strong retrieval pipelines in 2026 combine four complementary modes, each covering a weakness of the others:

  • Dense embeddings capture meaning. They compress text into fixed-size vectors (384 to 4096+ dimensions depending on the model) and handle semantic matching, where the query and the document share intent but not words.
  • Sparse vectors capture keywords. They assign weights directly to vocabulary tokens, which preserves exact term matching for product names, proper nouns, and acronyms that dense models tend to blur.
  • ColBERT and multi-vector models capture token-level detail. Late interaction matches at the token level for higher precision than a single pooled vector allows.
  • Cross-encoder rerankers clean up the final order. They re-score the top query-document pairs jointly for a last precision boost before results reach the user.
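To make the difference between the first three modes concrete, here is a minimal sketch of how each one typically scores a query against a document. It is an illustration in plain Python, not SIE's implementation: the function names are hypothetical, dense vectors are plain lists, sparse vectors are token-to-weight dicts, and multi-vector outputs are lists of per-token vectors.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two fixed-size dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q, d):
    """Dot product over the vocabulary tokens both sides share.

    q and d map token -> weight, so only exact term overlaps contribute.
    """
    return sum(weight * d[token] for token, weight in q.items() if token in d)


def maxsim_score(q_tokens, d_tokens):
    """Late interaction (ColBERT-style MaxSim).

    Each query token vector picks its best-matching document token vector,
    and the per-token maxima are summed.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(qv, dv) for dv in d_tokens) for qv in q_tokens)


if __name__ == "__main__":
    print(dense_score([1.0, 0.0], [1.0, 0.0]))
    print(sparse_score({"gpu": 2.0, "search": 1.0}, {"gpu": 0.5, "cat": 3.0}))
    print(maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]]))
```

A cross-encoder has no comparable closed-form score: it feeds the query and document through the model together, which is why it is reserved for the small set of top candidates.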

In the talk, Filip walks through a real pipeline that chains these stages, shows what each one adds to retrieval quality using BEIR benchmark data, and is candid about when the extra complexity is not worth it. The point is not that every system needs all four. It is that competitive search increasingly does, and the infrastructure has to keep up.

Why is the one-container-per-model setup so wasteful?

The infrastructure story is the ugly part. The industry default is one container per model, usually a Hugging Face TEI server, Triton, or a custom Flask wrapper. Four retrieval modes then mean four separate deployments, four sets of scaling rules, and four GPU allocations, each reserving far more than its model actually uses.

Superlinked’s own framing of the problem is blunt: five models, five dedicated pools, each provisioned for peak load and idle the rest of the time, landing at roughly 3% total utilization. The four-mode retrieval stack follows the same pattern: you pay for four GPUs to do the work of a fraction of one, and you carry the operational weight of four services that all have to be scaled, monitored, and upgraded independently.

This is the gap Filip’s talk targets. The retrieval quality story has matured. The serving story has not.

What does serving four retrieval modes from one GPU look like?

SIE takes the opposite approach: one server process that handles all four retrieval modes through a unified API with three primitives, encode, score, and extract. Same server, same GPU, same API.

The key enabler is that several modes collapse into one call. BGE-M3 produces dense, sparse, and multi-vector outputs simultaneously from a single encode request, which the sparse and hybrid docs note is more efficient than calling separate dense and sparse models. Cross-encoder reranking runs through the score primitive. So a hybrid pipeline that used to span four containers becomes a handful of calls against one endpoint:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Dense + sparse from a single encode call (BGE-M3 also supports multi-vector).
result = client.encode(
    "BAAI/bge-m3",
    Item(text="What is hybrid search?"),
    output_types=["dense", "sparse"],
    is_query=True,
)

# Combine in your vector DB, e.g. alpha * dense_score + (1 - alpha) * sparse_score

# Rerank the top candidates with a cross-encoder via the score primitive.
ranked = client.score(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    Item(text="What is hybrid search?"),
    [
        Item(text="Hybrid search combines dense and sparse retrieval."),
        Item(text="An unrelated passage."),
    ],
)

One process, one GPU, and the dense, sparse, ColBERT, and reranking stages all served from it.
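The `alpha * dense_score + (1 - alpha) * sparse_score` comment in the snippet above leaves the fusion step to your vector database. If you want to reason about it outside the database, a minimal sketch might look like the following. The helper names `min_max` and `fuse` are hypothetical, and the min-max normalization is an assumption added here so that dense and sparse scores land on a comparable scale before they are weighted; your database may normalize differently or not at all.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(dense, sparse, alpha=0.5):
    """Weighted fusion: alpha * dense + (1 - alpha) * sparse.

    A document missing from one result list contributes 0 for that mode.
    Returns (doc_id, fused_score) pairs, best first.
    """
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    dense_hits = {"a": 0.9, "b": 0.5, "c": 0.1}   # illustrative scores
    sparse_hits = {"b": 12.0, "c": 3.0}           # illustrative scores
    for doc_id, score in fuse(dense_hits, sparse_hits, alpha=0.5):
        print(doc_id, round(score, 3))
```

Raising `alpha` favors semantic matches, lowering it favors exact keyword hits. The fused top candidates are what you would then hand to the cross-encoder through the score primitive.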

How does one process keep many models efficient?

Packing models onto a GPU is easy to say and hard to do well. The talk covers the three pieces that make it work in production, and each is reflected in the SIE architecture docs.

Adapter architecture, not one runtime. SIE wraps PyTorch, Flash Attention, SGLang, and other backends behind a common interface, and the worker selects the right backend per model. Different architectures need different compute paths, so forcing everything through a single unified runtime would be the wrong call. The adapters are what let a BERT encoder, a BGE-M3 hybrid model, and a cross-encoder share one server.
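As a rough illustration of the adapter idea, and not SIE's actual classes, the sketch below puts two stand-in backends behind one interface and lets a worker pick the adapter per model. Every name here (`Adapter`, `EncoderAdapter`, `CrossEncoderAdapter`, `ADAPTERS`, `Worker`) is hypothetical, and the "inference" is a placeholder that returns string lengths so the example runs without any ML dependencies.

```python
from typing import Protocol


class Adapter(Protocol):
    """Common interface every backend adapter exposes to the worker."""

    def run(self, op: str, inputs: list[str]) -> list: ...


class EncoderAdapter:
    """Stand-in for a backend suited to embedding models."""

    def run(self, op, inputs):
        if op != "encode":
            raise ValueError(f"unsupported op: {op}")
        return [[float(len(text))] for text in inputs]  # placeholder vectors


class CrossEncoderAdapter:
    """Stand-in for a backend suited to pairwise scoring models."""

    def run(self, op, inputs):
        if op != "score":
            raise ValueError(f"unsupported op: {op}")
        return [float(len(text)) for text in inputs]  # placeholder scores


# Hypothetical registry: model kind -> adapter class.
ADAPTERS = {"encoder": EncoderAdapter, "cross-encoder": CrossEncoderAdapter}


class Worker:
    """Selects an adapter per model and reuses it on later requests."""

    def __init__(self, model_kinds):
        self._model_kinds = model_kinds  # model name -> kind
        self._loaded = {}                # model name -> adapter instance

    def handle(self, model, op, inputs):
        if model not in self._loaded:
            self._loaded[model] = ADAPTERS[self._model_kinds[model]]()
        return self._loaded[model].run(op, inputs)


if __name__ == "__main__":
    worker = Worker({
        "BAAI/bge-m3": "encoder",
        "cross-encoder/ms-marco-MiniLM-L-6-v2": "cross-encoder",
    })
    print(worker.handle("BAAI/bge-m3", "encode", ["hybrid search"]))
    print(worker.handle("cross-encoder/ms-marco-MiniLM-L-6-v2", "score", ["a passage"]))
```

The caller only ever sees `handle(model, op, inputs)`. Which compute path runs underneath is a per-model decision hidden behind the interface.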

A shared queue with batching, not per-worker queues. Rather than isolating each model behind its own queue, SIE routes work through a shared queue (NATS JetStream) and forms batches by model and operation before they hit the GPU. The worker sidecar reports queue depth as part of its health signal, which keeps load balanced against hardware capacity instead of letting one model starve the others.
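The batching step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not SIE's scheduler: an in-memory list stands in for the shared queue (no NATS calls are made), and `Request`, `form_batches`, and `max_batch_size` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    model: str    # e.g. "BAAI/bge-m3"
    op: str       # "encode", "score", or "extract"
    payload: str


def form_batches(queue, max_batch_size=8):
    """Group pending requests by (model, op), then split into GPU-sized batches.

    Requests for different models or operations never share a batch,
    because they cannot run in the same forward pass.
    """
    groups = defaultdict(list)
    for request in queue:
        groups[(request.model, request.op)].append(request)

    batches = []
    for key, requests in groups.items():
        for start in range(0, len(requests), max_batch_size):
            batches.append((key, requests[start:start + max_batch_size]))
    return batches


if __name__ == "__main__":
    pending = [
        Request("BAAI/bge-m3", "encode", "query one"),
        Request("cross-encoder/ms-marco-MiniLM-L-6-v2", "score", "pair one"),
        Request("BAAI/bge-m3", "encode", "query two"),
        Request("BAAI/bge-m3", "encode", "query three"),
        Request("cross-encoder/ms-marco-MiniLM-L-6-v2", "score", "pair two"),
    ]
    for (model, op), batch in form_batches(pending, max_batch_size=2):
        print(model, op, len(batch))
```

Because all requests pass through one queue, interleaved traffic for different models still ends up in dense per-model batches instead of many small ones.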

Hot-swapping with LRU eviction. Models load lazily on first request and are evicted least-recently-used when GPU memory fills, so you keep a hot working set rather than pinning every model permanently. Per the deployment docs, an L4 (24GB) keeps 2 to 3 standard models hot at once, while all 85+ supported models stay available at query time regardless of VRAM.
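Lazy loading with LRU eviction is a standard pattern, and a minimal sketch of it might look like this. It is not SIE's memory manager: `ModelCache` and `loader` are hypothetical names, model sizes are illustrative numbers supplied by the caller, and a string stands in for the loaded model.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps a hot working set of models within a fixed memory budget."""

    def __init__(self, capacity_gb, loader):
        self.capacity_gb = capacity_gb
        self.loader = loader        # hypothetical: name -> (model, size_gb)
        self._hot = OrderedDict()   # name -> (model, size_gb), oldest first

    def used_gb(self):
        return sum(size for _, size in self._hot.values())

    def hot_models(self):
        return list(self._hot)

    def get(self, name):
        # Cache hit: mark as most recently used and return.
        if name in self._hot:
            self._hot.move_to_end(name)
            return self._hot[name][0]

        # Cache miss: load lazily on first request.
        model, size = self.loader(name)
        if size > self.capacity_gb:
            raise ValueError(f"{name} does not fit in {self.capacity_gb} GB")

        # Evict least-recently-used models until the new one fits.
        while self._hot and self.used_gb() + size > self.capacity_gb:
            self._hot.popitem(last=False)

        self._hot[name] = (model, size)
        return model


if __name__ == "__main__":
    sizes = {"dense": 10, "reranker": 10, "colbert": 10}  # illustrative sizes in GB
    cache = ModelCache(24, lambda name: (f"<{name} weights>", sizes[name]))
    for name in ["dense", "reranker", "dense", "colbert"]:
        cache.get(name)
    print(cache.hot_models())
```

In this run, `reranker` is the least recently used model when `colbert` arrives, so it is the one evicted. It stays available, and the next request for it simply pays the load cost again.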

Can I run this in my own cloud?

Yes, and that is much of the point. SIE is Apache 2.0 and ships with Helm charts and Terraform configurations for AWS and GCP, so you can self-host the whole pipeline inside your own VPC. Your documents and queries never leave your cloud, which matters for regulated and security-sensitive workloads.

Observability comes built in. SIE provides Grafana dashboards that track queue depth, autoscaling, and throughput, and it supports scale-from-zero so idle pools cost nothing until traffic arrives. You get the operational visibility of a managed service while keeping the data sovereignty of running it yourself.

When should a model still get its own container?

Multi-model serving on one GPU is not a universal answer, and Filip is direct about the tradeoffs. Under heavy concurrent load, models on the same GPU compete for memory, so a latency-critical model running at high, steady QPS can be a better fit for its own dedicated pool, where it never has to share. The sweet spot for one-GPU multi-model serving is mixed and pipeline workloads, plus the long tail of task-specific models with bursty demand, which is exactly the shape of a hybrid retrieval stack. The talk shares real data from running these workloads on L4 GPUs to show where each choice wins.

The same infrastructure extends naturally to agents. In agentic workflows, small models do the unglamorous preprocessing: pruning context, extracting metadata, and classifying intent before anything hits a larger and more expensive LLM. SIE serves as the inference engine for those steps, the tool that the agent calls.

Filip’s broader argument is economic and strategic. As open-source models keep narrowing the gap with frontier models, the case for running sovereign AI infrastructure on hardware you control, for cost and for security, keeps getting stronger. Hybrid search is the near-term workload. Owning your inference stack is the longer-term position.

The takeaway

Hybrid retrieval quality has settled into four stages, but most teams still pay for them with four separate deployments and most of a GPU sitting idle. Serving dense, sparse, ColBERT, and reranking from one process on one GPU removes that waste without giving up quality or data control. The models were never the hard part. The serving was.

Browse the 85+ supported models, read the architecture docs, and run it yourself at github.com/superlinked/sie and /docs. For the full pipeline walkthrough and the BEIR and L4 benchmark data, watch Filip’s Berlin Buzzwords 2026 talk above.
