
My agent is dumb: how to route each task to the right model (and make it smarter)

You route a task to a model by naming the model in the call, not by standing up a service for it.

Point every request at one Superlinked Inference Engine (SIE) endpoint, then pick the function that matches the work: encode for retrieval, score for reranking, extract for pulling structured fields.

SIE loads the named model on demand and shares the GPU across all of them.

SIE is open source under Apache 2.0: github.com/superlinked/sie.

The rest of this post lays out the mental model and walks through a worked example.

How do I route different AI agent tasks to the right model?

Send every request to one SIE endpoint and pick the function that matches the task. The model identifier you pass is the routing key, and SIE loads that model on demand. No task ever needs its own service.

The routing decision is one line, not one service

A real agent touches several models in a single turn. It embeds a query, reranks the candidates, and extracts a few fields from the winning document. The old pattern gives each of those its own server, its own URL, and its own GPU pool, so “route this task to the right model” becomes a networking problem.

SIE flips it. The model is an argument, not an address. The function name carries the operation, the identifier carries the model, and placement happens behind the endpoint.

A worked example

Install and start the server:

pip install sie-server
sie-server serve # auto-detects CUDA, Apple Silicon, or CPU

Then route three tasks through one client:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Retrieval -> dense encoder
client.encode("NovaSearch/stella_en_400M_v5", Item(text="Berlin office revenue"))

# Precision -> reranker
# `shortlist` is the list of candidate texts returned by your retrieval step
client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text="Berlin office revenue"),
    [Item(text=c) for c in shortlist],
)

# Structured fields -> extractor
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text="Invoice 4471 from Acme GmbH, due 30 June."),
    labels=["invoice_number", "organization", "date"],
)

Three different models, one endpoint, no per-model deployment.

Where do I keep the task-to-model mapping?

In your application, where the routing logic belongs. A small table keeps it readable and makes swapping a model a one-line edit:

TASK_MODELS = {
    "retrieve": ("encode", "NovaSearch/stella_en_400M_v5"),
    "rerank": ("score", "BAAI/bge-reranker-v2-m3"),
    "extract": ("extract", "urchade/gliner_multi-v2.1"),
}

def run_task(task, *args, **kwargs):
    # Look up the client method and model for this task, then call it
    op, model = TASK_MODELS[task]
    return getattr(client, op)(model, *args, **kwargs)
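
To see the dispatch work without a running server, here is a minimal sketch that swaps the real client for a stub. `StubClient` is a hypothetical stand-in invented for this example and is not part of the SIE SDK; it only records which method and model each task resolved to. This variant of `run_task` also takes the client as an argument and raises a clear error for unknown tasks, which is one possible design choice rather than something the table requires.

```python
class StubClient:
    """Hypothetical stand-in for the real client: records calls, runs no models."""

    def __init__(self):
        self.calls = []

    def _record(self, op, model):
        self.calls.append((op, model))
        return f"{op}:{model}"

    # Same method names as the three operations used in this post.
    def encode(self, model, *args, **kwargs):
        return self._record("encode", model)

    def score(self, model, *args, **kwargs):
        return self._record("score", model)

    def extract(self, model, *args, **kwargs):
        return self._record("extract", model)


TASK_MODELS = {
    "retrieve": ("encode", "NovaSearch/stella_en_400M_v5"),
    "rerank": ("score", "BAAI/bge-reranker-v2-m3"),
    "extract": ("extract", "urchade/gliner_multi-v2.1"),
}


def run_task(client, task, *args, **kwargs):
    """Resolve a task name to (method, model) and call it on the client."""
    try:
        op, model = TASK_MODELS[task]
    except KeyError:
        known = sorted(TASK_MODELS)
        raise ValueError(f"unknown task {task!r}; expected one of {known}") from None
    return getattr(client, op)(model, *args, **kwargs)


stub = StubClient()
run_task(stub, "retrieve", "Berlin office revenue")
run_task(stub, "rerank", "Berlin office revenue", ["doc a", "doc b"])
print(stub.calls)
```

Because the table is the only place a model name appears, a test like this can check routing decisions without touching a GPU.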

What changes when this goes to production?

Almost nothing in your code. In a Kubernetes deployment, a stateless Rust gateway sits in front of the worker pods, resolves the model, bundle, profile, and pool for each request, and publishes the work to a NATS JetStream queue. You still name a model per call; the gateway handles placement, queueing, and load balancing. Development and production follow the same pattern.
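
As a rough mental picture of that flow, the toy sketch below resolves a model name to a placement and puts the work on a queue. It is an illustration only, not SIE's gateway: the real one is written in Rust and publishes to NATS JetStream, while this uses Python's in-process `queue.Queue` as a stand-in. The `Placement` fields mirror the four things named above, but the table, the placeholder values, and `handle_request` are all assumptions made up for this example.

```python
import json
import queue
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Placement:
    model: str
    bundle: str
    profile: str
    pool: str


# Hypothetical routing table; every value here is a placeholder.
PLACEMENTS = {
    "BAAI/bge-reranker-v2-m3": Placement(
        model="BAAI/bge-reranker-v2-m3",
        bundle="example-bundle",
        profile="default",
        pool="gpu-pool-a",
    ),
}

# In-process stand-in for a durable message queue.
work_queue = queue.Queue()


def handle_request(op, model, payload):
    """Resolve placement from the model name, then enqueue the work.

    The handler keeps no per-request state of its own: everything a
    worker needs travels inside the queued message.
    """
    placement = PLACEMENTS.get(model)
    if placement is None:
        raise LookupError(f"no placement configured for model {model!r}")
    message = {"op": op, "payload": payload, **asdict(placement)}
    work_queue.put(json.dumps(message))
    return placement.pool


pool = handle_request(
    "score",
    "BAAI/bge-reranker-v2-m3",
    {"query": "Berlin office revenue"},
)
print(pool)
```

The caller's side of this is unchanged from development: it supplies an operation and a model name, and everything about where the work runs is decided behind the endpoint.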

Clone it, start the server, and route your first three tasks through one endpoint: github.com/superlinked/sie. If it saves you a deployment, the star button is right there.
