How to Route Different AI Agent Tasks to the Right Model

Short answer: Routing AI agent tasks to the right model means separating two layers. The reasoning LLM plans and decides what to do next. A set of smaller, specialized models does the actual retrieval and processing work: embeddings for search, cross-encoders for reranking, and extractors for pulling structured data out of text. Most routing complexity lives in that second layer. Serve every model behind one endpoint and select the right one per task by naming it in the call.

Below is how to think about each layer, how to map tasks to models, and the production pattern that keeps it manageable.

The two layers of routing inside an agent

When people ask how to route agent tasks to the right model, they usually picture one decision: which large language model answers the user. That matters, but it is only one layer, and rarely the expensive one.

Layer 1: the reasoning model. This is the LLM that plans steps, calls tools, and writes the final response. You route here on capability, latency, and cost: a frontier model for hard reasoning, a smaller or cheaper model for routine turns. Agent frameworks and LLM gateways already handle this selection well, and it usually happens a handful of times per conversation.

Layer 2: the retrieval and processing models. This is where the real model sprawl lives. A single agent turn can fire many inference calls that have nothing to do with the chat LLM: embedding a query for semantic search, embedding documents at ingestion, reranking candidates, extracting entities or filters from a request, parsing a PDF, or embedding an image. Each of these wants a different specialist model, and an agent may hit them dozens of times per task.

Layer 2 is where most teams get stuck, because the obvious approach is to stand up a separate server for each model. That is the part worth getting right.

Map each agent task to the right kind of model

Different tasks need fundamentally different model architectures. Picking the right family is most of the battle.

| Agent task | Model family | What it does | Example model |
| --- | --- | --- | --- |
| Semantic search | Dense embedding | Captures meaning for similarity search | NovaSearch/stella_en_400M_v5 |
| Keyword and hybrid search | Sparse embedding | Learned term importance, strong on names and codes | SPLADE v3, BGE-M3 (sparse) |
| Reranking | Cross-encoder | Scores query and document together for precision | BAAI/bge-reranker-v2-m3 |
| Entity and field extraction | Extractor (NER) | Zero-shot entities, relations, classification | urchade/gliner_multi-v2.1 |
| Document and vision parsing | Vision and OCR | Reads layout, images, and scanned pages | Florence-2, ColQwen2.5 |
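
The mapping above can live in code as a plain lookup, so the routing decision is data rather than infrastructure. This is a minimal sketch: the task names and the `model_for` helper are illustrative assumptions, and only the rows with a single full model identifier are included.

```python
# Hypothetical task names mapped to example model identifiers from the table.
TASK_MODELS = {
    "semantic_search": "NovaSearch/stella_en_400M_v5",
    "rerank": "BAAI/bge-reranker-v2-m3",
    "extract_entities": "urchade/gliner_multi-v2.1",
}


def model_for(task: str) -> str:
    """Return the model identifier configured for a task, or fail loudly."""
    try:
        return TASK_MODELS[task]
    except KeyError:
        known = ", ".join(sorted(TASK_MODELS))
        raise ValueError(
            f"no model configured for task {task!r}; known tasks: {known}"
        ) from None


print(model_for("rerank"))
```

Keeping the map in one place means testing a different model for a task is a one-line change to the dictionary.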

These are not interchangeable. A dense embedding model is great at fuzzy semantic matching but weak on exact names and part numbers, which is exactly where sparse models shine. Reranking with a cross-encoder catches relevance that first-stage retrieval misses. You can read the reasoning behind hybrid retrieval plus reranking in our hybrid search and reranking guide, and browse the full set of options in the model catalog.
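
One common way to combine dense and sparse candidates before reranking is reciprocal rank fusion (RRF). The article does not say which fusion method it has in mind, so treat RRF as an assumption here; the document IDs are made up and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # made-up IDs from a dense retriever
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # made-up IDs from a sparse retriever
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, and the fused shortlist is what you would hand to the cross-encoder.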

Why the naive approach breaks down

The default instinct is one model, one server. That works for a demo and falls apart in production:

  • Every model becomes its own deployment to build, secure, autoscale, and monitor.
  • GPUs sit idle because each server reserves memory for a model it may use once a minute.
  • Swapping a model, or testing a better one, becomes an infrastructure project instead of a parameter change.
  • Versions drift across services, and quality regressions slip through with no shared evaluation.

For an agent that touches five or six model types per turn, this is the bulk of the engineering cost, and none of it is the interesting part of your product.

The cleaner pattern: one endpoint, select the model per call

The pattern that scales is to put every retrieval and processing model behind a single inference endpoint, then route by naming the model you want in each call. This is what the Superlinked Inference Engine (SIE) is built for. SIE is an open-source, Apache 2.0 inference server that serves embeddings, reranking, and extraction through one unified API, handling 85+ models across dense, sparse, multi-vector, vision, and cross-encoder architectures.

The entire API is three functions: encode, score, and extract. Routing a task to the right model comes down to the model identifier you pass.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Route a search task to a dense embedding model
result = client.encode("NovaSearch/stella_en_400M_v5", Item(text="Hello world"))

# Route a reranking task to a cross-encoder
scores = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text="What is machine learning?"),
    [Item(text="ML learns from data."), Item(text="The weather is sunny.")],
)

# Route an extraction task to a zero-shot NER model
entities = client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text="Tim Cook is the CEO of Apple."),
    labels=["person", "organization"],
)

No new deployment per model, no glue service to maintain. The model name is the routing decision, and SIE loads models on demand and evicts the least recently used ones when GPU memory fills, so all 85+ models stay available at query time regardless of how much VRAM you have.
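
To make the load-on-demand and least-recently-used eviction idea concrete, here is a toy sketch. It is not SIE's implementation: the `LRUModelCache` class, the loader callable, and the model sizes are all hypothetical, and a real server would track actual GPU memory rather than declared sizes.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy model cache: load on demand, evict least recently used when full."""

    def __init__(self, capacity_gb, loader):
        self.capacity_gb = capacity_gb
        self.loader = loader         # callable: name -> (model, size_gb)
        self.models = OrderedDict()  # name -> (model, size_gb), oldest first
        self.used_gb = 0.0

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)  # mark as most recently used
            return self.models[name][0]
        model, size_gb = self.loader(name)
        # Evict from the least recently used end until the new model fits.
        while self.models and self.used_gb + size_gb > self.capacity_gb:
            _, (_, freed_gb) = self.models.popitem(last=False)
            self.used_gb -= freed_gb
        self.models[name] = (model, size_gb)
        self.used_gb += size_gb
        return model


# Hypothetical sizes in GB; a real loader would place weights on the GPU.
SIZES = {"embedder": 2.0, "reranker": 3.0, "extractor": 2.0}
cache = LRUModelCache(capacity_gb=5.0, loader=lambda n: (f"<{n}>", SIZES[n]))
cache.get("embedder")
cache.get("reranker")
cache.get("embedder")   # refreshes embedder, so reranker is now the oldest
cache.get("extractor")  # needs 2 GB: evicts reranker
print(list(cache.models))
```

The trade-off in any scheme like this is that the first call to an evicted model pays the load time again, so frequently used models benefit from staying resident.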

Try it: Start the server and run the three primitives in a few minutes. See the SIE repository on GitHub and the quickstart.

Worked example: routing within a single agent turn

Here is what task routing looks like inside one retrieval step of an agent. The agent extracts structured filters from a natural language request, embeds the query for search, hits a vector database, then reranks the candidates for precision. Three different model types, one endpoint.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")
query = "quiet two-bedroom flat near a park under 2000 a month"

# 1. Extraction model pulls structured filters out of the request
filters = client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=query),
    labels=["bedrooms", "price", "location_feature"],
)

# 2. Dense embedding model encodes the query for semantic search
query_vec = client.encode("NovaSearch/stella_en_400M_v5", Item(text=query))

# 3. Your vector database returns first-stage candidates (pseudo-code)
candidates = vector_db.search(query_vec["dense"], limit=100)

# 4. Cross-encoder reranks the shortlist for final precision
ranked = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=query),
    [Item(text=c.text) for c in candidates],
)

Each call routes to a model purpose-built for that task, and the agent’s reasoning LLM only sees the clean, ranked result. For a complete build along these lines, including filters and multiple agent tools, see our walkthrough on building an agentic natural language search system.

Build on it: Clone the end-to-end examples and notebooks from the SIE repo.

Scaling the pattern in production

The same code that runs on a laptop runs against a production cluster. SIE ships the full stack rather than just the server: a load-balancing gateway, KEDA autoscaling that can scale to zero, Grafana dashboards, and Terraform modules for GKE and EKS. Every one of the 85+ models is quality-verified against MTEB in continuous integration, so swapping in a newer model is a parameter change you can trust rather than a regression risk.

It also drops into the tools your agents already use, with native integrations for LangChain, LlamaIndex, Haystack, DSPy, CrewAI, Chroma, Qdrant, Weaviate, and LanceDB. For migrations, there is an OpenAI-compatible /v1/embeddings endpoint, so existing embedding code keeps working while you move models in-house.
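
For the migration path, an OpenAI-style embeddings call is a POST with a `model` and an `input` field. The sketch below only builds the request so it can be inspected without a running server. It assumes the endpoint is served from the same host and port as the earlier examples and that no auth header is needed, neither of which is confirmed here.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build (but do not send) an OpenAI-style embeddings request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embeddings_request(
    "http://localhost:8080",  # assumed: same host and port as earlier examples
    "NovaSearch/stella_en_400M_v5",
    ["quiet two-bedroom flat near a park"],
)
print(request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     vectors = [row["embedding"] for row in json.load(response)["data"]]
```

Existing code that already uses an OpenAI client library would typically only need its base URL and model name changed.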

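Installing the cluster with Helm looks like this (choose the GKE or AWS values file and substitute your own token):

helm upgrade --install sie-cluster oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie --create-namespace \
  --set hfToken.create=true \
  --set hfToken.value=<TOKEN> \
  -f deploy/helm/sie-cluster/values-{gke|aws}.yaml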

The cost and control angle

Routing every small-model task through one self-hosted endpoint also changes the economics. Embeddings, reranking, and extraction drive a large share of an inference bill, and running them as small models on your own infrastructure removes the per-token charges and keeps data inside your own cloud. Superlinked reports up to 50x lower inference cost versus managed model APIs for these workloads; see the homepage and blog for the cost breakdown and assumptions.

Star the repo: If self-hosted, multi-model inference fits your stack, SIE is on GitHub under Apache 2.0.

Frequently asked questions

Does every agent task need a different model? The reasoning model can stay shared. Retrieval and processing tasks each need a different model architecture. Semantic search, keyword search, reranking, and extraction have model families built for them, and using the wrong one costs you precision. Matching the task to the model family is the core of good routing.

What is the difference between LLM routing and inference routing? LLM routing chooses which large language model handles reasoning, usually a few times per conversation. Inference routing chooses which small specialist model handles each retrieval or processing step, which can happen dozens of times per turn. The second layer is where most of the operational load sits.

How do I choose between dense, sparse, and hybrid search for an agent? Use dense embeddings when meaning matters more than exact wording, such as paraphrased or conceptual queries. Use sparse embeddings when exact terms matter, such as names, product codes, or acronyms that get blurred in a dense vector. Most agents do best with hybrid: run both and fuse the candidates, then rerank. The task decides the model, not the other way around.

When should an agent rerank results instead of just retrieving more? Rerank when precision at the top of the list matters and first-stage retrieval is returning roughly relevant but poorly ordered candidates. A cross-encoder scores the query and each document together, so it catches relevance that similarity search alone misses. Retrieving more documents widens the net but does not reorder it; for agents that act on the top few results, reranking is usually the higher-impact step.

Can a single embedding model handle every task in my agent? Rarely. One model can cover several search tasks, but reranking needs a cross-encoder and extraction needs an entity model, which are different architectures entirely. Forcing one model across all of them costs precision on the tasks it was not built for. The practical approach is a small set of specialist models, each selected per task.

How many model calls does a single agent turn actually involve? More than most teams expect. A single retrieval step can embed the query, search a vector store, rerank the shortlist, and extract structured filters, before the reasoning model ever runs. Those calls repeat across turns, which is why the per-task model selection in the retrieval layer, not the choice of chat model, tends to dominate inference load.

Get started

Routing agent tasks well comes down to one habit: match each task to a specialist model, and serve all of those models from one place so selecting between them is a single line of code rather than a deployment.
