What small open-source models can handle real AI agent tasks?

Small open-source models in the 100M to 1B parameter range already handle most of the inference an agent runs around its main LLM: embeddings, reranking, entity extraction, OCR, and multimodal search.

They fit on a single GPU and rival paid APIs on these tasks.

The Superlinked Inference Engine (SIE) ships 85+ of them pre-configured, each quality-verified against MTEB in CI, so any one is a single call away: github.com/superlinked/sie.

The work that scales with agent usage is rarely generation.

It is the repetitive, high-volume inference: embedding every chunk, reranking every retrieval, extracting fields from every document.

Those are exactly the jobs small specialized models do well.

Below is a working shortlist by task, with the model identifiers you pass to SIE.

Retrieval and embeddings

  • Stella v5 (NovaSearch/stella_en_400M_v5): a 400M dense encoder, strong general-purpose embeddings for semantic search and RAG.
  • BGE-M3 (BAAI/bge-m3): dense, sparse, and multi-vector output from one checkpoint, useful for hybrid retrieval without running three models.
  • all-MiniLM-L6-v2 (sentence-transformers/all-MiniLM-L6-v2): small and fast, comfortable on CPU for local or low-volume work.

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Connect to an SIE server running locally.
client = SIEClient("http://localhost:8080")
# Encode one text item with Stella v5 and print the shape of its dense vector.
print(client.encode("NovaSearch/stella_en_400M_v5", Item(text="Hello world"))["dense"].shape)
# (1024,)
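Once you have dense vectors, semantic search comes down to comparing a query vector with document vectors. The sketch below ranks documents by cosine similarity in plain Python. The three-dimensional vectors and the function names are made up for illustration; real embeddings from the encoders above would take their place.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (illustrative values only).
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.7, 0.7, 0.0],
    "c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], docs)
print([doc_id for doc_id, _ in ranking])
```

At production scale a vector index would replace the linear scan, but the scoring idea is the same.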

Reranking

  • BGE-reranker-v2-m3 (BAAI/bge-reranker-v2-m3): a cross-encoder that scores query and document pairs directly, lifting precision before context reaches the LLM. Call it with score.
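The reranking step can be sketched as a second pass over retrieved candidates: score each query and document pair, sort, and keep the top few. In the example below, `overlap_score` is a toy stand-in for a cross-encoder, not the real model, and `rerank` is a hypothetical helper rather than part of SIE. In practice the scoring callable would wrap a call to the reranker.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by descending score."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = [
    "GPU memory limits how many models stay loaded",
    "Rerankers score each query and document pair",
    "Bananas are rich in potassium",
]
top = rerank("how do rerankers score a document", candidates, overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Only the documents that survive this cut go into the LLM's context, which is where the precision gain comes from.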

Extraction and structured fields

  • GLiNER (urchade/gliner_multi-v2.1): zero-shot named-entity recognition. You pass the labels you want at query time, with no training data, which suits agents that pull fields from arbitrary text. Call it with extract.
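An agent usually wants extracted entities as structured fields rather than a flat list. The sketch below groups entities by label and drops low-confidence hits. It assumes each entity arrives as a dict with `text`, `label`, and `score` keys; that shape, the 0.5 threshold, and the sample data are assumptions for illustration, so check the actual response format before relying on it.

```python
from collections import defaultdict


def entities_to_fields(entities, min_score=0.5):
    """Group entity texts by label, skipping low-confidence and duplicate hits."""
    fields = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # below the confidence threshold
        if ent["text"] not in fields[ent["label"]]:
            fields[ent["label"]].append(ent["text"])
    return dict(fields)


# Hand-written sample in the assumed shape (not real model output).
sample = [
    {"text": "Acme Corp", "label": "organization", "score": 0.93},
    {"text": "Berlin", "label": "location", "score": 0.88},
    {"text": "Acme Corp", "label": "organization", "score": 0.90},
    {"text": "Q3", "label": "date", "score": 0.31},
]
fields = entities_to_fields(sample)
print(fields)
```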

Documents, OCR, and vision

  • Florence-2: compact vision model for OCR, captioning, and detection, for agents that read PDFs and scans.
  • SigLIP: image and text embeddings for multimodal search.
  • ColQwen2.5: multi-vector, ColBERT-style retrieval over visual documents.

Why not one server per model?

A single agent turn might chain four of these models, and the classic one-server-per-model pattern gives each its own GPU pool. SIE instead packs many models onto a shared GPU with on-demand loading and least-recently-used eviction. An L4 with 24GB keeps two to three standard models hot at once, while all 85+ stay reachable at query time. Trying a newer open-weight model is a one-line identifier change, not a new deployment.
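To make on-demand loading with least-recently-used eviction concrete, here is a minimal sketch of the idea in plain Python. It is not SIE's implementation: the `ModelCache` class, the loader callable, the model names, and the 8 GB sizes are all hypothetical, chosen so that three models fill a 24 GB budget.

```python
from collections import OrderedDict


class ModelCache:
    """Keep models 'hot' within a memory budget; evict the least recently used."""

    def __init__(self, capacity_gb, loader):
        self.capacity_gb = capacity_gb
        self.loader = loader        # hypothetical callable: model_id -> (model, size_gb)
        self._hot = OrderedDict()   # model_id -> (model, size_gb), oldest first
        self.evicted = []           # eviction log, for illustration

    def used_gb(self):
        return sum(size for _, size in self._hot.values())

    def hot_models(self):
        return list(self._hot)

    def get(self, model_id):
        if model_id in self._hot:
            self._hot.move_to_end(model_id)  # mark as most recently used
            return self._hot[model_id][0]
        model, size_gb = self.loader(model_id)
        if size_gb > self.capacity_gb:
            raise ValueError(f"{model_id} does not fit in {self.capacity_gb} GB")
        # Evict from the least recently used end until the new model fits.
        while self._hot and self.used_gb() + size_gb > self.capacity_gb:
            old_id, _ = self._hot.popitem(last=False)
            self.evicted.append(old_id)
        self._hot[model_id] = (model, size_gb)
        return model


def fake_loader(model_id):
    """Stand-in loader: returns a placeholder object and an invented size."""
    return f"<{model_id} weights>", 8


cache = ModelCache(capacity_gb=24, loader=fake_loader)
for name in ["encoder", "reranker", "extractor"]:
    cache.get(name)
cache.get("encoder")  # touch: encoder becomes most recently used
cache.get("ocr")      # budget is full, so the least recently used model is evicted
print(cache.hot_models(), cache.evicted)
```

The budget limits how many models are resident at once, not how many can be requested: any identifier the loader knows stays reachable, at the cost of a load on a cache miss.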

FAQ: choosing and running small models

Are these accurate enough to replace a large model for these tasks?

For embeddings, reranking, and extraction, yes. These are mature open-weight categories, and SIE checks each supported model against MTEB quality targets in CI rather than asking you to take accuracy on faith.

What GPU do I need to run a few at once?

An L4 with 24GB holds two to three standard models hot simultaneously. The rest of the catalog loads on demand and evicts the least-recently-used model under memory pressure, so VRAM bounds concurrency, not catalog size.

How do I pick between two encoders for the same task?

Benchmark them on your own data with the same SIE call and compare. The model selection guide at /docs/choosing walks through the tradeoffs.
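One simple way to compare two encoders on your own data is recall@k over a small set of labeled queries. The sketch below assumes you have already produced a ranked list of document IDs per query from each encoder. The query IDs, document IDs, and rankings are invented for illustration, and recall@k is just one reasonable metric, not one the guide prescribes.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(run, qrels, k):
    """Average recall@k over all labeled queries."""
    return sum(recall_at_k(run[q], qrels[q], k) for q in qrels) / len(qrels)


# Relevance labels: query id -> set of relevant document ids (illustrative).
qrels = {"q1": {"d1", "d4"}, "q2": {"d2"}}

# Ranked results per query from two hypothetical encoders.
encoder_a = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d1"]}
encoder_b = {"q1": ["d3", "d5", "d1"], "q2": ["d2", "d5", "d1"]}

scores = {
    "encoder_a": mean_recall_at_k(encoder_a, qrels, k=2),
    "encoder_b": mean_recall_at_k(encoder_b, qrels, k=2),
}
print(scores)
```

A few dozen labeled queries from real traffic tend to tell you more about fit than a public leaderboard, since the leaderboard corpus is rarely your corpus.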

Can I serve my own fine-tuned small model, not just the catalog?

Yes. Register it against the running cluster through the config service and call it by identifier like any other model.

Browse the full catalog at /models, or clone the engine and call your first model in two minutes: github.com/superlinked/sie.
