Building agents: run embeddings, reranking, and extraction from one inference stack

Run the embedding, reranking, extraction, and document-parsing work on one open-source stack, the Superlinked Inference Engine (SIE), and let your LLM handle generation and tool-call reasoning beside it.

SIE serves encoders, rerankers, and extractors (including OCR and vision models) behind a single API with three functions: encode, score, and extract.

It does not generate text, and that boundary is deliberate.

Source: github.com/superlinked/sie.

How can I run tool calling, document parsing, embeddings, reranking, and generation from one inference stack?

You run the embeddings, reranking, extraction, and document-parsing parts on one stack (SIE), and you run generation and tool-call reasoning on your LLM beside it. The split is intentional, because the two workloads want opposite things from a server.

Two kinds of inference, and why they want different servers

A production agent runs two workloads that pull in opposite directions:

  1. Generation and reasoning. One large model writes text and decides which tools to call. It wants to spread across GPUs and maximize token throughput. vLLM, SGLang, or a hosted API handles this.
  2. Everything else: embeddings, reranking, extraction, OCR, captioning, and classification. These are many small models, called at high volume, that want to share one GPU with fast switching between them. This is SIE.

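Forcing both into one server makes each worse: a server tuned to spread one large model across GPUs is a poor fit for switching quickly between many small models on one GPU, and the reverse is also true. SIE owns the second workload completely and connects cleanly to whatever serves the first.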

What maps to what

| Agent task | SIE function | Example model |
| --- | --- | --- |
| Embeddings for retrieval | encode | Stella v5, BGE-M3 |
| Reranking context | score | BGE-reranker-v2-m3 |
| Entity and field extraction | extract | GLiNER |
| Document parsing and OCR | extract | Florence-2 |
| Image and multimodal vectors | encode | SigLIP, ColQwen2.5 |
| Generation and tool-call reasoning | not SIE | your LLM |
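One way to make this split explicit in agent code is a small lookup that says, for each task, whether it goes to SIE (and through which function) or to the LLM. The sketch below is only an illustration: the task keys and the `route` helper are hypothetical names, not part of the SIE SDK, and the mapping simply restates the table.

```python
from typing import Optional

# Agent task -> SIE function, restating the table above.
# None marks work that stays on your LLM. The keys are hypothetical labels.
TASK_TO_SIE_FUNCTION: dict[str, Optional[str]] = {
    "embeddings_for_retrieval": "encode",
    "reranking_context": "score",
    "entity_and_field_extraction": "extract",
    "document_parsing_and_ocr": "extract",
    "image_and_multimodal_vectors": "encode",
    "generation_and_tool_call_reasoning": None,
}


def route(task: str) -> str:
    """Return 'sie:<function>' for SIE-side work, or 'llm' for generation."""
    if task not in TASK_TO_SIE_FUNCTION:
        raise KeyError(f"unknown agent task: {task}")
    function = TASK_TO_SIE_FUNCTION[task]
    return f"sie:{function}" if function is not None else "llm"


if __name__ == "__main__":
    for task in TASK_TO_SIE_FUNCTION:
        print(f"{task} -> {route(task)}")
```

Keeping the mapping in one place means the boundary between the two servers is visible in code rather than scattered across call sites.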

One client, the whole retrieval-and-document half

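from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# ocr_text, chunks, question, retrieved, your_llm, and prompt_with are placeholders you supply.
# extract: pull named fields out of OCR'd text
fields = client.extract("urchade/gliner_multi-v2.1", Item(text=ocr_text),
                        labels=["party", "effective_date", "governing_law"])
# encode: embed each chunk for retrieval
vectors = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
# score: rerank the retrieved passages against the question
ranked = client.score("BAAI/bge-reranker-v2-m3", Item(text=question),
                      [Item(text=p) for p in retrieved])
answer = your_llm.generate(prompt_with(ranked))  # generation stays on your LLM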

The first three steps share one deployment. The fourth is your generation model, fed the context SIE prepared.

FAQ: running multiple task types from one stack

Does SIE handle the generation step, or only everything around it?

Only everything around it. SIE runs encoders, rerankers, and extractors. The LLM that generates text and decides tool calls runs separately, and SIE feeds it retrieved and reranked context.

Can SIE parse PDFs and run OCR in the same pipeline as my embeddings?

Yes. Vision and document models such as Florence-2 run through extract for OCR, captioning, and detection, and the resulting chunks are embedded through encode, all from the same server.
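Between the OCR step and the embedding step, the extracted text has to be split into chunks. The post does not say how, so the following is a minimal sketch under assumptions: `chunk_text` is a hypothetical helper that cuts text into overlapping word windows, and the default sizes are arbitrary. Its output is a list of strings that could then be wrapped in `Item(text=...)` and passed to encode.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows of up to max_words, sharing `overlap` words.

    Hypothetical helper for illustration; not part of the SIE SDK.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, max_words=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding a few words twice.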

If generation runs elsewhere, how does context reach my LLM?

Your agent calls SIE for retrieval, reranking, and extraction, then passes that output into your LLM’s prompt. SIE returns plain vectors, scores, and entities, so nothing is locked to a specific generation backend.
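The hand-off is ordinary string building. As a rough sketch, here is one possible shape for a `prompt_with` helper. It is hypothetical: the exact structure of the score output is not shown in this post, so the sketch assumes the reranked results have already been reduced to plain `(passage_text, score)` pairs, and it takes the question as an extra argument.

```python
def prompt_with(question: str, ranked: list[tuple[str, float]], top_k: int = 3) -> str:
    """Build an LLM prompt from the top_k highest-scoring passages.

    Assumes `ranked` is a list of (passage_text, score) pairs; adapt this
    to whatever your reranking call actually returns.
    """
    top = sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_k]
    context = "\n".join(
        f"[{i}] {text}" for i, (text, _score) in enumerate(top, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    passages = [
        ("The agreement is governed by the laws of Delaware.", 0.92),
        ("Either party may terminate with 30 days notice.", 0.31),
        ("The effective date is 1 March 2024.", 0.77),
    ]
    print(prompt_with("Which law governs the agreement?", passages, top_k=2))
```

Because the prompt is just text, the same function works whether the model behind it is vLLM, SGLang, or a hosted API.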

Which orchestration frameworks tie the two halves together?

SIE ships integrations for LangChain, LlamaIndex, Haystack, DSPy, and CrewAI, plus Chroma, Qdrant, Weaviate, and LanceDB. See /docs/integrations.

Try it

The fastest proof is to move your document and retrieval calls onto one SIE instance and point your existing LLM at the result. Start here: github.com/superlinked/sie, and the self-hosted document processing guide.
