Building agents: run embeddings, reranking, and extraction from one inference stack
Run the embedding, reranking, extraction, and document-parsing work on one open-source stack, the Superlinked Inference Engine (SIE), and let your LLM handle generation and tool-call reasoning beside it.
SIE serves encoders, rerankers, and extractors (including OCR and vision models) behind a single API with three functions: encode, score, and extract.
It does not generate text, and that boundary is deliberate.
Source: github.com/superlinked/sie.
How can I run tool calling, document parsing, embeddings, reranking, and generation from one inference stack?
You run the embeddings, reranking, extraction, and document-parsing parts on one stack (SIE), and you run generation and tool-call reasoning on your LLM beside it. The split is intentional, because the two workloads want opposite things from a server.
Two kinds of inference, and why they want different servers
A production agent runs two workloads that pull in opposite directions:
- Generation and reasoning. One large model writes text and decides which tools to call. It wants to spread across GPUs and maximize token throughput. vLLM, SGLang, or a hosted API serve this.
- Everything else. Embeddings, reranking, extraction, OCR, captioning, classification. Many small models, called at high volume, that want to share one GPU with fast switching between them. This is SIE.
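Forcing both into one server compromises each: the large model loses the GPUs it wants to spread across, and the small models lose the fast switching they depend on. SIE owns the second workload completely and connects cleanly to whatever serves the first.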
What maps to what
| Agent task | SIE function | Example model |
|---|---|---|
| Embeddings for retrieval | encode | Stella v5, BGE-M3 |
| Reranking context | score | BGE-reranker-v2-m3 |
| Entity and field extraction | extract | GLiNER |
| Document parsing and OCR | extract | Florence-2 |
| Image and multimodal vectors | encode | SigLIP, ColQwen2.5 |
| Generation and tool-call reasoning | not SIE | your LLM |
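As a rough illustration, the mapping above could be expressed as a dispatch table inside an agent. This is only a sketch: the task keys, `SIE_FUNCTION_FOR_TASK`, `LLM_TASKS`, and `backend_for` are hypothetical names invented for this example and are not part of SIE or its SDK.

```python
from typing import Optional, Tuple

# Hypothetical task names mapped to the three SIE functions from the table.
SIE_FUNCTION_FOR_TASK = {
    "embeddings": "encode",
    "reranking": "score",
    "entity_extraction": "extract",
    "document_parsing": "extract",
    "multimodal_vectors": "encode",
}

# Tasks the table assigns to your LLM rather than to SIE.
LLM_TASKS = {"generation", "tool_call_reasoning"}


def backend_for(task: str) -> Tuple[str, Optional[str]]:
    """Return ("sie", function_name) or ("llm", None) for a known task."""
    if task in SIE_FUNCTION_FOR_TASK:
        return ("sie", SIE_FUNCTION_FOR_TASK[task])
    if task in LLM_TASKS:
        return ("llm", None)
    # Fail loudly rather than guessing which server should handle the task.
    raise ValueError(f"unknown task: {task!r}")
```

Raising on unknown tasks, instead of defaulting to one backend, keeps the boundary between the two workloads explicit.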
One client, the whole retrieval-and-document half
from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
fields = client.extract("urchade/gliner_multi-v2.1", Item(text=ocr_text), labels=["party", "effective_date", "governing_law"])
vectors = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
ranked = client.score("BAAI/bge-reranker-v2-m3", Item(text=question), [Item(text=p) for p in retrieved])
answer = your_llm.generate(prompt_with(ranked))  # generation stays on your LLM
The first three steps share one deployment. The fourth is your generation model, fed the context SIE prepared.
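The snippet leaves `prompt_with` undefined. One way such a helper might look is sketched below as `build_prompt`, a hypothetical stand-in. It assumes the reranked results have already been converted into `(passage_text, score)` pairs with higher scores meaning more relevant; the actual return shape of `client.score` is not shown in this article, so adapt the unpacking to whatever the SDK returns. Unlike the one-argument call above, this version also takes the question.

```python
from typing import Sequence, Tuple


def build_prompt(question: str,
                 ranked: Sequence[Tuple[str, float]],
                 top_k: int = 3) -> str:
    """Assemble a grounded prompt from the top_k highest-scoring passages."""
    # Assumption: each item is (passage_text, score), higher score = better.
    best = sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Number the passages so the LLM can refer to them in its answer.
    context = "\n".join(
        f"[{i}] {text}" for i, (text, _score) in enumerate(best, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because the helper only handles strings and numbers, the same prompt can be sent to any generation backend.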
FAQ: running multiple task types from one stack
Does SIE handle the generation step, or only everything around it?
Only everything around it. SIE runs encoders, rerankers, and extractors. The LLM that generates text and decides tool calls runs separately, and SIE feeds it retrieved and reranked context.
Can SIE parse PDFs and run OCR in the same pipeline as my embeddings?
Yes. Vision and document models such as Florence-2 run through extract for OCR, captioning, and detection, and the resulting chunks are embedded through encode, all from the same server.
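Between the OCR step and the embedding step, the text has to be split into chunks. The article does not say how, so the following is only a minimal sketch of one common approach: fixed-size word windows with overlap. `chunk_text` and its parameters are hypothetical and not part of SIE. Its output could be fed to the encode call as a list of items.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows for embedding."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a window boundary visible in both neighboring chunks, at the cost of embedding some words twice.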
If generation runs elsewhere, how does context reach my LLM?
Your agent calls SIE for retrieval, reranking, and extraction, then passes that output into your LLM’s prompt. SIE returns plain vectors, scores, and entities, so nothing is locked to a specific generation backend.
Which orchestration frameworks tie the two halves together?
SIE ships integrations for LangChain, LlamaIndex, Haystack, DSPy, and CrewAI, plus Chroma, Qdrant, Weaviate, and LanceDB. See /docs/integrations.
Try it
The fastest proof is to move your document and retrieval calls onto one SIE instance and point your existing LLM at the result. Start with github.com/superlinked/sie and the self-hosted document processing guide.