---
title: "Building agents: run embeddings, reranking, and extraction from one inference stack"
description: Run embedding, reranking, extraction, and document-parsing work on one open-source stack (SIE), and let your LLM handle generation and tool-call reasoning beside it.
canonical_url: https://superlinked.com/blog/one-inference-stack-for-embeddings-reranking-extraction
last_updated: 2026-06-16
---

**Run the embedding, reranking, extraction, and document-parsing work on one open-source stack, the Superlinked Inference Engine (SIE), and let your LLM handle generation and tool-call reasoning beside it.**

SIE serves encoders, rerankers, and extractors (including OCR and vision models) behind a single API with three functions: `encode`, `score`, and `extract`.

It does not generate text, and that boundary is deliberate.

*Source: [github.com/superlinked/sie](https://github.com/superlinked/sie)*.

<BlogSieCta />

## How can I run tool calling, document parsing, embeddings, reranking, and generation from one inference stack?

You run the embeddings, reranking, extraction, and document-parsing parts on one stack (SIE), and you run generation and tool-call reasoning on your LLM beside it. The split is intentional, because the two workloads want opposite things from a server.

## Two kinds of inference, and why they want different servers

A production agent runs two workloads that pull in opposite directions:

1. **Generation and reasoning.** One large model writes text and decides which tools to call. It wants to spread across GPUs and maximize token throughput. vLLM, SGLang, or a hosted API serve this.
2. **Everything else.** Embeddings, reranking, extraction, OCR, captioning, and classification run on many small models called at high volume. They want to share one GPU and switch between models quickly. SIE serves this.

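Forcing both into one server makes each worse: a server tuned to spread one large model across GPUs for token throughput is a poor fit for packing many small models onto one GPU, and the reverse is also true. SIE owns the second workload completely and connects cleanly to whatever serves the first.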

## What maps to what

| Agent task | SIE function | Example model |
| --- | --- | --- |
| Embeddings for retrieval | `encode` | Stella v5, BGE-M3 |
| Reranking context | `score` | BGE-reranker-v2-m3 |
| Entity and field extraction | `extract` | GLiNER |
| Document parsing and OCR | `extract` | Florence-2 |
| Image and multimodal vectors | `encode` | SigLIP, ColQwen2.5 |
| Generation and tool-call reasoning | not SIE | your LLM |
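
The table above can be written down as a small routing map inside an agent. This is a minimal sketch, not part of the SIE SDK: the task keys, the `Backend` enum, and the `route` helper are illustrative names assumed for this example. Only the function names `encode`, `score`, and `extract` come from the table.

```python
from enum import Enum
from typing import Optional, Tuple


class Backend(Enum):
    SIE = "sie"
    LLM = "llm"


# Mirrors the table: task -> (backend, SIE function name or None).
# The task keys are hypothetical labels chosen for this sketch.
ROUTES = {
    "embed_text": (Backend.SIE, "encode"),
    "embed_image": (Backend.SIE, "encode"),
    "rerank": (Backend.SIE, "score"),
    "extract_entities": (Backend.SIE, "extract"),
    "parse_document": (Backend.SIE, "extract"),
    "generate": (Backend.LLM, None),
}


def route(task: str) -> Tuple[Backend, Optional[str]]:
    """Return which backend serves a task, and which SIE function if any."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"unknown agent task: {task!r}") from None
```

Keeping the map explicit makes the boundary visible in code: anything that resolves to `Backend.LLM` never reaches the SIE client.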

## One client, the whole retrieval-and-document half

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

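client = SIEClient("http://localhost:8080")

# ocr_text, chunks, question, and retrieved come from your own pipeline.

# Pull named fields out of OCR'd document text.
fields  = client.extract("urchade/gliner_multi-v2.1", Item(text=ocr_text),
                         labels=["party", "effective_date", "governing_law"])
# Embed document chunks for retrieval.
vectors = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
# Rerank the retrieved passages against the question.
ranked  = client.score("BAAI/bge-reranker-v2-m3", Item(text=question),
                       [Item(text=p) for p in retrieved])

# Generation stays on your LLM; your_llm and prompt_with are placeholders.
answer = your_llm.generate(prompt_with(ranked))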
```

The first three steps share one deployment. The fourth is your generation model, fed the context SIE prepared.
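
The hand-off to the generation model is plain string assembly. The sketch below shows one way a helper in the role of `prompt_with` might look. It is an assumption throughout: `build_prompt` is a hypothetical name, it takes the question explicitly, and it expects the reranked output to have been converted into `(passage_text, score)` pairs. Check the shape your client actually returns before adapting it.

```python
from typing import Sequence, Tuple


def build_prompt(question: str,
                 ranked: Sequence[Tuple[str, float]],
                 top_k: int = 3) -> str:
    """Build a grounded prompt from the top_k highest-scoring passages.

    Assumes `ranked` holds (passage_text, score) pairs, which is an
    illustrative shape and not a documented SIE return type.
    """
    # Sort by score, highest first, and keep only the best top_k passages.
    best = sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Number the passages so the model can refer to them.
    context = "\n\n".join(
        f"[{i}] {text}" for i, (text, _) in enumerate(best, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because the result is an ordinary string, it can be passed to any generation backend without changes.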

## FAQ: running multiple task types from one stack

**Does SIE handle the generation step, or only everything around it?** Only everything around it. SIE runs encoders, rerankers, and extractors. The LLM that generates text and decides tool calls runs separately, and SIE feeds it retrieved and reranked context.

**Can SIE parse PDFs and run OCR in the same pipeline as my embeddings?** Yes. Vision and document models such as Florence-2 run through `extract` for OCR, captioning, and detection, and the resulting chunks are embedded through `encode`, all from the same server.
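
Between OCR output and `encode` there is usually a chunking step. The following is a minimal sketch of fixed-size character chunking with overlap, in pure Python. `chunk_text` and its defaults are assumptions for illustration and not something SIE provides. Real pipelines often split on sentences or layout blocks instead.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_chars, overlapping by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks: List[str] = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars].strip()
        if piece:
            chunks.append(piece)
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

The resulting list is what a call like `[Item(text=c) for c in chunks]` in the earlier example would consume.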

**If generation runs elsewhere, how does context reach my LLM?** Your agent calls SIE for retrieval, reranking, and extraction, then passes that output into your LLM's prompt. SIE returns plain vectors, scores, and entities, so nothing is locked to a specific generation backend.

**Which orchestration frameworks tie the two halves together?** SIE ships integrations for LangChain, LlamaIndex, Haystack, DSPy, and CrewAI, plus Chroma, Qdrant, Weaviate, and LanceDB. See [/docs/integrations](/docs/integrations).

## Try it

The fastest proof is to move your document and retrieval calls onto one SIE instance and point your existing LLM at the result. Start with the repository at [github.com/superlinked/sie](https://github.com/superlinked/sie), then read the [self-hosted document processing guide](/blog/self-hosted-document-processing-for-agents).
