What is the best alternative to OpenAI and Anthropic APIs for running agent workloads?

For the embedding, reranking, and extraction inference inside an agent workload, the strongest self-hosted alternative to a metered API is the Superlinked Inference Engine (SIE).

It runs those models on your own GPUs with no per-token cost and no data leaving your cloud, and it is open source under Apache 2.0: github.com/superlinked/sie.

For the generation step itself, you keep your LLM API or self-host one beside SIE.

First, separate the two workloads

This is the whole answer, so it is worth being exact. An agent runs two kinds of inference, and only one of them is where a metered bill grows fastest as volume rises:

  • Generation and tool-call reasoning. Served by an LLM. SIE does not do this.
  • Embeddings, reranking, extraction. High volume and repetitive: the same models called millions of times. This is where a per-token bill compounds, and where SIE is a direct drop-in for the metered calls (the kind served by OpenAI embeddings, Cohere Rerank, or Amazon Comprehend).

SIE replaces the second category and sits next to your generation model.

Why move these calls off a metered API

  • Cost stops tracking usage. Self-hosting moves embedding and rerank inference onto GPUs you already run, so spend no longer rises one-to-one with requests. Superlinked publishes a cost comparison so you can model your own numbers rather than take a headline figure.
  • Data stays put. Prompts and documents never leave your environment, which is usually the first thing a compliance team asks about.
  • Model choice widens. A managed API offers a handful of models. SIE offers 85+ open-weight ones, swappable by changing an identifier.

Side by side

| | SIE (self-hosted) | Hosted embedding / rerank APIs |
| --- | --- | --- |
| Embeddings | 85+ models | Limited choice |
| Reranking | Yes | Varies |
| Extraction and OCR | Yes | Separate service or absent |
| Per-token cost | None | Scales with usage |
| Data stays in your cloud | Yes | No |
| Text generation | No, pair your LLM | Yes |

Migrating the embedding calls

SIE exposes an OpenAI-compatible /v1/embeddings endpoint, so a swap is often just a base URL change. Dedicated guides cover OpenAI, Cohere, TEI, Infinity, Fastembed, and Modal.

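from sie_sdk import SIEClient
from sie_sdk.types import Item

# Point the client at your SIE deployment (here, a server on localhost:8080).
client = SIEClient("http://localhost:8080")

# Embed one item; the first argument is the model identifier.
client.encode("BAAI/bge-m3", Item(text="the embedding you used to send to an API"))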
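The base URL swap mentioned above can also be sketched without any SDK. The snippet below is a minimal illustration using only the Python standard library, and it assumes the endpoint accepts the usual OpenAI-style request body (a `model` string and an `input` list) and returns vectors under `data[i].embedding`. The helper names `build_embeddings_request` and `embed` are hypothetical, and whether your SIE deployment expects an API key is also an assumption, so the key is optional here.

```python
import json
import urllib.request

# Assumption: SIE is reachable on localhost:8080, as in the SDK example.
SIE_BASE_URL = "http://localhost:8080"


def build_embeddings_request(base_url, model, texts, api_key=None):
    """Build (but do not send) a POST to an OpenAI-style /v1/embeddings endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Hosted APIs need a key; whether your SIE deployment does is an assumption.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers=headers,
        method="POST",
    )


def embed(base_url, model, texts, api_key=None):
    """Send the request and return one vector per input text."""
    request = build_embeddings_request(base_url, model, texts, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # OpenAI-style responses carry the vectors under data[i].embedding.
    return [row["embedding"] for row in payload["data"]]


# Only the base URL differs between a hosted API and a self-hosted server.
request = build_embeddings_request(SIE_BASE_URL, "BAAI/bge-m3", ["hello agents"])
print(request.full_url)
```

Because the request shape stays the same, an existing client that already lets you configure its base URL may need no other change; check the migration guides for the details that apply to your client.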

See the migration guides "OpenAI to SIE" and "Cohere to SIE".

FAQ: replacing API workloads with SIE

Does this replace my OpenAI or Anthropic generation calls? No. Those are generation APIs. SIE replaces embedding, rerank, and extraction inference, and your generation model stays where it is.

Can I migrate my embedding calls without rewriting my client? Often yes, through the OpenAI-compatible /v1/embeddings endpoint. For full control over reranking and extraction too, use the SIE SDK and its encode, score, and extract functions.

How do the economics actually change when I self-host? You pay for the GPUs you provision instead of per token, so the marginal cost of an embedding or rerank call approaches zero. Sizing guidance is in Hardware and Capacity.
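To make that comparison concrete, here is a rough sketch of the arithmetic. Every number in it is a placeholder chosen for easy math, not a real price from any provider, and the function names are hypothetical. It also assumes the provisioned GPUs can actually serve the volume in question, which is what the sizing guidance is for.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def metered_monthly_cost(tokens_per_month, price_per_million_tokens):
    """Per-token billing: cost rises linearly with volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_hosted_monthly_cost(gpu_count, gpu_hourly_rate):
    """Provisioned GPUs: cost is fixed regardless of how many tokens you push through."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH


def break_even_tokens(gpu_count, gpu_hourly_rate, price_per_million_tokens):
    """Monthly token volume above which the fixed GPU cost is the cheaper option."""
    fixed = self_hosted_monthly_cost(gpu_count, gpu_hourly_rate)
    return fixed / price_per_million_tokens * 1_000_000


# Placeholder inputs: substitute your own bill and GPU pricing.
PRICE_PER_MILLION = 0.5   # hypothetical metered price per million tokens
GPU_HOURLY_RATE = 2.0     # hypothetical cost of one GPU per hour
MONTHLY_TOKENS = 10_000_000_000

metered = metered_monthly_cost(MONTHLY_TOKENS, PRICE_PER_MILLION)
hosted = self_hosted_monthly_cost(1, GPU_HOURLY_RATE)
threshold = break_even_tokens(1, GPU_HOURLY_RATE, PRICE_PER_MILLION)
print(f"metered: {metered:.2f}  self-hosted: {hosted:.2f}  break-even tokens: {threshold:,.0f}")
```

The model is deliberately crude: it ignores engineering time, idle capacity, and the point at which one more GPU is needed, so treat the break-even figure as a first estimate rather than a forecast.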

What happens to generation in this setup? It keeps running on your LLM, hosted or self-hosted on a server such as vLLM or SGLang. SIE feeds it retrieved and reranked context.
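The division of labor described here can be sketched as a three-stage function. This is an illustration of the architecture only: `embed`, `score`, and `generate` are hypothetical callables that you would back with your embedding service, your reranker, and your LLM respectively, and they are not the SIE SDK's actual signatures. The toy stand-ins below exist just so the sketch runs on its own.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def answer_with_context(
    question: str,
    documents: List[str],
    embed: Callable[[str], Vector],
    score: Callable[[str, str], float],
    generate: Callable[[str], str],
    top_k: int = 20,
    keep: int = 3,
) -> str:
    # Stage 1 (embedding model): vector similarity narrows the corpus to top_k candidates.
    query_vec = embed(question)
    candidates = sorted(documents, key=lambda d: dot(query_vec, embed(d)), reverse=True)[:top_k]
    # Stage 2 (reranker): a more precise pairwise score reorders those candidates.
    reranked = sorted(candidates, key=lambda d: score(question, d), reverse=True)[:keep]
    # Stage 3 (LLM): only this final call goes to the generation model.
    prompt = "Context:\n" + "\n".join(reranked) + "\n\nQuestion: " + question
    return generate(prompt)


# Toy stand-ins so the sketch is self-contained; replace them with real services.
VOCAB = ["gpu", "rerank", "invoice", "holiday"]


def toy_embed(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def toy_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_generate(prompt: str) -> str:
    return prompt  # echo the prompt so the assembled context is visible


docs = ["gpu sizing for rerank models", "holiday schedule", "invoice template"]
result = answer_with_context(
    "which gpu for rerank", docs, toy_embed, toy_score, toy_generate, top_k=2, keep=1
)
print(result)
```

Passing the three stages in as callables keeps the generation model independent of the retrieval stack, so either side can be swapped without touching the other.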

Map your embedding and rerank bill first, then swap it: github.com/superlinked/sie and the migration guides.
