---
title: "SIE vs hosted embedding APIs: ~97% of the quality at ~1/12th the cost"
description: We benchmarked self-hosted embedding on SIE against Voyage, OpenAI, and Cohere across quality, latency, throughput, and cost. Self-hosting lands within a few hundredths of ndcg of the hosted frontier, returns embeddings 3 to 7 times faster, and embeds a billion tokens for around $10.
canonical_url: https://superlinked.com/blog/sie-vs-hosted-embedding-apis
last_updated: 2026-06-25
---

If you are embedding at scale, you are probably paying a hosted API per token and quietly accepting whatever latency and rate-limit ceiling comes with your usage tier. We wanted to know what you give up by running the embedding models yourself on [SIE](https://github.com/superlinked/sie) instead.

So we ran them head to head: SIE's open embedding models against the hosted frontier (Voyage, OpenAI, and Cohere) across quality, single-request latency, sustained throughput, and cost. The short version is that self-hosting on SIE lands within a few thousandths to a few hundredths of ndcg of the best hosted API, returns embeddings 3 to 7 times faster, never hits a rate-limit wall, and embeds a billion tokens for around \$10 instead of \$120 to \$130.

<BlogSieCta />

Here are the numbers, all drawn from a closed, reproducible dataset.

## Is self-hosted embedding quality actually competitive?

Yes. Across eight standard MTEB retrieval tasks, SIE's best valid model per task stays within `0.006` to `0.050` ndcg@10 of the hosted frontier, and beats it outright on two of the eight.

The SIE models in this pass are `NovaSearch/stella_en_1.5B_v5` and `Qwen/Qwen3-Embedding-4B`, both open weights. The hosted frontier is `voyage-4-large`, plus `cohere/embed-v4.0` on the Legal and CosQA tasks where it is strongest.

| Task | Best SIE model | SIE | Best hosted | Hosted | Δ (SIE − hosted) |
| --- | --- | ---: | --- | ---: | ---: |
| NFCorpus | `stella_en_1.5B_v5` | `0.4215` | `voyage-4-large` | `0.4279` | `-0.006` |
| SciFact | `stella_en_1.5B_v5` | `0.8039` | `voyage-4-large` | `0.8180` | `-0.014` |
| FiQA2018 | `stella_en_1.5B_v5` | `0.5961` | `voyage-4-large` | `0.6309` | `-0.035` |
| LegalBench ConsumerContractsQA | `Qwen3-Embedding-4B` | `0.8152` | `cohere/embed-v4.0` | `0.8654` | `-0.050` |
| CosQA | `Qwen3-Embedding-4B` | **`0.4006`** | `cohere/embed-v4.0` | `0.3728` | **`+0.028`** |
| StackOverflowQA | `Qwen3-Embedding-4B` | `0.9336` | `voyage-4-large` | `0.9797` | `-0.046` |
| SCIDOCS | `Qwen3-Embedding-4B` | **`0.2992`** | `voyage-4-large` | `0.2539` | **`+0.045`** |
| CQADupstack Physics | `Qwen3-Embedding-4B` | `0.5264` | `voyage-4-large` | `0.5700` | `-0.044` |

Read it straight: this is competitive, not leading. SIE wins on CosQA and SCIDOCS and trails by small margins everywhere else. Averaged across the eight tasks, the SIE frontier holds a mean ndcg@10 of `0.600` against the hosted frontier's `0.615`, which is `97.5%` of the quality.
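
The averages are easy to recheck from the table. The sketch below copies the eight ndcg@10 pairs from the rows above and recomputes the two means, their ratio, and the tasks where SIE comes out ahead. The names `SCORES` and `summarize` are illustrative, not part of SIE.

```python
# ndcg@10 per task, copied from the table above as (SIE, hosted).
SCORES = {
    "NFCorpus": (0.4215, 0.4279),
    "SciFact": (0.8039, 0.8180),
    "FiQA2018": (0.5961, 0.6309),
    "LegalBench ConsumerContractsQA": (0.8152, 0.8654),
    "CosQA": (0.4006, 0.3728),
    "StackOverflowQA": (0.9336, 0.9797),
    "SCIDOCS": (0.2992, 0.2539),
    "CQADupstack Physics": (0.5264, 0.5700),
}


def summarize(scores):
    """Return (mean SIE, mean hosted, SIE/hosted ratio, tasks where SIE wins)."""
    sie = [s for s, _ in scores.values()]
    hosted = [h for _, h in scores.values()]
    mean_sie = sum(sie) / len(sie)
    mean_hosted = sum(hosted) / len(hosted)
    wins = [task for task, (s, h) in scores.items() if s > h]
    return mean_sie, mean_hosted, mean_sie / mean_hosted, wins


mean_sie, mean_hosted, ratio, wins = summarize(SCORES)
print(f"SIE {mean_sie:.3f} vs hosted {mean_hosted:.3f} ({ratio:.1%}); SIE wins: {wins}")
```

This is an unweighted mean over tasks, so a large benchmark and a small one count equally.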

The reason that matters is the price tag attached to those last few hundredths.

## How much cheaper is it, really?

About twelve times cheaper per unit of quality. On NFCorpus, SIE delivers `41.3` points of ndcg per dollar against Voyage's `3.6`. The quality gap is a rounding error; the cost gap is an order of magnitude.
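
A minimal sketch of that ratio, assuming "ndcg per dollar" means ndcg@10 divided by the price per million tokens. That reading is an assumption, though it does reproduce the two figures quoted above. The prices used are the `$0.0102` and `$0.12` per-1M-token figures from the throughput table later in this post, and the function name is hypothetical.

```python
def ndcg_per_dollar(ndcg_at_10: float, usd_per_million_tokens: float) -> float:
    """Quality per unit price: ndcg@10 divided by the $/1M-token price.

    Assumed definition; the post does not spell out the formula.
    """
    return ndcg_at_10 / usd_per_million_tokens


# NFCorpus scores from the quality table, prices from the throughput table.
sie = ndcg_per_dollar(0.4215, 0.0102)   # stella_en_1.5B_v5 on RTX
voyage = ndcg_per_dollar(0.4279, 0.12)  # voyage-4-large
print(f"SIE {sie:.1f}, Voyage {voyage:.1f}, ratio {sie / voyage:.1f}x")
```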

The clearest way to feel that is a real ingestion job. Embedding a one-billion-token corpus (roughly 1.95M chunks of 512 tokens) costs this much:

| Option | Cost for 1B tokens | Time on one worker |
| --- | ---: | --- |
| **SIE `stella` (RTX)** | **`$10.22`** | `3.4 h` (`1.7 h` on two workers) |
| SIE `stella` (L4) | `$15.53` | `19.4 h` |
| **SIE `Qwen3-Embedding-4B` (RTX)** | **`$17.29`** | `5.7 h` |
| `voyage-4-lite` / `openai/text-embedding-3-small` | `$20.00` | tier-capped |
| `voyage-4` | `$60.00` | tier-capped |
| `cohere/embed-v4.0` | `$120.00` | `~16 h` at flat cap |
| `voyage-4-large` | `$120.00` | tier-capped |
| `openai/text-embedding-3-large` | `$130.00` | tier-capped |

A frontier-quality embed of a billion tokens runs roughly \$10 to \$17 on SIE versus \$120 to \$130 on the hosted frontier, which is 7 to 13 times cheaper. SIE's `stella` at `$10.22` even undercuts the cheapest hosted lite tier (`$20`) by about 2x, and you can shrink the wall-clock as far as you like by adding workers.
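
The SIE rows follow from two inputs: sustained tokens per second per worker and the GPU's hourly price. The sketch below uses the per-worker throughput and the Modal prices (`$3.03/hr` RTX PRO 6000, `$0.80/hr` L4) quoted elsewhere in this post, and it assumes throughput scales linearly with workers. `ingestion_job` is an illustrative helper, not an SIE API.

```python
def ingestion_job(total_tokens, tok_per_s, gpu_usd_per_hour, workers=1):
    """Return (wall_clock_hours, total_usd) for one sustained embedding job."""
    gpu_hours = total_tokens / tok_per_s / 3600  # total GPU time, any worker count
    wall_clock_hours = gpu_hours / workers       # assumes linear scaling
    total_usd = gpu_hours * gpu_usd_per_hour     # cost does not depend on workers
    return wall_clock_hours, total_usd


ONE_BILLION = 1_000_000_000
# (label, sustained tok/s per worker, GPU $/hr), figures quoted in this post.
CONFIGS = [
    ("stella (RTX)", 82_382, 3.03),
    ("stella (L4)", 14_307, 0.80),
    ("Qwen3-Embedding-4B (RTX)", 48_675, 3.03),
]

for label, tok_s, rate in CONFIGS:
    hours, usd = ingestion_job(ONE_BILLION, tok_s, rate)
    print(f"{label}: {hours:.1f} h, ${usd:.2f}")
```

Passing `workers=2` halves the wall-clock time and leaves the dollar figure unchanged, which is why adding workers only buys speed.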

One honest caveat on the framing: these are sustained, high-utilization ingestion economics, the workload SIE is built for. If your pattern is occasional low-volume query bursts, a hosted API's pay-per-call model can be the simpler fit. The story below is about embedding at scale.

## How fast are the embeddings?

3 to 7 times faster on a single request, comparing p50 against SIE's `27 ms` models. SIE returns an embedding in `15` to `27 ms`; the hosted APIs take `84` to `180 ms`.

| | p50 latency | vs SIE |
| --- | ---: | --- |
| **SIE `bge-m3` (RTX)** | **`15 ms`** | baseline |
| **SIE `stella` / `Qwen3-Embedding-4B` (RTX)** | **`27 ms`** | baseline |
| `cohere/embed-v4.0` | `84 ms` | `3.1` to `5.6×` slower |
| `openai/text-embedding-3-large` | `157 ms` | `5.8` to `10.5×` slower |
| `voyage-4-large` | `180 ms` | `6.7` to `12×` slower |

The tail is where it gets lopsided. SIE's p99 stays within a few milliseconds of its p50 (still tens of ms total), while provider p99 runs into the hundreds and sometimes thousands of milliseconds. When you are embedding inside a request path, that predictability is worth as much as the median.
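
If you want to check p50 and p99 against your own endpoint, the measurement itself is short. This is a generic sketch, not the harness behind the numbers above: `latency_percentiles` is a hypothetical helper, and `fake_embed` is a stand-in that sleeps instead of calling a real embedding service.

```python
import statistics
import time


def latency_percentiles(call, n_requests=200):
    """Time `call()` n_requests times (at least 2); return (p50_ms, p99_ms)."""
    samples_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        samples_ms.append((time.perf_counter() - start) * 1000)
    # n=100 yields 99 cut points; index 49 is the median, index 98 is p99.
    # method="inclusive" keeps every cut point inside the observed range.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


def fake_embed():
    """Stand-in for one embedding request; swap in a real client call."""
    time.sleep(0.002)


p50, p99 = latency_percentiles(fake_embed, n_requests=50)
print(f"p50 {p50:.1f} ms, p99 {p99:.1f} ms, tail gap {p99 - p50:.1f} ms")
```

Sequential single requests like this measure latency, not throughput. A p99 estimate is also noisy at small sample counts, so use a few thousand requests before comparing tails.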

## What about throughput and rate limits?

This is the part the per-token pricing hides. Hosted throughput is capped by your usage tier, full stop. SIE scales linearly with hardware at a constant unit cost, and the math is boring in the best way.

A single SIE worker's sustained throughput (the corpus "knee," in tokens per second) and its price look like this:

| | Sustained tok/s per worker | Query p50 | \$/1M tokens |
| --- | ---: | ---: | ---: |
| **SIE `stella_1.5B` (RTX)** | `82,382` | `27 ms` | **`$0.0102`** |
| **SIE `Qwen3-Embedding-4B` (RTX)** | `48,675` | `27 ms` | **`$0.0173`** |
| `voyage-4-large` | `150,000` (tier cap) | `180 ms` | `$0.12` |
| `openai/text-embedding-3-large` | `166,667` (tier cap) | `157 ms` | `$0.13` |
| `cohere/embed-v4.0` | `~17,067` (flat) | `84 ms` | `$0.12` |
| `voyage-4-lite` | `800,000` (tier cap) | `153 ms` | `$0.02` |

Where one SIE worker trails a provider's tier ceiling, you close the gap by adding workers, and the unit cost does not move. Nodes are `ceil(load / knee)`, and cost per token is `GPU$/hr / (3600 × knee)`, which is independent of node count.

| To match | `stella` nodes | `Qwen3-Embedding-4B` nodes | SIE \$/1M (held) | vs hosted |
| --- | ---: | ---: | ---: | --- |
| `voyage-4-large` 150k tok/s | `2` RTX | `4` RTX | `~$0.010 / ~$0.017` | 7 to 12x cheaper than `$0.12` |
| `openai/text-embedding-3-large` 167k tok/s | `3` RTX | `4` RTX | `~$0.010 / ~$0.017` | 8 to 13x cheaper than `$0.13` |

So you match a provider's entire off-the-shelf capacity with two to four GPU workers, stay 7 to 13 times cheaper per token, and are already faster on latency. There is no tier to negotiate and no wall to hit.
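
The two formulas in this section fit in a few lines. The sketch below takes the per-worker knees and tier caps from the tables above and the `$3.03/hr` RTX PRO 6000 price from the reproducibility note. The function names are illustrative. The unit-cost function deliberately takes the node count as an argument to show that it cancels out.

```python
import math


def nodes_needed(target_tok_s, knee_tok_s):
    """Workers required to sustain a target load: ceil(load / knee)."""
    return math.ceil(target_tok_s / knee_tok_s)


def usd_per_million_tokens(nodes, gpu_usd_per_hour, knee_tok_s):
    """Fleet cost per hour over fleet tokens per hour; `nodes` cancels out."""
    fleet_usd_per_hour = nodes * gpu_usd_per_hour
    fleet_tokens_per_hour = nodes * knee_tok_s * 3600
    return fleet_usd_per_hour / fleet_tokens_per_hour * 1_000_000


RTX_USD_PER_HOUR = 3.03
KNEES = {"stella": 82_382, "Qwen3-Embedding-4B": 48_675}
TARGETS = {"voyage-4-large": 150_000, "openai/text-embedding-3-large": 166_667}

for provider, load in TARGETS.items():
    for model, knee in KNEES.items():
        n = nodes_needed(load, knee)
        price = usd_per_million_tokens(n, RTX_USD_PER_HOUR, knee)
        print(f"{provider}: {n} x {model} at ${price:.4f}/1M tokens")
```

Like the table, this extrapolates from single-worker throughput and assumes every node runs at its knee.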

## Can you trade cost for speed?

Yes, and that lever does not exist on a hosted API. Because you pick the GPU, you pick your operating point on the cost-versus-throughput curve.

| Model | L4: tok/s @ \$/1M | RTX: tok/s @ \$/1M | Pick |
| --- | --- | --- | --- |
| `e5-base-v2` | `181,292` @ `$0.0012` | `681,551` @ `$0.0012` | RTX (3.8x throughput, same cost) |
| `bge-m3` | `55,338` @ `$0.0040` | `235,430` @ `$0.0036` | RTX (faster and cheaper) |
| `stella_1.5B` | `14,307` @ `$0.0155` | `82,382` @ `$0.0102` | RTX (faster and cheaper) |

For most models the RTX PRO 6000 is both faster and cheaper per token, because its throughput gain exceeds its roughly 3.8x hourly price premium (`$3.03` versus `$0.80`). The L4 is there when you want to minimize the hourly spend on lighter models. Either way, you set the dial. A hosted API sets it for you and bills accordingly.
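
The choice reduces to one comparison: the RTX is cheaper per token whenever its throughput speedup over the L4 exceeds the ratio of hourly prices. A small sketch using the throughput figures from the table above and the hourly prices from the reproducibility note below; the names are illustrative.

```python
GPU_USD_PER_HOUR = {"L4": 0.80, "RTX": 3.03}

# Sustained tok/s per worker, copied from the table above.
KNEE_TOK_S = {
    "e5-base-v2": {"L4": 181_292, "RTX": 681_551},
    "bge-m3": {"L4": 55_338, "RTX": 235_430},
    "stella_1.5B": {"L4": 14_307, "RTX": 82_382},
}


def rtx_vs_l4(model):
    """Return (throughput speedup, per-token cost ratio) of RTX relative to L4.

    A cost ratio below 1.0 means the RTX is cheaper per token.
    """
    speedup = KNEE_TOK_S[model]["RTX"] / KNEE_TOK_S[model]["L4"]
    price_ratio = GPU_USD_PER_HOUR["RTX"] / GPU_USD_PER_HOUR["L4"]
    return speedup, price_ratio / speedup


for model in KNEE_TOK_S:
    speedup, cost_ratio = rtx_vs_l4(model)
    print(f"{model}: {speedup:.1f}x throughput, {cost_ratio:.2f}x cost per token")
```

With these prices the break-even speedup is about 3.8x, which is why `e5-base-v2` lands at essentially the same cost per token on either GPU while the heavier models come out cheaper on the RTX.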

## What this benchmark covers, and what's next

This is the search and embedding slice of the SIE catalog. Embedding is one of the jobs SIE serves, not the whole picture, and we are running this same quality, performance, and cost methodology across the rest of the catalog. Treat this as the first installment, with more of the catalog reported the same way as it lands.

On reproducibility: every figure here comes from a closed observations dataset that passes full provenance checks (`verify.py` at 194/194 references OK, `build.py` at zero cache mismatches), measured on Modal GPUs (L4 at `$0.80/hr`, RTX PRO 6000 at `$3.03/hr`). Cost figures are the sustained, high-utilization floor at single on-demand GPU pricing, with no high-availability or idle overhead modeled. The scaling rows are extrapolations from measured per-worker throughput using the formulas above, not separately measured at every node count.

## The takeaway

If you embed at scale, self-hosting on SIE gets you essentially frontier-quality retrieval at roughly one-twelfth the cost, several times the speed, and no rate-limit ceiling. The hosted APIs still win on zero-ops convenience and on spiky, low-volume workloads. For sustained ingestion and serving, the economics are not close.

[Star SIE on GitHub](https://github.com/superlinked/sie) · [Read the encode docs](/docs/encode) · [Browse the model catalog](/models)
