Best OpenAI Embeddings Alternatives You Can Run in Your Own Cloud

OpenAI’s text-embedding-3-small, text-embedding-3-large, and the older ada-002 are billed per token and run on OpenAI’s servers. At any real volume, that adds up: every document you index and every query you run leaves your cloud, and you pay for each one. For a lot of teams that is fine. For teams with a large corpus, strict data-residency rules, or a growing token bill, it is worth knowing that open embedding models are now good enough to replace hosted embeddings for most search and RAG work.

There is no single best embedding model. The right pick depends on your language mix, your latency budget, and how much VRAM you want to spend. Here are eight strong options you can run in your own cloud. All eight are open weights and listed in the SIE model catalog. SIE serves any of them behind one API and runs MTEB quality evals in CI against saved targets for every supported model.

What OpenAI gives you, and what it costs

OpenAI embeddings are convenient. One API key, a stable endpoint, and models that produce 1536-dimensional vectors (text-embedding-3-small, ada-002) or 3072-dimensional vectors (text-embedding-3-large), up to 8192 input tokens. You never think about GPUs.

The trade-offs show up at scale. You are billed per token for embeddings and again for anything you rerank through a hosted API, so the bill grows with every document you index and every query you serve. Your text has to leave your infrastructure to be embedded, which is a non-starter for some regulated workloads. And you do not control when the model changes underneath you, which can mean re-embedding your entire index on someone else’s schedule.

Open models remove all three of those. The cost is that you now run the inference yourself, which is exactly the part SIE handles.
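
As a rough way to compare the two billing models, the sketch below puts per-token pricing next to provisioned-GPU pricing. Every number in it (corpus size, price per million tokens, throughput, GPU hourly rate) is a placeholder assumption, not a quoted price or a measured benchmark, and the model ignores idle GPU time and operational overhead.

```python
# Back-of-envelope comparison of per-token billing and provisioned compute.
# All rates and volumes here are hypothetical placeholders; substitute your own.


def hosted_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Per-token billing: cost scales linearly with the tokens embedded."""
    return tokens / 1_000_000 * price_per_million_tokens


def self_hosted_cost(tokens: int, tokens_per_second: float, gpu_price_per_hour: float) -> float:
    """Provisioned compute: cost is the GPU time needed to embed the tokens."""
    hours = tokens / tokens_per_second / 3600
    return hours * gpu_price_per_hour


if __name__ == "__main__":
    corpus_tokens = 1_000_000_000  # hypothetical corpus size
    api_bill = hosted_cost(corpus_tokens, price_per_million_tokens=0.10)  # hypothetical rate
    gpu_bill = self_hosted_cost(
        corpus_tokens,
        tokens_per_second=20_000,  # hypothetical throughput; measure your own
        gpu_price_per_hour=2.00,   # hypothetical GPU rate
    )
    print(f"hosted: ${api_bill:,.2f}  self-hosted: ${gpu_bill:,.2f}")
```

The output only reflects the placeholders. The useful part is the shape: one cost is driven by token count alone, the other by throughput and how long the GPU stays provisioned.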

1. Qwen3-Embedding

Qwen/Qwen3-Embedding-4B ranks highly on MTEB retrieval tasks and handles many languages. If you want one multilingual encoder to try first, start here. The lighter Qwen/Qwen3-Embedding-0.6B trades some quality for a smaller footprint when throughput or cost matters.

Best for: a high-quality multilingual default. SIE model ID: Qwen/Qwen3-Embedding-4B (or Qwen/Qwen3-Embedding-0.6B).

2. BGE-M3

BAAI/bge-m3 produces dense, sparse, and multi-vector representations from one model, which makes it the natural choice for hybrid retrieval. It is multilingual and handles long inputs. Instead of running a separate model for keyword-style sparse matching and another for dense semantic search, you get both from a single encoder, which simplifies the pipeline considerably.

Best for: hybrid dense-plus-sparse retrieval from one model. SIE model ID: BAAI/bge-m3.
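
One common way to use the two representations together is to score each document with both and blend the results. The sketch below assumes the dense output is a list of floats and the sparse output is a token-to-weight mapping. The function names and the alpha weight are illustrative choices, not part of BGE-M3 or SIE, and a production setup might use rank fusion instead of a weighted sum.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Lexical match score: sum of weight products over tokens both sides share."""
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha: float = 0.7) -> float:
    """Weighted blend of dense and sparse scores; alpha is an assumed tuning knob."""
    return alpha * cosine(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)


if __name__ == "__main__":
    # Toy values standing in for real model output.
    score = hybrid_score(
        q_dense=[0.2, 0.8],
        d_dense=[0.1, 0.9],
        q_sparse={"gpu": 0.6, "eviction": 0.4},
        d_sparse={"eviction": 0.5, "cache": 0.3},
    )
    print(round(score, 4))
```

Sparse scores are not bounded the way cosine similarity is, so in practice the two are often normalized before blending.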

3. Stella v5

NovaSearch/stella_en_1.5B_v5 gives excellent English retrieval quality at a modest size, which makes it a strong middle ground between the tiny models and the leaderboard heavyweights. If your corpus is primarily English and you want quality without a large GPU bill, Stella is a reliable pick.

Best for: high-quality English retrieval at a moderate footprint. SIE model ID: NovaSearch/stella_en_1.5B_v5.

4. NV-Embed-v2

nvidia/NV-Embed-v2 ranks near the top of the MTEB leaderboard on retrieval tasks (see also the model picking guide). It is a large model, so it needs real GPU memory. Use it when retrieval quality matters more than compute budget.

Best for: maximum retrieval quality when GPU is available. SIE model ID: nvidia/NV-Embed-v2.

5. Nomic Embed v2

nomic-ai/nomic-embed-text-v2-moe uses a mixture-of-experts design with fully open training data and long-context support. The open training data is a genuine differentiator for teams that need to know exactly what went into the model, and the long context helps when you embed larger chunks.

Best for: long-context embedding with fully open training data. SIE model ID: nomic-ai/nomic-embed-text-v2-moe.

6. mxbai-embed-large

mixedbread-ai/mxbai-embed-large-v1 is a solid English general-purpose encoder. The model card reports strong MTEB scores across retrieval, clustering, and classification.

Best for: a dependable English general-purpose default. SIE model ID: mixedbread-ai/mxbai-embed-large-v1.

7. Snowflake Arctic-Embed

Snowflake/snowflake-arctic-embed-l-v2.0 is retrieval-tuned and multilingual, and the medium variant (snowflake-arctic-embed-m-v2.0) delivers strong retrieval at a small footprint. It is a good option when you want retrieval-focused quality across languages without paying for a large model.

Best for: retrieval-tuned multilingual search at a small-to-medium size. SIE model ID: Snowflake/snowflake-arctic-embed-l-v2.0 (or the -m- medium variant).

8. EmbeddingGemma

google/embeddinggemma-300m is tiny (300M parameters), which suits on-device, edge, and high-throughput paths where latency and memory matter more than leaderboard rank.

Best for: on-device and high-throughput workloads. SIE model ID: google/embeddinggemma-300m.

Also worth knowing

A few more that did not make the main list but are in the model catalog and worth trying on your own data: Salesforce/SFR-Embedding-2_R is another high-MTEB encoder alongside NV-Embed-v2; the E5 family (intfloat/multilingual-e5-large-instruct, intfloat/e5-large-v2) is a reliable, well-understood baseline; and ibm-granite/granite-embedding-english-r2 is Apache 2.0 and aimed at enterprise use.

On the hosted competitors: Cohere Embed and Voyage are good models, but they are still metered APIs where your data leaves your cloud, so they do not meet the “run it in your own cloud” bar this article is about. If self-hosting is not a requirement for you, they are worth a look.

Rather than trust any single leaderboard number, pick two or three candidates for your language and domain and evaluate them on your own queries. The SIE examples gallery includes a retrieval ablation that ranks several encoder, reranker, and multi-vector pipelines head to head on real queries. Quality in retrieval usually comes from choosing the right model and reranking well, not from picking the largest model.
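
A small harness is enough to start evaluating on your own queries. The sketch below computes recall@k for any embedding function you pass in. The toy embedder is a stand-in so the example runs without a server. In practice you would wrap a call to each candidate model behind the same signature, and the names here (recall_at_k, make_toy_embedder) are hypothetical.

```python
import math
from typing import Callable, Sequence

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], list[list[float]]]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(embed: Embedder, queries, docs, relevant, k: int = 5) -> float:
    """Fraction of queries whose known-relevant document index lands in the top k."""
    doc_vecs = embed(docs)
    query_vecs = embed(queries)
    hits = 0
    for query_vec, target in zip(query_vecs, relevant):
        ranked = sorted(
            range(len(docs)),
            key=lambda i: cosine(query_vec, doc_vecs[i]),
            reverse=True,
        )
        hits += target in ranked[:k]
    return hits / len(queries)


def make_toy_embedder(dim: int = 64) -> Embedder:
    """Bag-of-words stand-in for a real model: one vector slot per distinct word."""
    vocab: dict[str, int] = {}

    def embed(texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * dim
            for word in text.lower().split():
                slot = vocab.setdefault(word, len(vocab)) % dim
                vec[slot] += 1.0
            vectors.append(vec)
        return vectors

    return embed


docs = ["reset a password", "configure gpu autoscaling", "export billing report"]
queries = ["how to reset my password", "gpu autoscaling setup"]
relevant = [0, 1]  # index of the correct document for each query

if __name__ == "__main__":
    print(recall_at_k(make_toy_embedder(), queries, docs, relevant, k=1))
```

Running the same function once per candidate model, on a few hundred labeled queries from your own domain, gives a comparison that is more relevant to your workload than a leaderboard position.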

How to actually run them: SIE

Serving an open embedding model in production usually means standing up a model server, wiring in autoscaling and monitoring, and doing it again for every model you want to try. SIE collapses that into one system. It serves 85+ pre-configured models from one cluster, loads each on demand, and evicts the least recently used one, so a single GPU serves a rotating set of models instead of one per container.
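
The load-on-demand and least-recently-used eviction idea can be illustrated in a few lines. This is a generic sketch of the technique, not SIE's implementation: the class name is hypothetical, and counting capacity in models is a simplifying assumption where a real server would more plausibly budget by GPU memory.

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUModelCache:
    """Keeps at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader
        self._models = OrderedDict()  # oldest entry first, most recent last

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self.capacity:
            # Cache full: drop the least recently used model before loading.
            self._models.popitem(last=False)
        model = self.loader(model_id)
        self._models[model_id] = model
        return model

    def loaded(self) -> list[str]:
        """Model ids currently resident, least recently used first."""
        return list(self._models)


if __name__ == "__main__":
    # The loader is a stand-in; a real one would load weights onto the GPU.
    cache = LRUModelCache(capacity=2, loader=lambda model_id: f"<{model_id}>")
    for model_id in ["BAAI/bge-m3", "google/embeddinggemma-300m", "BAAI/bge-m3", "nvidia/NV-Embed-v2"]:
        cache.get(model_id)
    print(cache.loaded())
```

The trade-off with this pattern is that the first request after an eviction pays the model load time, so it suits a rotating set of models better than many models that are all hot at once.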

The migration path is short for dense embeddings. SIE exposes an OpenAI-compatible /v1/embeddings endpoint. Point the standard OpenAI client at your SIE server, set any string as the API key, and you get the same response shape. Sparse, multi-vector, reranking, and extraction need SIEClient or a framework adapter instead. The OpenAI dimensions parameter, which shortens vectors, is ignored, so you get each model’s native vector size. See the integrations page and OpenAI migration guide for the full mapping.

from openai import OpenAI

# Was: client = OpenAI(api_key=OPENAI_API_KEY)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-4B",
    input="self-hosted embeddings, one URL change",
)
print(len(resp.data[0].embedding))

The same pattern works through LangChain’s OpenAIEmbeddings class:

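from langchain_openai import OpenAIEmbeddings
# Set the key and model explicitly; otherwise the class reads OPENAI_API_KEY and defaults to an OpenAI model name.
embeddings = OpenAIEmbeddings(base_url="http://localhost:8080/v1", api_key="not-needed", model="Qwen/Qwen3-Embedding-4B")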

For sparse, multi-vector, reranking, or extraction, use the native SIEClient:

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
result = client.encode("BAAI/bge-m3", Item(text="hybrid retrieval from one model"))
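
Because vectors come back at each model's native size, an index created for 1536-dimensional OpenAI vectors will generally not accept them unchanged. A minimal guard, assuming you know the dimension your vector index was created with, might look like this. The helper name check_dimension is hypothetical.

```python
def check_dimension(vectors: list[list[float]], index_dim: int) -> int:
    """Return the vectors' shared dimension, or raise if it does not fit the index."""
    if not vectors:
        raise ValueError("no vectors to check")
    dims = {len(vector) for vector in vectors}
    if len(dims) != 1:
        raise ValueError(f"mixed vector sizes in one batch: {sorted(dims)}")
    (dim,) = dims
    if dim != index_dim:
        raise ValueError(
            f"model returned {dim}-dimensional vectors but the index expects {index_dim}; "
            "create a new index and re-embed the corpus"
        )
    return dim


if __name__ == "__main__":
    # Toy batch standing in for [item.embedding for item in resp.data].
    batch = [[0.0] * 8, [1.0] * 8]
    print(check_dimension(batch, index_dim=8))
```

Running a check like this on the first batch after switching models surfaces a mismatch before any writes reach the vector database.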

Because SIE runs on your own infrastructure, there is no per-token API bill and embeddings stay on your network. You pay for the compute you provision. The same Docker image runs on a laptop for development and scales to a Kubernetes cluster with a load-balancing gateway, KEDA autoscaling with scale-from-zero, Grafana dashboards, and Terraform modules for GKE and EKS, all Apache 2.0.

Frequently asked questions

Are open embedding models as good as OpenAI’s?

For retrieval, RAG, classification, and clustering, well-chosen open models are competitive with OpenAI’s, and several rank highly on MTEB retrieval tasks. Evaluate two or three on your own data rather than trusting a single benchmark number.

Can I keep my vector database and framework?

Yes. SIE integrates with LangChain, LlamaIndex, Haystack, DSPy, CrewAI, Chroma, Qdrant, and Weaviate. For dense embeddings, the OpenAI-compatible endpoint usually means a base-URL and model-id change.

How does this change my costs?

You stop paying per token and start paying for the compute you provision. Multiple models share a GPU through on-demand loading, so utilization stays high.

What about Azure OpenAI?

If you already use the standard OpenAI client, point base_url at your SIE cluster and pick a SIE model id. The AzureOpenAI client (deployment names, api_version, Azure-specific auth) does not map one-to-one; switch to the plain OpenAI client or LangChain’s OpenAIEmbeddings with SIE’s URL. See the OpenAI migration guide.
