
What is a Text Embedding Model?

A text embedding model is a neural network that converts text — a word, sentence, paragraph, or document — into a dense numerical vector that captures its semantic meaning. Texts with similar meanings produce vectors that are close together in the embedding space, enabling similarity search, clustering, classification, and retrieval without exact keyword matching.


Why do text embedding models matter?

Computers can’t natively understand language — they need numerical representations. Text embedding models provide those representations in a form that preserves meaning. They are the core component of:

  • Semantic search — retrieve documents by meaning, not just keywords
  • RAG pipelines — encode your knowledge base so an LLM can retrieve relevant context
  • Recommendation — find items similar to what a user has engaged with
  • Classification — cluster or label documents based on semantic content
  • Deduplication — detect near-duplicate documents in large corpora

Without a text embedding model, each of these falls back to exact keyword matching or hand-built features.


How does a text embedding model work?

Modern text embedding models are transformer-based encoder networks (BERT-style). Given a text input:

  1. Tokenise — split text into subword tokens
  2. Encode — pass tokens through transformer layers, each token attending to all others
  3. Pool — aggregate the token representations into a single fixed-size vector (mean pooling or [CLS] token)
  4. Normalise — L2-normalise the vector so that cosine similarity equals the dot product

For example:

    "self-hosted inference" → [0.021, -0.143, 0.087, ..., 0.034]  # 768 or 1024 dims

Cosine similarity ranges from -1 to 1. Two semantically similar texts produce vectors with high cosine similarity (close to 1.0), while unrelated texts produce low similarity (close to 0.0).
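The pool and normalise steps can be sketched in a few lines of plain Python. This is a toy illustration only: the token vectors below are made up and have 4 dimensions, whereas a real model produces them from its transformer layers. It shows that, once vectors are L2-normalised, the dot product and the cosine similarity give the same number.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly, without pre-normalising."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up token representations (4 dims); real models emit 768 or 1024 dims.
tokens_a = [[0.2, 0.1, 0.0, 0.4], [0.3, 0.0, 0.1, 0.5], [0.1, 0.2, 0.0, 0.3]]
tokens_b = [[0.25, 0.1, 0.05, 0.45], [0.2, 0.1, 0.0, 0.35]]

emb_a = l2_normalise(mean_pool(tokens_a))
emb_b = l2_normalise(mean_pool(tokens_b))

# After L2 normalisation, the dot product equals the cosine similarity.
similarity = dot(emb_a, emb_b)
```

Because the two inputs have different token counts but pool to the same dimension, the resulting vectors can be compared directly.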


What makes a good embedding model?

Training data — models trained on large, diverse collections of text pairs (for example MS MARCO and NLI data) generalise better across domains. Benchmarks such as BEIR measure that cross-domain generalisation.

Training objective — contrastive losses such as Multiple Negatives Ranking Loss (MNRL) and InfoNCE pull similar texts together and push dissimilar ones apart, which produces better retrieval representations than generic language modelling objectives.
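To make the contrastive objective concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, written in plain Python rather than a training framework. The function name, the temperature of 0.05, and the 2-dimensional vectors are illustrative assumptions, not values taken from any particular model.

```python
import math

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy of the positive document against the whole batch.
        total += log_denominator - logits[i]
    return total / len(query_vecs)

# Unit vectors: in `aligned`, each query points at its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned)  # near zero
loss_swapped = info_nce_loss(queries, swapped)  # large
```

The loss is small when each query is closest to its own document and large when a negative outranks the positive, which is the "pull together, push apart" behaviour described above.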

Context window — longer context windows (up to 8,192 tokens for BGE-M3) handle full documents, not just short passages.

Multilingual support — multilingual models like BGE-M3 handle 100+ languages from a single model.

Task specificity — instruction-following models (E5-instruct, GTE-instruct) let you specify the retrieval task at encoding time, improving accuracy for asymmetric tasks (query vs document).
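As a rough illustration of specifying the task at encoding time, the sketch below prefixes the query with a one-line instruction. The template follows the format described on the E5 instruct model cards; other model families use different templates, so treat the exact wording and the helper name `format_query` as assumptions and check the card for the model you deploy.

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a task instruction (template assumed from the E5 instruct model cards)."""
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
query_text = format_query(task, "how to reduce embedding costs")

# Documents are encoded as-is: only the query side carries the instruction,
# which is what makes the setup asymmetric.
document_text = "self-hosted inference reduces API costs"
```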


Embedding model benchmarks: what to look for

The primary benchmark is MTEB (Massive Text Embedding Benchmark) — a comprehensive evaluation across retrieval, clustering, classification, semantic similarity, and reranking tasks.

Key retrieval metrics:

  • NDCG@10 — ranking quality at 10 results (primary metric)
  • MRR@10 — mean reciprocal rank of the first relevant result
  • Recall@100 — how many relevant docs are in the top 100 (for two-stage pipelines)
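The three metrics above can be computed for a single query as sketched below. The document ids and relevance labels are invented for illustration; MRR@10 is the mean of the per-query reciprocal rank across all evaluation queries.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k; `relevance` maps doc id -> gain, and missing ids count as 0."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# One query: the system returned four documents, two of which are relevant.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1, "d2": 1}

ndcg = ndcg_at_k(ranking, gains, k=10)
rr = reciprocal_rank_at_k(ranking, relevant, k=10)
recall = recall_at_k(ranking, relevant, k=100)
```

Here the first relevant document sits at rank 2, so the reciprocal rank is 0.5, while recall is 1.0 because both relevant documents were returned. NDCG falls between the two because it penalises relevant results that appear lower in the list.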

When choosing a model for production, also consider:

  • Inference latency — larger models are slower
  • Vector dimension — higher dims = more storage
  • Max tokens — how much text per encode call
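The storage cost of vector dimension is simple arithmetic: vectors × dimensions × bytes per value. The sketch below assumes float32 values (4 bytes) and counts raw vectors only, so any ANN index overhead and metadata come on top. The corpus size of one million documents is an arbitrary example.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense vector store, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value

size_1024 = index_size_bytes(1_000_000, 1024)  # e.g. a 1024-dim model, float32
size_384 = index_size_bytes(1_000_000, 384)    # e.g. a small 384-dim model

print(f"1024 dims: {size_1024 / 1e9:.2f} GB")
print(f" 384 dims: {size_384 / 1e9:.2f} GB")
```

For one million documents this works out to roughly 4.1 GB at 1024 dimensions versus about 1.5 GB at 384 dimensions.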

Bi-encoder vs cross-encoder

|                 | Bi-encoder (embedding model)   | Cross-encoder (reranker) |
| --------------- | ------------------------------ | ------------------------ |
| Encodes         | Query and docs independently   | Query + doc jointly      |
| Speed           | Fast (pre-compute doc vectors) | Slow (runtime per pair)  |
| Scalability     | Millions of docs               | Hundreds of candidates   |
| Accuracy        | Good                           | Higher                   |
| Use in pipeline | First-stage retrieval          | Second-stage reranking   |

Bi-encoders (text embedding models) handle the retrieval step; cross-encoders handle the reranking step. Both are hosted on SIE.
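The division of labour between the two stages can be sketched as follows. Everything here is a stand-in: the 2-dimensional vectors are invented, and `toy_cross_score` is a hypothetical word-overlap function used in place of a real cross-encoder, purely to show where a joint query-document scorer plugs in.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, doc_vecs, top_k):
    """First stage: score every precomputed document vector, keep the best top_k indices."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:top_k]

def toy_cross_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def rerank(query, docs, candidate_ids, score_pair):
    """Second stage: score each (query, doc) pair jointly and re-sort the candidates."""
    return sorted(candidate_ids, key=lambda i: score_pair(query, docs[i]), reverse=True)

docs = [
    "self-hosted inference reduces API costs",
    "deploy embedding models on your own GPU",
    "RAG pipeline with Qdrant and BGE-M3",
]
# Made-up 2-d vectors standing in for real document embeddings.
doc_vecs = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]
query = "reduce API costs"
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_k=2)       # cheap, runs over all docs
final = rerank(query, docs, candidates, toy_cross_score)  # expensive, runs over 2 docs
```

The first stage touches every document but only does a dot product per document. The second stage runs the costlier pairwise scorer on the short candidate list only, and in this toy data it promotes document 0 above document 1.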


How do you use a text embedding model with SIE?

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode a batch of documents at index time
doc_results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference reduces API costs"),
        Item(text="deploy embedding models on your own GPU"),
        Item(text="RAG pipeline with Qdrant and BGE-M3"),
    ],
)
doc_vectors = [r["dense"] for r in doc_results]

# Encode a query at search time
query_vector = client.encode(
    "BAAI/bge-m3",
    Item(text="how to reduce embedding costs"),
    is_query=True,
)["dense"]

# doc_vectors and query_vector are now ready for ANN search

SIE supports 85+ embedding models. Documents are encoded on your own GPU in your own AWS or GCP account — no data sent to external APIs.


How do you choose the right embedding model?

| Priority              | Recommended model(s)             |
| --------------------- | -------------------------------- |
| Best general accuracy | BGE-M3, GTE-large, E5-large-v2   |
| Multilingual          | BGE-M3, multilingual-e5-large    |
| Fastest / smallest    | BGE-small-en, all-MiniLM-L6-v2   |
| Long documents        | BGE-M3 (8,192 tokens)            |
| Instruction-following | E5-mistral-7b-instruct, GTE-Qwen |
| Domain-specific       | BGE-M3 + LoRA adapter            |

Start with BGE-M3 as a default: it handles multilingual text, long documents, and dense, sparse, and multi-vector retrieval from a single model.


Frequently asked questions

What is the difference between an embedding model and an LLM? An embedding model (typically an encoder) compresses text into a fixed-size vector and does not generate text. An LLM (typically a decoder) generates text token by token. Embedding models are generally faster and cheaper to run, and are purpose-built for retrieval. In a RAG pipeline, the embedding model retrieves context and the LLM handles generation.

Do I need a GPU to run an embedding model? For production workloads, yes. CPU inference is 10–50× slower and impractical for encoding large corpora or serving real-time queries. SIE manages GPU provisioning on AWS or GCP.

How many tokens can an embedding model handle? It depends on the model. Many BERT-style models are limited to 512 tokens, while BGE-M3 handles 8,192. For longer documents, chunk the text into overlapping segments and encode each chunk separately.
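Overlapping chunking can be sketched as a sliding window over the token list. The window of 512 tokens and overlap of 64 are illustrative defaults, not recommendations from any model card, and the integer list stands in for real token ids from a tokeniser.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for real token ids
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
```

Each chunk is then encoded separately, and the overlap reduces the chance that a sentence split across a boundary is lost to both neighbours.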


