
What is a Text Embedding Model?

A text embedding model is a neural network that converts text (a word, sentence, paragraph, or document) into a dense numerical vector that captures its semantic meaning. Texts with similar meanings produce vectors that are close together in the embedding space, enabling similarity search, clustering, classification, and retrieval without exact keyword matching.

Why do text embedding models matter?

Computers can’t natively understand language; they need numerical representations. Text embedding models provide those representations in a form that preserves meaning. They are the core component of:

Semantic search: retrieve documents by meaning, not just keywords
RAG pipelines: encode your knowledge base so an LLM can retrieve relevant context
Recommendation: find items similar to what a user has engaged with
Classification: cluster or label documents based on semantic content
Deduplication: detect near-duplicate documents in large corpora

Each of these depends on a text embedding model to compare texts by meaning rather than by exact wording.

How does a text embedding model work?

Most modern text embedding models are transformer-based encoder networks (BERT-style); some newer ones, such as E5-mistral-7b-instruct, are built on decoder LLMs instead. Given a text input:

Tokenise: split text into subword tokens
Encode: pass tokens through transformer layers, each token attending to all others
Pool: aggregate the token representations into a single fixed-size vector (mean pooling or [CLS] token)
Normalise: L2-normalise the vector so cosine similarity = dot product

"self-hosted inference" → [0.021, -0.143, 0.087, ..., 0.034]  # 768 or 1024 dims

Two semantically similar texts produce vectors with high cosine similarity (close to 1.0). Unrelated texts score much lower, typically near 0.0. Cosine similarity itself ranges from -1.0 to 1.0.
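The pool and normalise steps can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements them: the token vectors are made-up 2-dimensional values, and the helper names (`mean_pool`, `l2_normalise`) are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / count for value in total]

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy token representations: text A has 3 real tokens plus 1 padding slot.
tokens_a = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0], [9.0, 9.0]]
mask_a = [1, 1, 1, 0]
tokens_b = [[0.0, 2.0], [4.0, 0.0]]
mask_b = [1, 1]

pooled_a = mean_pool(tokens_a, mask_a)
pooled_b = mean_pool(tokens_b, mask_b)
unit_a = l2_normalise(pooled_a)
unit_b = l2_normalise(pooled_b)

# After L2 normalisation, the dot product equals the cosine similarity
# of the unnormalised vectors.
print(dot(unit_a, unit_b), cosine(pooled_a, pooled_b))
```

Normalising once at encode time means the search index only has to compute dot products, which is the cheaper operation.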

What makes a good embedding model?

Training data: models trained on large, diverse retrieval and sentence-pair datasets (such as MS MARCO and NLI) generalise better across domains, which is what zero-shot benchmarks like BEIR measure.

Training objective: contrastive losses (Multiple Negatives Ranking Loss, or MNRL, and InfoNCE) that pull similar texts together and push dissimilar ones apart produce better retrieval representations than generic language modelling objectives.
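To make the contrastive idea concrete, here is a small sketch of an in-batch InfoNCE loss in plain Python. It assumes the vectors are already L2-normalised and that `positives[i]` is the matching text for `queries[i]`; the temperature value is an illustrative choice, not one taken from any specific model.

```python
import math

def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE: positives[i] matches queries[i]; every other
    positive in the batch acts as a negative for that query."""
    losses = []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, p)) / temperature
            for p in positives
        ]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])   # pairs match
shuffled = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
print(aligned < shuffled)
```

The loss is small when each query is closest to its own positive and large when it is closer to another text in the batch, which is the "pull together, push apart" behaviour described above.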

Context window: longer context windows (up to 8,192 tokens for BGE-M3) handle full documents, not just short passages.

Multilingual support: multilingual models like BGE-M3 handle 100+ languages from a single model.

Task specificity: instruction-following models (E5-instruct, GTE-instruct) let you specify the retrieval task at encoding time, improving accuracy for asymmetric tasks (query vs document).
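As an illustration of task instructions, the sketch below builds an instructed query string. The `Instruct: ... / Query: ...` template follows the pattern published on E5-instruct model cards, but treat it as an assumption and check the card of the model you actually use; `format_query` is a hypothetical helper name.

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a one-line task instruction.
    Documents are typically encoded as-is, without an instruction."""
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
instructed = format_query(task, "how to reduce embedding costs")
print(instructed)
```

Only the query side carries the instruction, which is what makes the setup asymmetric.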

Embedding model benchmarks: what to look for

The primary benchmark is MTEB (Massive Text Embedding Benchmark), a comprehensive evaluation across retrieval, clustering, classification, semantic similarity, and reranking tasks.

Key retrieval metrics:

NDCG@10: ranking quality of the top 10 results (primary metric)
MRR@10: mean reciprocal rank of the first relevant result within the top 10
Recall@100: fraction of relevant documents that appear in the top 100 (for two-stage pipelines)
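These three metrics are simple to compute for a single query. The sketch below assumes `ranked_ids` is the list of document ids a system returned, best first, and `relevance` maps each relevant id to a gain; it uses the linear-gain form of DCG (some evaluations use an exponential gain instead). Benchmark scores are the average of these per-query values over all queries.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """Normalised discounted cumulative gain for one query."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank_at_k(ranked_ids, relevance, k=10):
    """1 / rank of the first relevant result; MRR is the mean over queries."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of the relevant documents found in the top k."""
    relevant = {doc_id for doc_id, gain in relevance.items() if gain > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1, "d2": 1}
print(ndcg_at_k(ranked, relevance))
print(reciprocal_rank_at_k(ranked, relevance))  # first relevant hit is at rank 2
print(recall_at_k(ranked, relevance, k=3))      # only d1 is in the top 3
```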

When choosing a model for production, also consider:

Inference latency: larger models are slower per request
Vector dimension: more dimensions mean more storage per vector
Max tokens: how much text fits in one encode call
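The storage cost of vector dimension is straightforward arithmetic. A rough sketch, assuming uncompressed float32 values (4 bytes each) and ignoring any overhead an ANN index adds on top:

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors, excluding index structures and metadata."""
    return num_vectors * dims * bytes_per_value

# Example dimensions only; check the model card for the actual output size.
for dims in (384, 768, 1024):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.2f} GiB per million float32 vectors")
```

Storage grows linearly with both corpus size and dimension, so halving the dimension halves the raw footprint.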

Bi-encoder vs cross-encoder

	Bi-encoder (embedding model)	Cross-encoder (reranker)
Encodes	Query and docs independently	Query + doc jointly
Speed	Fast (pre-compute doc vectors)	Slow (runtime per pair)
Scalability	Millions of docs	Hundreds of candidates
Accuracy	Good	Higher
Use in pipeline	First-stage retrieval	Second-stage reranking

Bi-encoders (text embedding models) handle the retrieval step; cross-encoders handle the reranking step. Both are hosted on SIE.
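The shape of a two-stage pipeline can be sketched independently of any particular model. In the example below, `fast_score` stands in for the bi-encoder similarity and `slow_score` for the cross-encoder; both are toy word-overlap functions chosen only so the sketch runs without a model, and all names are hypothetical.

```python
def two_stage_search(query, docs, fast_score, slow_score, k_retrieve=100, k_final=10):
    """Retrieve with a cheap scorer, then rerank the shortlist with an expensive one."""
    # Stage 1: the cheap score runs over every document.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: the expensive score runs only on the shortlisted candidates.
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)[:k_final]

def overlap(query, doc):
    """Toy first-stage score: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    """Toy second-stage score: shared words divided by all distinct words."""
    a, b = set(query.split()), set(doc.split())
    return len(a & b) / len(a | b)

docs = [
    "reduce embedding costs with self-hosting",
    "embedding costs",
    "gpu deployment guide",
    "cooking pasta at home",
]
top = two_stage_search("reduce embedding costs", docs, overlap, jaccard,
                       k_retrieve=2, k_final=1)
print(top)
```

The expensive scorer is called only `k_retrieve` times per query regardless of corpus size, which is why the table lists cross-encoders as suitable for hundreds of candidates rather than millions of documents.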

How do you use a text embedding model with SIE?

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode a batch of documents at index time
doc_results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference reduces API costs"),
        Item(text="deploy embedding models on your own GPU"),
        Item(text="RAG pipeline with Qdrant and BGE-M3"),
    ],
)
doc_vectors = [r["dense"] for r in doc_results]

# Encode a query at search time
query_vector = client.encode(
    "BAAI/bge-m3",
    Item(text="how to reduce embedding costs"),
    is_query=True,
)["dense"]

# doc_vectors and query_vector are now ready for ANN search
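
As a stand-in for the ANN search step, the sketch below runs an exact top-k search by dot product. It assumes the vectors can be treated as plain sequences of floats and are already L2-normalised; the toy vectors and the `top_k` helper are hypothetical, and a production system would hand the vectors to a vector database instead.

```python
import heapq

def top_k(query_vector, doc_vectors, k=3):
    """Return the k best (score, doc_index) pairs, highest score first."""
    scored = (
        (sum(q * d for q, d in zip(query_vector, vec)), idx)
        for idx, vec in enumerate(doc_vectors)
    )
    return heapq.nlargest(k, scored)

# Toy unit-length vectors standing in for real embeddings
toy_docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
toy_query = [1.0, 0.0]

for score, idx in top_k(toy_query, toy_docs, k=2):
    print(idx, round(score, 3))
```

Exact search scans every vector, so it is only practical for small collections; ANN indexes trade a little recall for much faster lookups.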

SIE supports 100+ models across encode, score, extract, and generate, including 50+ embedding models. Documents are encoded on your own GPU in your own AWS or GCP account, with no data sent to external APIs.

How do you choose the right embedding model?

Priority	Recommended model(s)
Best general accuracy	BGE-M3, GTE-large, E5-large-v2
Multilingual	BGE-M3, multilingual-e5-large
Fastest / smallest	BGE-small-en, all-MiniLM-L6-v2
Long documents	BGE-M3 (8,192 tokens)
Instruction-following	E5-mistral-7b-instruct, GTE-Qwen
Domain-specific	BGE-M3 + LoRA adapter

Start with BGE-M3 as a default. It handles multilingual text, long documents, and dense, sparse, and multi-vector retrieval from a single model.

Frequently asked questions

What is the difference between an embedding model and an LLM? An embedding model (usually an encoder) compresses text into a fixed-size vector; it doesn’t generate text. A generative LLM (decoder) produces text token by token. Embedding models are faster, cheaper, and purpose-built for retrieval. In a RAG pipeline, the embedding model retrieves the context and the LLM generates the answer.

Do I need a GPU to run an embedding model? For production workloads, yes. CPU inference can be 10-50× slower, which makes it impractical for encoding large corpora or serving real-time queries. SIE manages GPU provisioning on AWS or GCP.

How many tokens can an embedding model handle? It depends on the model. Many handle 512 tokens; BGE-M3 handles 8,192. For longer documents, chunk the text into overlapping segments and encode each chunk separately.
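Overlapping chunking can be sketched as a sliding window. The example below splits on whitespace as a rough stand-in for the model's own tokeniser, so real token counts will differ; the `chunk_tokens` helper and its default sizes are illustrative assumptions.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of max_tokens that share `overlap` tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the document.
        if start + max_tokens >= len(tokens):
            break
    return chunks

text = "one two three four five six seven eight nine ten"
for chunk in chunk_tokens(text.split(), max_tokens=4, overlap=1):
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one here; a larger overlap reduces the chance that a sentence relevant to a query is split across two chunks.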