How to generate embeddings with SIE

Dense embeddings are fixed-dimension float vectors that capture semantic meaning. SIE’s encode primitive converts text or images into these vectors using any of 85+ supported models. The resulting embeddings power semantic search, RAG retrieval, and recommendation systems.

from sie_sdk import SIEClient
from sie_sdk.types import Item
client = SIEClient("http://localhost:8080")
result = client.encode("BAAI/bge-m3", Item(text="Your text here"))
print(f"Dimensions: {len(result['dense'])}") # 1024

Not sure which model to use? See the Model Selection Guide or the full model catalog.


Use dense embeddings when:

  • You need semantic similarity matching, not just exact keyword matching
  • Your vector database supports ANN search (Qdrant, Weaviate, Chroma, and LanceDB all work with SIE)
  • You want a simple, fast retrieval baseline to start with

Consider a different approach when:


result = client.encode("BAAI/bge-m3", Item(text="Hello world"))
print(result["dense"][:5]) # First 5 dimensions

Pass a list of items for efficient GPU-batched processing:

items = [
    Item(text="First document"),
    Item(text="Second document"),
    Item(text="Third document"),
]
results = client.encode("BAAI/bge-m3", items)

The server batches requests automatically. You do not need to manage batch sizes manually.

# Attach an id to each item so results can be matched back to their inputs
items = [
    Item(id="doc-1", text="First document"),
    Item(id="doc-2", text="Second document"),
]
results = client.encode("BAAI/bge-m3", items)
for result in results:
    print(f"{result['id']}: {len(result['dense'])} dims")

Should I Encode Queries and Documents Differently?

Yes, for asymmetric models. Queries are short and question-like; documents are longer content. Many models are trained to distinguish these and perform better when you tell them which is which.

# Encode a search query
query = client.encode(
    "BAAI/bge-m3",
    Item(text="What is machine learning?"),
    is_query=True,
)
# Encode documents (default, no is_query flag needed)
documents = client.encode(
    "BAAI/bge-m3",
    [Item(text="Machine learning is..."), Item(text="Deep learning uses...")],
)
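Once queries and documents are encoded, retrieval comes down to comparing vectors. The following is a minimal sketch of ranking documents by cosine similarity. It uses toy 3-dimensional lists in place of the `dense` vectors returned by `encode`, and the helper names `cosine_similarity` and `rank_documents` are illustrative, not part of the SIE SDK. In practice a vector database would normally do this step for you.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for query["dense"] and [doc["dense"] for doc in documents]
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_documents(query_vec, doc_vecs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

Cosine similarity ignores vector length, so the second document scores highest even though its values are twice as large as the query's.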

For instruction-tuned models like Alibaba-NLP/gte-Qwen2-1.5B-instruct, pass an explicit instruction string to guide embedding behaviour:

result = client.encode(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    Item(text="What is Python?"),
    instruction="Represent this query for retrieving programming tutorials:",
)

By default, encode returns dense embeddings. Models that support it (such as BAAI/bge-m3) can return sparse and multi-vector outputs in a single call:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="text"),
    output_types=["dense", "sparse", "multivector"],
)
print(result["dense"])        # 1024-dim float array
print(result["sparse"])       # {"indices": [...], "values": [...]}
print(result["multivector"])  # [num_tokens, 1024] array

Not all models support all output types. BAAI/bge-m3 is the main model supporting all three. Most models support dense only.
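To illustrate how a sparse output can be used, here is a small sketch that scores a query against a document with a sparse dot product. It assumes the sparse result can be read as a mapping with parallel `indices` and `values` lists, as shown in the comment above. The function name `sparse_dot` and the sample numbers are made up for the example.

```python
from typing import Dict, List, Union

SparseVector = Dict[str, List[Union[int, float]]]


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product of two sparse vectors in {"indices": [...], "values": [...]} form."""
    # Index the first vector by token id, then sum products over shared ids only.
    weights_a = dict(zip(a["indices"], a["values"]))
    return sum(
        weights_a.get(index, 0.0) * value
        for index, value in zip(b["indices"], b["values"])
    )


# Hypothetical sparse outputs for a query and a document
query_sparse = {"indices": [3, 17, 42], "values": [0.5, 1.0, 0.25]}
doc_sparse = {"indices": [17, 42, 99], "values": [2.0, 4.0, 1.0]}

score = sparse_dot(query_sparse, doc_sparse)
print(score)  # 3.0 (1.0 * 2.0 for id 17, plus 0.25 * 4.0 for id 42)
```

Only token ids present in both vectors contribute, which is why sparse scores behave like weighted keyword overlap.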

| Field | Type | Description |
| --- | --- | --- |
| id | str or None | Item ID if provided |
| dense | NDArray[float32] | Dense embedding vector |
| sparse | SparseResult or None | Sparse indices and values |
| multivector | NDArray[float32] or None | Per-token embeddings (ColBERT) |
| timing | TimingInfo | Request timing breakdown |
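Per-token (multivector) embeddings are typically compared with ColBERT-style late interaction rather than a single dot product. The sketch below shows the usual MaxSim scoring: each query token takes its best-matching document token, and those maxima are summed. It uses toy 2-dimensional token vectors. `maxsim_score` is an illustrative helper, not an SIE function.

```python
from typing import Sequence

TokenMatrix = Sequence[Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: TokenMatrix, doc_tokens: TokenMatrix) -> float:
    """Sum, over query tokens, of the maximum similarity to any document token."""
    if not doc_tokens:
        return 0.0  # nothing to match against
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy stand-ins for result["multivector"], shaped [num_tokens, dims]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]

print(maxsim_score(query_tokens, doc_a))  # 1.5
print(maxsim_score(query_tokens, doc_b))  # 2.0
```

Here `doc_b` wins because it contains an exact match for both query tokens, while `doc_a` only partially matches the second one.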

| Model | Dims | Max Length | Notes |
| --- | --- | --- | --- |
| BAAI/bge-m3 | 1024 | 8192 | Multilingual; supports dense, sparse, multivector |
| NovaSearch/stella_en_400M_v5 | 1024 | 512 | Best English quality per GB of VRAM |
| intfloat/e5-base-v2 | 768 | 512 | Solid all-rounder |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 | Fastest and most lightweight |

See How do I choose the right model? or the model catalog.


The server defaults to msgpack for efficient numpy transport. To use plain JSON:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{"items": [{"text": "Your text here"}]}'

See the full HTTP API Reference.
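The same JSON request can be built from Python without the SDK. This is a minimal sketch using only the standard library. It mirrors the URL, headers, and body of the curl command above, and `build_encode_request` is a hypothetical helper name. The response format is not described on this page, so the sketch stops at constructing the request and leaves the network call commented out.

```python
import json
import urllib.request
from typing import List


def build_encode_request(
    base_url: str, model: str, texts: List[str]
) -> urllib.request.Request:
    """Build a POST request that asks for a JSON response instead of msgpack."""
    payload = json.dumps({"items": [{"text": t} for t in texts]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/encode/{model}",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_encode_request(
    "http://localhost:8080", "BAAI/bge-m3", ["Your text here"]
)
print(request.get_method(), request.full_url)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     body = json.loads(response.read())
```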


What is the difference between dense and sparse embeddings? Dense embeddings represent meaning as a fixed-length float vector (for example, 1024 numbers). Sparse embeddings represent text as a weighted set of vocabulary tokens, which is useful for keyword matching. Most use cases start with dense. Add sparse when you need hybrid search. See Sparse and Hybrid Search.
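One common way to combine a dense ranking with a sparse ranking in hybrid search is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not a description of how SIE or any particular vector database implements hybrid search. The document IDs and the constant `k=60` (a commonly used default) are assumptions for the example.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings of document ids into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a dense search and a sparse search over the same corpus
dense_ranking = ["doc-2", "doc-1", "doc-3"]
sparse_ranking = ["doc-1", "doc-3", "doc-2"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc-1', 'doc-2', 'doc-3']
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that dense and sparse scores live on different scales.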

What embedding dimensions should I use? Higher dimensions capture more nuance but use more memory and slow down ANN search. 384-dim models like all-MiniLM are fast but less precise. 1024-dim models like bge-m3 and stella are the standard production choice. 4096-dim models like NV-Embed-v2 give the best quality at high memory cost. Start at 1024.
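To make the memory trade-off concrete, here is a back-of-the-envelope sketch of raw vector storage at each dimension. It assumes float32 values (4 bytes each) and counts only the vectors themselves. ANN index structures and metadata add overhead on top of this, and the amount depends on the vector database.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the vectors alone, with no index overhead."""
    return num_vectors * dims * bytes_per_value


for dims in (384, 1024, 4096):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB per million vectors")

# Output:
#  384 dims: 1.43 GiB per million vectors
# 1024 dims: 3.81 GiB per million vectors
# 4096 dims: 15.26 GiB per million vectors
```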

Can SIE generate image embeddings? Yes. SIE supports multimodal models like google/siglip-so400m-patch14-384 that embed both text and images into the same vector space. See Multimodal Embeddings.

Does SIE integrate with LangChain, LlamaIndex, or Haystack? Yes. SIE has first-class integrations with LangChain, LlamaIndex, Haystack, Qdrant, Weaviate, Chroma, LanceDB, and more. See Integrations.
