How Does Qdrant Work with Embedding Models?

Qdrant is an open-source vector database that stores embedding vectors alongside payload metadata and enables fast approximate nearest neighbour (ANN) search, filtered search, and hybrid (dense + sparse) search. Qdrant does not run embedding models itself: it receives the vectors they produce (in this article, generated by SIE) and indexes them in an HNSW graph for millisecond-latency retrieval at scale.


Why Qdrant?

Qdrant is a strong default choice for production semantic search and RAG pipelines because:

  • Written in Rust — low latency, high throughput, predictable performance under load
  • Native hybrid search — combines dense vector search with sparse BM25-style search in one query
  • Multi-vector support — stores ColBERT-style token vectors for late interaction retrieval
  • Filterable ANN — filter by metadata without sacrificing recall (adaptive strategy selection)
  • Open source + cloud — run self-hosted or use Qdrant Cloud
  • Active development — among the fastest-evolving vector databases in the ecosystem

How Qdrant and SIE work together

SIE handles the encoding; Qdrant handles the storage and retrieval:

Documents → [SIE: BGE-M3] → vectors → [Qdrant: HNSW index] → stored
Query → [SIE: BGE-M3] → vector → [Qdrant: ANN search] → results

Full pipeline example:

from sie_sdk import SIEClient
from sie_sdk.types import Item
from qdrant_client import QdrantClient
from qdrant_client.models import (
    DatetimeRange, Distance, FieldCondition, Filter, PointStruct, VectorParams,
)

# document_chunks, sources, dates and user_query are supplied by your application.
sie = SIEClient("http://localhost:8080")
qdrant = QdrantClient("http://localhost:6333")

# 1. Create collection (1024 is the dense output size of BGE-M3)
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# 2. Encode and index documents
encode_results = sie.encode("BAAI/bge-m3", [Item(text=c) for c in document_chunks])
vectors = [r["dense"] for r in encode_results]
qdrant.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=v.tolist(),
            payload={"text": chunk, "source": source, "date": date},
        )
        for i, (v, chunk, source, date) in enumerate(zip(vectors, document_chunks, sources, dates))
    ],
)

# 3. Search, keeping only points dated 2024-01-01 or later.
# query_points supersedes the deprecated search method in recent qdrant-client releases.
query_vector = sie.encode("BAAI/bge-m3", Item(text=user_query), is_query=True)["dense"]
results = qdrant.query_points(
    collection_name="documents",
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="date", range=DatetimeRange(gte="2024-01-01T00:00:00Z"))]
    ),
    limit=20,
).points
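
To make the search step concrete, the sketch below shows the exact computation that an ANN index approximates: score every stored vector against the query by cosine similarity and keep the best k. It uses only the standard library; the function names and the toy vectors are illustrative assumptions, not part of Qdrant or SIE.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, stored, k):
    """Score every stored vector against the query and return the k best ids."""
    scored = [(cosine_similarity(query, vec), point_id) for point_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [point_id for _, point_id in scored[:k]]


# Hypothetical 3-dimensional vectors keyed by point id.
toy_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.0, 0.0, 1.0],
}
nearest = exact_top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print(nearest)
```

This exhaustive scan costs one similarity computation per stored vector, which is why a graph index such as HNSW is used once the collection grows.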

Hybrid search with Qdrant and SIE

BGE-M3 produces both dense and sparse vectors. Qdrant’s hybrid search combines them:

from sie_sdk.types import Item
from qdrant_client.models import Fusion, FusionQuery, Prefetch, SparseVector

# Assumes the collection defines a named dense vector "dense" and a named sparse
# vector "sparse"; the single unnamed vector from the first example is not enough.

# Encode with both dense and sparse outputs
query_result = sie.encode(
    "BAAI/bge-m3",
    Item(text=user_query),
    output_types=["dense", "sparse"],
    is_query=True,
)
sparse = query_result["sparse"]

# Retrieve candidates from each index, then fuse the two ranked lists
results = qdrant.query_points(
    collection_name="documents",
    prefetch=[
        # Dense retrieval (converted to a plain list, as in the indexing example)
        Prefetch(query=query_result["dense"].tolist(), using="dense", limit=50),
        # Sparse retrieval
        Prefetch(
            query=SparseVector(indices=sparse["indices"], values=sparse["values"]),
            using="sparse",
            limit=50,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=20,
)
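
Reciprocal Rank Fusion merges the dense and sparse result lists using ranks only, so the two score scales never need to be calibrated against each other. The following is a minimal sketch of the idea in plain Python. The constant k=60 is the conventional value from the RRF literature and is an assumption here; it is not necessarily the constant Qdrant uses internally, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked id lists: each list adds 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical result lists, best hit first.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear in both lists accumulate two contributions, so they tend to rise above documents that only one retriever found.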

Multi-vector (ColBERT) with Qdrant

Qdrant supports multi-vector storage for ColBERT-style late interaction retrieval:

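from sie_sdk.types import Item
from qdrant_client.models import (
    Distance, MultiVectorComparator, MultiVectorConfig, PointStruct, VectorParams,
)

# Create collection with multi-vector support
qdrant.create_collection(
    collection_name="documents_colbert",
    vectors_config={
        "colbert": VectorParams(
            # Must equal the per-token vector size the encoder returns;
            # check the value for the model you use.
            size=128,
            distance=Distance.COSINE,
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
        )
    },
)

# Encode ColBERT token vectors
colbert_results = sie.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in documents],
    output_types=["multivector"],
)
colbert_mvs = [r["multivector"] for r in colbert_results]

# Upsert token vectors per document: one list of token vectors per point.
# Assumes each multivector is a 2-D array, like the dense vectors in the first example.
qdrant.upsert(
    collection_name="documents_colbert",
    points=[
        PointStruct(id=i, vector={"colbert": mv.tolist()}, payload={"text": d})
        for i, (mv, d) in enumerate(zip(colbert_mvs, documents))
    ],
)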
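
The MAX_SIM comparator corresponds to late-interaction scoring: each query token is matched with its most similar document token, and those per-token maxima are summed. A minimal sketch in plain Python, assuming unit-normalised token vectors so that the dot product equals cosine similarity; the helper names and the two-dimensional toy vectors are illustrative only.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical unit-length token vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_one = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # covers both query tokens exactly
doc_two = [[1.0, 0.0], [0.6, 0.8]]              # covers the second token only partly

score_one = max_sim(query_tokens, doc_one)
score_two = max_sim(query_tokens, doc_two)
print(score_one, score_two)
```

Because every query token is scored against every document token, storage and compute grow with document length, which is the trade-off for the finer-grained matching.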

Qdrant configuration for production

Key settings to tune for production deployments:

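from qdrant_client.models import (
    Distance, HnswConfigDiff, ScalarQuantization, ScalarQuantizationConfig,
    ScalarType, SearchParams, VectorParams,
)

# Collection with tuned HNSW parameters
qdrant.create_collection(
    collection_name="production",
    vectors_config=VectorParams(
        size=1024,
        distance=Distance.COSINE,
        # m: links per graph node; ef_construct: candidate list size while building
        hnsw_config=HnswConfigDiff(m=16, ef_construct=128),
        # INT8 scalar quantisation; quantile=0.99 clips the most extreme 1% of values
        quantization_config=ScalarQuantization(
            scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99)
        ),
    ),
)

# Set search ef at query time (higher hnsw_ef trades latency for recall)
results = qdrant.query_points(
    collection_name="production",
    query=query_vector,
    search_params=SearchParams(hnsw_ef=128, exact=False),
    limit=20,
).points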

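Scalar quantisation to INT8 stores each dimension in 1 byte instead of the 4 bytes of a float32, which reduces vector memory by roughly 4× with typically small recall loss. It is recommended for large corpora.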
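
The roughly 4× figure follows from simple arithmetic on bytes per dimension. The sketch below works it through for a hypothetical corpus of 10 million 1024-dimensional vectors; it counts raw vector storage only and ignores HNSW graph links and payload. Whether the original float32 vectors also stay in RAM depends on how the collection is configured.

```python
def vector_memory_bytes(num_vectors, dims, bytes_per_dim):
    """Raw vector storage only; index links and payload are not counted."""
    return num_vectors * dims * bytes_per_dim


num_vectors = 10_000_000  # hypothetical corpus size
dims = 1024               # same vector size as the collections in this article

float32_bytes = vector_memory_bytes(num_vectors, dims, 4)  # 4 bytes per float32
int8_bytes = vector_memory_bytes(num_vectors, dims, 1)     # 1 byte per int8

print(f"float32: {float32_bytes / 1e9:.2f} GB")
print(f"int8:    {int8_bytes / 1e9:.2f} GB")
print(f"ratio:   {float32_bytes / int8_bytes:.1f}x")
```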


Frequently asked questions

Does Qdrant support real-time updates? Yes. Qdrant’s HNSW index supports incremental inserts and deletes. New vectors are immediately searchable after insertion.

What is Qdrant’s payload filtering performance like? Qdrant uses an adaptive strategy that selects between pre-filtering and post-filtering based on filter selectivity. This typically maintains 95%+ recall even with highly selective filters.
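
The trade-off between the two strategies can be seen in a toy sketch. This is an illustration of the general idea only, not Qdrant's implementation: the function names, the oversampling factor, and the selectivity threshold are all hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pre_filter_search(points, query, predicate, k):
    """Filter first, then rank only the matching points (exact on the subset)."""
    matching = [p for p in points if predicate(p["payload"])]
    matching.sort(key=lambda p: dot(query, p["vector"]), reverse=True)
    return [p["id"] for p in matching[:k]]


def post_filter_search(points, query, predicate, k, oversample=2):
    """Rank first, keep the top k * oversample candidates, then drop non-matches."""
    ranked = sorted(points, key=lambda p: dot(query, p["vector"]), reverse=True)
    candidates = ranked[: k * oversample]
    return [p["id"] for p in candidates if predicate(p["payload"])][:k]


def adaptive_search(points, query, predicate, k, selectivity_threshold=0.3):
    """Choose a strategy from the fraction of points that pass the filter."""
    fraction = sum(1 for p in points if predicate(p["payload"])) / len(points)
    if fraction < selectivity_threshold:
        return pre_filter_search(points, query, predicate, k)
    return post_filter_search(points, query, predicate, k)


# Eight toy points; similarity to the query falls as the id rises.
points = [
    {"id": i, "vector": [1.0 - 0.1 * i, 0.1 * i], "payload": {"year": 2020 if i >= 6 else 2024}}
    for i in range(8)
]
query = [1.0, 0.0]


def is_2020(payload):
    return payload["year"] == 2020


def is_2024(payload):
    return payload["year"] == 2024


print(post_filter_search(points, query, is_2020, k=2))  # selective filter, candidates all rejected
print(adaptive_search(points, query, is_2020, k=2))     # switches to pre-filtering
print(adaptive_search(points, query, is_2024, k=2))     # broad filter, post-filtering suffices
```

With a selective filter, post-filtering can discard every candidate it retrieved, while pre-filtering still finds the matching points; with a broad filter, post-filtering avoids scanning the whole matching subset.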

Can I run Qdrant alongside SIE on the same infrastructure? Yes. Both can run in the same Kubernetes cluster. SIE handles the GPU workloads; Qdrant runs on CPU nodes. They communicate over the cluster’s internal network.

