LanceDB
The sie-lancedb package provides LanceDB-native embedding functions, a reranker for hybrid search, and an entity extractor for table enrichment (Python). Embeddings are computed automatically on table.add() and table.search() - no manual encoding needed.
How it works: You use SIE as an embedding function with LanceDB’s schema helpers. LanceDB handles the rest - calling SIE on insert and query, and persisting the embedding config in table metadata.
Installation
```shell
pip install sie-lancedb
```

This installs sie-sdk, lancedb (v0.17+), pylance, and pyarrow as dependencies.
For JavaScript/TypeScript projects, install the corresponding packages with pnpm:

```shell
pnpm add @superlinked/sie-lancedb @lancedb/lancedb
```

Start the Server
```shell
# SIE server
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:default

# Or with GPU
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:default
```

Embedding Function
SIEEmbeddingFunction is registered as "sie" in LanceDB’s embedding function registry. Define your schema once, and embeddings are computed automatically on insert and search.
```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
import sie_lancedb  # registers "sie" and "sie-multivector"

sie = get_registry().get("sie").create(
    model="BAAI/bge-m3",
    base_url="http://localhost:8080",
)

class Documents(LanceModel):
    text: str = sie.SourceField()
    vector: Vector(sie.ndims()) = sie.VectorField()

db = lancedb.connect("~/.lancedb")
table = db.create_table("docs", schema=Documents, mode="overwrite")

# Embeddings computed automatically
table.add([
    {"text": "Machine learning is a subset of AI."},
    {"text": "Neural networks use multiple layers."},
    {"text": "Python is popular for ML development."},
])

# Query embedding computed automatically
results = table.search("What is deep learning?").limit(3).to_list()
for r in results:
    print(r["text"])
```

Any model SIE supports works - just change the `model` parameter:
```python
sie = get_registry().get("sie").create(model="NovaSearch/stella_en_400M_v5")
sie = get_registry().get("sie").create(model="nomic-ai/nomic-embed-text-v2-moe")
```

See the Model Catalog for all 85+ supported models.
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | `str` | `http://localhost:8080` | SIE server URL |
| `model` | `str` | `BAAI/bge-m3` | Model to use for embeddings (catalog) |
| `instruction` | `str` | `None` | Instruction prefix for instruction-tuned models (e.g., E5) |
| `output_dtype` | `str` | `None` | Output data type (`float32`, `float16`, `int8`, `binary`) |
| `gpu` | `str` | `None` | Target GPU type for routing |
| `options` | `dict` | `None` | Model-specific options |
| `timeout_s` | `float` | `180.0` | Request timeout in seconds |
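To see how the optional parameters above interact, here is an illustrative sketch of collecting them into a single request payload. The helper and its field names are assumptions made up for illustration, not the actual sie-sdk wire format:

```python
def build_embed_request(texts, model="BAAI/bge-m3", instruction=None,
                        output_dtype=None, gpu=None, options=None):
    """Hypothetical helper: gather the table's parameters into one payload."""
    payload = {"model": model, "input": list(texts)}
    # Optional parameters are only included when set, mirroring the None defaults
    if instruction is not None:
        payload["instruction"] = instruction
    if output_dtype is not None:
        payload["output_dtype"] = output_dtype
    if gpu is not None:
        payload["gpu"] = gpu
    if options is not None:
        payload["options"] = options
    return payload

req = build_embed_request(["what is ML?"],
                          instruction="query: ", output_dtype="float16")
```

Unset parameters are simply omitted, so the server's defaults apply unless you override them.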
Hybrid Search with Reranker
SIEReranker plugs into LanceDB’s hybrid search pipeline. It uses SIE’s cross-encoder score() to rerank combined vector + full-text search results.
```python
from sie_lancedb import SIEReranker

# Create FTS index for hybrid search
table.create_fts_index("text", replace=True)

# Hybrid search with SIE reranking
results = (
    table.search("What is deep learning?", query_type="hybrid")
    .rerank(SIEReranker(model="jinaai/jina-reranker-v2-base-multilingual"))
    .limit(5)
    .to_list()
)

for r in results:
    print(f"{r['_relevance_score']:.3f} {r['text']}")
```

The reranker also works with pure vector or pure FTS search via `.rerank()`.
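Conceptually, a cross-encoder reranker scores each (query, document) pair jointly and reorders candidates by that score. A minimal sketch, with a toy word-overlap scorer standing in for SIE's cross-encoder:

```python
def toy_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, docs, scorer=toy_score, limit=5):
    # Score each (query, doc) pair, then sort best-first
    return sorted(docs, key=lambda doc: scorer(query, doc), reverse=True)[:limit]

ranked = rerank("deep learning",
                ["Cats sleep a lot", "deep learning basics", "cooking tips"])
# ranked[0] == "deep learning basics"
```

In the real pipeline, SIEReranker sends the pairs to the SIE server's cross-encoder instead of this toy scorer; the reordering step is the same.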
Entity Extraction
SIEExtractor adds entity extraction to LanceDB’s data enrichment workflows. Extract entities from a text column and merge the results back as a structured Arrow column - enabling filtered search on extracted entities.
```python
from sie_lancedb import SIEExtractor

extractor = SIEExtractor(
    base_url="http://localhost:8080",
    model="urchade/gliner_multi-v2.1",
)

# Enrich the table: reads text, extracts entities, merges back
extractor.enrich_table(
    table,
    source_column="text",
    target_column="entities",
    labels=["person", "technology", "organization"],
    id_column="id",
)
```

The entities column stores structured Arrow data (`list<struct<text, label, score, start, end, bbox>>`) so you can filter on extracted entities in queries.
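To illustrate what filtering on that structure looks like, here is a plain-Python post-filter over rows shaped like the entities column. The sample rows and the helper are made up for the example:

```python
# Rows as they might look after enrichment (scores are illustrative)
rows = [
    {"text": "Tim Cook leads Apple Inc.",
     "entities": [{"text": "Tim Cook", "label": "person", "score": 0.98},
                  {"text": "Apple Inc.", "label": "organization", "score": 0.97}]},
    {"text": "Python is popular for ML development.",
     "entities": [{"text": "Python", "label": "technology", "score": 0.95}]},
]

def rows_with_label(rows, label, min_score=0.5):
    """Keep rows containing at least one entity with the given label."""
    return [r for r in rows
            if any(e["label"] == label and e["score"] >= min_score
                   for e in r["entities"])]

people_rows = rows_with_label(rows, "person")
# people_rows contains only the "Tim Cook leads Apple Inc." row
```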
For manual control, use extract() directly:
```python
entities = extractor.extract(
    ["Tim Cook leads Apple Inc.", "Elon Musk founded SpaceX."],
    labels=["person", "organization"],
)
# [[{"text": "Tim Cook", "label": "person", "score": 0.98, ...}, ...], ...]
```

Multi-Vector (ColBERT)
SIEMultiVectorEmbeddingFunction (registered as "sie-multivector") works with LanceDB’s native MultiVector type and MaxSim scoring for ColBERT and ColPali models.
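MaxSim scores a query against a document by taking, for each query token vector, the similarity to its best-matching document token vector, and summing those maxima. An illustrative NumPy sketch of the scoring rule (not LanceDB's internal implementation):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim over L2-normalized token embeddings."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # (n_query_tokens, n_doc_tokens) cosine similarities
    # Best doc token per query token, summed over query tokens
    return float(sim.max(axis=1).sum())

query = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 query token vectors
doc = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # 3 doc token vectors
score = maxsim(query, doc)  # each query token best matches its aligned doc token
```

Because every query token picks its own best document token, MaxSim rewards documents that cover all parts of the query, which is what makes ColBERT-style retrieval more fine-grained than single-vector similarity.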
```python
from lancedb.pydantic import MultiVector

sie_colbert = get_registry().get("sie-multivector").create(
    model="jinaai/jina-colbert-v2",
    base_url="http://localhost:8080",
)

class ColBERTDocs(LanceModel):
    text: str = sie_colbert.SourceField()
    vector: MultiVector(sie_colbert.ndims()) = sie_colbert.VectorField()

table = db.create_table("colbert_docs", schema=ColBERTDocs, mode="overwrite")
table.add([{"text": "Machine learning is a subset of AI."}])

# MaxSim search - query and document multi-vectors are compared token-by-token
results = table.search("What is ML?").limit(5).to_list()
```

What’s Next
- Encode Text - embedding API details and output types
- Score / Rerank - cross-encoder reranking
- Extract - entity extraction API
- Model Catalog - all supported models
- Integrations - all supported frameworks and vector stores