
LanceDB

The sie-lancedb package provides LanceDB-native embedding functions, a reranker for hybrid search, and an entity extractor for table enrichment (Python). Embeddings are computed automatically on table.add() and table.search(), so no manual encoding is needed.

How it works: You use SIE as an embedding function with LanceDB’s schema helpers. LanceDB handles the rest - calling SIE on insert and query, and persisting the embedding config in table metadata.

```sh
pip install sie-lancedb
```

This installs sie-sdk, lancedb (v0.17+), pylance, and pyarrow as dependencies.

```sh
# SIE server
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:default
# Or with GPU
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:default
```

SIEEmbeddingFunction is registered as "sie" in LanceDB’s embedding function registry. Define your schema once, and embeddings are computed automatically on insert and search.

```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

import sie_lancedb  # registers "sie" and "sie-multivector"

sie = get_registry().get("sie").create(
    model="BAAI/bge-m3",
    base_url="http://localhost:8080",
)

class Documents(LanceModel):
    text: str = sie.SourceField()
    vector: Vector(sie.ndims()) = sie.VectorField()

db = lancedb.connect("~/.lancedb")
table = db.create_table("docs", schema=Documents, mode="overwrite")

# Embeddings computed automatically
table.add([
    {"text": "Machine learning is a subset of AI."},
    {"text": "Neural networks use multiple layers."},
    {"text": "Python is popular for ML development."},
])

# Query embedding computed automatically
results = table.search("What is deep learning?").limit(3).to_list()
for r in results:
    print(r["text"])
```

Any model SIE supports works; just change the model parameter:

```python
sie = get_registry().get("sie").create(model="NovaSearch/stella_en_400M_v5")
sie = get_registry().get("sie").create(model="nomic-ai/nomic-embed-text-v2-moe")
```

See the Model Catalog for all 85+ supported models.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `base_url` | str | `http://localhost:8080` | SIE server URL |
| `model` | str | `BAAI/bge-m3` | Model to use for embeddings (catalog) |
| `instruction` | str | None | Instruction prefix for instruction-tuned models (e.g., E5) |
| `output_dtype` | str | None | Output data type (`float32`, `float16`, `int8`, `binary`) |
| `gpu` | str | None | Target GPU type for routing |
| `options` | dict | None | Model-specific options |
| `timeout_s` | float | 180.0 | Request timeout in seconds |
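If you construct embedding functions in several places, it can help to keep these settings in one typed object. A hypothetical config holder mirroring the defaults above (this dataclass is not part of sie-lancedb; only the create() parameters are):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SIEEmbeddingConfig:
    """Hypothetical helper mirroring the create() parameters and defaults."""
    base_url: str = "http://localhost:8080"
    model: str = "BAAI/bge-m3"
    instruction: Optional[str] = None     # e.g. "query: " for E5-style models
    output_dtype: Optional[str] = None    # "float32", "float16", "int8", "binary"
    gpu: Optional[str] = None             # target GPU type for routing
    options: Optional[dict] = None        # model-specific options
    timeout_s: float = 180.0

# Override only what differs from the defaults.
cfg = SIEEmbeddingConfig(model="intfloat/multilingual-e5-large", instruction="query: ")
```

The fields could then be splatted into `create()` via `dataclasses.asdict(cfg)`.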

SIEReranker plugs into LanceDB’s hybrid search pipeline. It uses SIE’s cross-encoder score() to rerank combined vector + full-text search results.

```python
from sie_lancedb import SIEReranker

# Create FTS index for hybrid search
table.create_fts_index("text", replace=True)

# Hybrid search with SIE reranking
results = (
    table.search("What is deep learning?", query_type="hybrid")
    .rerank(SIEReranker(model="jinaai/jina-reranker-v2-base-multilingual"))
    .limit(5)
    .to_list()
)
for r in results:
    print(f"{r['_relevance_score']:.3f} {r['text']}")
```

The reranker also works with pure vector or pure FTS search via .rerank().
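Conceptually, the reranking step scores each (query, candidate) pair and re-sorts by that score. A toy sketch with a stand-in lexical-overlap scorer (a real cross-encoder such as jina-reranker-v2 produces learned relevance scores instead):

```python
def rerank(query, candidates, score_fn, limit=5):
    # Score each (query, text) pair, then sort by descending relevance.
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

def overlap(query, text):
    # Stand-in scorer: fraction of query tokens present in the candidate.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

candidates = ["deep nets", "cats", "deep learning basics"]
results = rerank("deep learning", candidates, overlap)
# Best candidate first: ("deep learning basics" shares both query tokens)
```

The shape is the point: whatever search produced the candidates (vector, FTS, or hybrid), reranking is a pure reorder of that list.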

SIEExtractor adds entity extraction to LanceDB’s data enrichment workflows. Extract entities from a text column and merge the results back as a structured Arrow column - enabling filtered search on extracted entities.

```python
from sie_lancedb import SIEExtractor

extractor = SIEExtractor(
    base_url="http://localhost:8080",
    model="urchade/gliner_multi-v2.1",
)

# Enrich the table: reads text, extracts entities, merges back
extractor.enrich_table(
    table,
    source_column="text",
    target_column="entities",
    labels=["person", "technology", "organization"],
    id_column="id",
)
```

The entities column stores structured Arrow data (list<struct<text, label, score, start, end, bbox>>) so you can filter on extracted entities in queries.

For manual control, use extract() directly:

```python
entities = extractor.extract(
    ["Tim Cook leads Apple Inc.", "Elon Musk founded SpaceX."],
    labels=["person", "organization"],
)
# [[{"text": "Tim Cook", "label": "person", "score": 0.98, ...}, ...], ...]
```
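That return shape (one list of entity dicts per input text) lends itself to simple post-filtering. A sketch over data shaped like the example output above (the filter_entities helper is hypothetical, and the scores are illustrative):

```python
# Data shaped like extract()'s return value: one entity list per input text.
entities = [
    [{"text": "Tim Cook", "label": "person", "score": 0.98},
     {"text": "Apple Inc.", "label": "organization", "score": 0.95}],
    [{"text": "Elon Musk", "label": "person", "score": 0.97},
     {"text": "SpaceX", "label": "organization", "score": 0.96}],
]

def filter_entities(batches, label, min_score=0.9):
    # Flatten all batches, keeping entities of one label above a threshold.
    return [e["text"] for batch in batches for e in batch
            if e["label"] == label and e["score"] >= min_score]

orgs = filter_entities(entities, "organization")
# ['Apple Inc.', 'SpaceX']
```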

SIEMultiVectorEmbeddingFunction (registered as "sie-multivector") works with LanceDB’s native MultiVector type and MaxSim scoring for ColBERT and ColPali models.

```python
from lancedb.pydantic import MultiVector

sie_colbert = get_registry().get("sie-multivector").create(
    model="jinaai/jina-colbert-v2",
    base_url="http://localhost:8080",
)

class ColBERTDocs(LanceModel):
    text: str = sie_colbert.SourceField()
    vector: MultiVector(sie_colbert.ndims()) = sie_colbert.VectorField()

table = db.create_table("colbert_docs", schema=ColBERTDocs, mode="overwrite")
table.add([{"text": "Machine learning is a subset of AI."}])

# MaxSim search - query and document multi-vectors are compared token-by-token
results = table.search("What is ML?").limit(5).to_list()
```
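MaxSim itself is simple to state: for each query token vector, take the best dot product against the document's token vectors, then sum over query tokens. A dependency-free sketch (toy 2-dimensional vectors, not real ColBERT embeddings):

```python
def maxsim(query_vecs, doc_vecs):
    # For each query token, the max dot product over all document tokens;
    # the document score is the sum of those per-token maxima.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token vectors
doc = [[0.9, 0.1], [0.2, 0.8]]     # two document token vectors
score = maxsim(query, doc)         # 0.9 + 0.8
```

LanceDB performs this scoring natively over the MultiVector column; the sketch just shows why each query token independently "picks" its best-matching document token.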
