Bundles

Python ML libraries often have conflicting dependency requirements. Models using trust_remote_code=True or specialized backends can pin incompatible versions of transformers, torch, or sglang. SIE solves this with bundles: each bundle is a self-contained environment with compatible dependencies, built into its own Docker image.

Each bundle is a YAML file under packages/sie_server/bundles/ that lists the adapters it enables and the pinned dependency versions needed by those adapters. At build time, the Dockerfile selects one bundle via the BUNDLE build arg and installs only that bundle’s deps.


Two bundles are published to GHCR today. The default bundle covers every supported model except those that need the SGLang runtime.

| Bundle | Purpose | Key Models |
| --- | --- | --- |
| default | Everything that runs on standard transformers + Flash Attention | BGE-M3, E5, Stella, Qwen3, GritLM, NV-Embed, ColBERT, ColPali, ColQwen2, GLiNER, GLiREL, GLiClass, Florence-2, Donut, CLIP, SigLIP, Grounding DINO, OwlV2, SPLADE, and more |
| sglang | Large LLM embeddings (4B+ params) served through the SGLang backend | gte-Qwen2-7B, Qwen3-Embedding-4B, E5-Mistral-7B, Linq-Embed-Mistral, SFR-Embedding-Mistral, SFR-Embedding-2_R, llama-embed-nemotron-8b |

The previous gliner and florence2 bundles no longer exist; their adapters and dependencies were folded into default once the underlying version conflicts were resolved.

There is also an experimental transformers5 bundle in the repo for adapters that require transformers>=5.0 (currently LightOnOCR). It is not published to GHCR and is intended for local builds.
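
A local build of an unpublished bundle might look like the sketch below. Only the BUNDLE build arg comes from the text above; the Dockerfile path is a placeholder, the image name is arbitrary, and the real Dockerfile may require further build args (for example, one selecting the platform). The script only assembles and prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Assumptions (not confirmed by the docs): the command is run from the
# repository root, DOCKERFILE is a placeholder path, and the image name
# "sie-server:local-<bundle>" is arbitrary.
BUNDLE="${BUNDLE:-transformers5}"
DOCKERFILE="${DOCKERFILE:-path/to/Dockerfile}"

# BUNDLE is the build arg the Dockerfile uses to select one bundle.
BUILD_CMD="docker build --build-arg BUNDLE=${BUNDLE} -f ${DOCKERFILE} -t sie-server:local-${BUNDLE} ."

# Print the command for review instead of running it directly.
echo "$BUILD_CMD"
```

Once the Dockerfile path is filled in, the printed command can be pasted into a shell or executed with sh -c "$BUILD_CMD".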


The default bundle is the broad, general-purpose image. It ships standard transformers, sentence-transformers, Flash Attention, and the NER and vision adapters.

Included adapter families:

  • Dense encoders: BERT (flash), ModernBERT (flash), BGE-M3, Qwen2, XLM-RoBERTa, Nomic, GTE, Stella, sentence-transformers, PyTorch embedding
  • Cross-encoders / rerankers: BERT, ModernBERT, Qwen2, Jina (flash), NLI classification
  • Multi-vector / late-interaction: ColBERT, ColBERT + ModernBERT, ColBERT + rotary, ColPali, ColQwen2, NeMo ColEmbed
  • Sparse: SPLADE (flash), GTE sparse (flash)
  • Vision and vision-language: CLIP, SigLIP, Grounding DINO, OwlV2, Florence-2, Donut
  • Zero-shot NER / extraction: GLiNER, GLiREL, GLiClass

The sglang bundle is a minimal image wired to the SGLang runtime for large LLM embeddings. Use it when you want memory-efficient serving of 4B+ parameter embedding models.

Included models:

  • Alibaba-NLP/gte-Qwen2-7B-instruct
  • Qwen/Qwen3-Embedding-4B
  • intfloat/e5-mistral-7b-instruct
  • Linq-AI-Research/Linq-Embed-Mistral
  • Salesforce/SFR-Embedding-Mistral, Salesforce/SFR-Embedding-2_R
  • nvidia/llama-embed-nemotron-8b

Each bundle is published for three platforms: cpu, cuda11, cuda12. The image tag format is {version}-{platform}-{bundle}, with a floating latest-{platform}-{bundle} tag that tracks the most recent release.

# Default bundle (CPU)
docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default
# Default bundle (CUDA 12, recommended for GPU)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
# SGLang bundle (CUDA 12)
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-sglang
# Pin to a specific release
docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:v0.2.0-cuda12-default

CUDA 11 variants (latest-cuda11-default, latest-cuda11-sglang) are published for older NVIDIA drivers.
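
For scripts that pick an image dynamically, the tag format above can be composed with a small helper. This is a sketch only: sie_image is a hypothetical function, not part of SIE, and the allowed values are just the platforms and published bundles listed on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of SIE): compose an image reference from
# the {version}-{platform}-{bundle} tag format.
sie_image() {
    version="$1"   # "latest" or a release such as "v0.2.0"
    platform="$2"  # cpu, cuda11, or cuda12
    bundle="$3"    # default or sglang

    # Reject platforms that this page does not list as published.
    case "$platform" in
        cpu|cuda11|cuda12) ;;
        *) echo "unknown platform: $platform" >&2; return 1 ;;
    esac

    # Reject bundles that are not published to GHCR.
    case "$bundle" in
        default|sglang) ;;
        *) echo "unpublished bundle: $bundle" >&2; return 1 ;;
    esac

    echo "ghcr.io/superlinked/sie-server:${version}-${platform}-${bundle}"
}

# Print the reference for the CUDA 12 SGLang image.
sie_image latest cuda12 sglang
```

The result can be passed straight to docker run, for example: docker run --gpus all -p 8080:8080 "$(sie_image latest cuda12 default)".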


Choose a bundle based on the models you need:

  1. Start with default. It covers dense, sparse, multi-vector, cross-encoder, vision, and extraction models, which together account for the overwhelming majority of use cases.
  2. Use sglang when you need to serve large LLM embedding models (4B+ params) with the SGLang backend. Run it as a second container alongside default and route requests by model name.
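
Routing by model name between the two containers could be sketched as follows. The host ports are assumptions (default on 8080, sglang mapped to 8081 with -p 8081:8080), route_model is a hypothetical helper, and the model list is copied from the sglang bundle's included models above. Anything not on that list falls through to default.

```shell
#!/bin/sh
# Assumed endpoints; adjust to wherever the two containers are reachable.
DEFAULT_URL="http://localhost:8080"
SGLANG_URL="http://localhost:8081"

# Models served by the sglang bundle, one per line.
SGLANG_MODELS="Alibaba-NLP/gte-Qwen2-7B-instruct
Qwen/Qwen3-Embedding-4B
intfloat/e5-mistral-7b-instruct
Linq-AI-Research/Linq-Embed-Mistral
Salesforce/SFR-Embedding-Mistral
Salesforce/SFR-Embedding-2_R
nvidia/llama-embed-nemotron-8b"

# Hypothetical helper: print the base URL of the container for a model.
route_model() {
    # -F: fixed string, -x: match the whole line, -q: exit status only.
    if printf '%s\n' "$SGLANG_MODELS" | grep -Fqx -- "$1"; then
        echo "$SGLANG_URL"
    else
        echo "$DEFAULT_URL"
    fi
}

# Example lookup for one of the sglang-bundle models.
route_model "Qwen/Qwen3-Embedding-4B"
```

Matching is exact and case-sensitive, so a model name with different casing falls through to the default container. A production router would more likely live in a gateway or client library, but the lookup logic is the same.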

Models are loaded on first request. The bundle only determines which models are available inside a given image.

