---
title: Model Adapters
description: Thin wrappers that connect model families to the inference engine.
canonical_url: https://superlinked.com/docs/engine/adapters
last_updated: 2026-05-25
---

Adapters are thin wrappers that connect model families to the inference engine. Each adapter implements a standard protocol for loading, unloading, and running inference. This enables SIE to support 80+ models with consistent behavior.

## What Are Adapters

Source: [packages/sie_server/src/sie_server/adapters/base.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/base.py)

An adapter wraps a specific model architecture or library. It handles:

- **Loading** model weights onto a device (CPU, CUDA, MPS)
- **Inference** via `encode()`, `score()`, or `extract()` methods
- **Unloading** with proper memory cleanup

One adapter can serve many models. For example, `SentenceTransformerDenseAdapter` works with all-MiniLM, E5, BGE, and hundreds of other compatible models.

## Adapter Protocol

Every adapter exposes the same core lifecycle:
- **Capabilities** to declare input/output support
- **Dimensions** for output shapes
- **Load/Unload** for device placement and cleanup
- **Encode/Score/Extract** for inference
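
The lifecycle above could be expressed as a `typing.Protocol`. This is a minimal sketch, not the definition in `base.py`: the names `AdapterProtocol` and `EchoAdapter` and every method signature here are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class AdapterProtocol(Protocol):
    """Hypothetical shape of the adapter lifecycle (not the real base class)."""

    def load(self, device: str) -> None: ...
    def unload(self) -> None: ...
    def encode(self, texts: list[str]) -> list[list[float]]: ...


class EchoAdapter:
    """Toy adapter: 'embeds' each text as [character count, word count]."""

    def __init__(self) -> None:
        self._device: str | None = None

    def load(self, device: str) -> None:
        self._device = device  # a real adapter would move weights here

    def unload(self) -> None:
        self._device = None  # a real adapter would free memory here

    def encode(self, texts: list[str]) -> list[list[float]]:
        if self._device is None:
            raise RuntimeError("adapter is not loaded")
        return [[float(len(t)), float(len(t.split()))] for t in texts]


adapter = EchoAdapter()
adapter.load("cpu")
vectors = adapter.encode(["hello world"])
adapter.unload()
```

Because the protocol is structural, `EchoAdapter` satisfies it without inheriting from anything, which is one way a single calling convention can cover many model families.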

### Capabilities

Source: [packages/sie_server/src/sie_server/adapters/base.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/base.py)

Each adapter declares its capabilities:

| Field | Type | Description |
|-------|------|-------------|
| `inputs` | `list[str]` | Supported input modalities: `"text"`, `"image"`, `"audio"` |
| `outputs` | `list[str]` | Output types: `"dense"`, `"sparse"`, `"multivector"` |
| `can_score` | `bool` | Supports reranking via `score()` |
| `can_extract` | `bool` | Supports extraction via `extract()` |

Capabilities are static metadata, declared in the model config and in the adapter implementation.
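
A minimal sketch of how the four fields in the table might be modeled and checked before a request is dispatched. The `Capabilities` dataclass and the `check_request` helper are hypothetical names, not the types defined in `base.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical container mirroring the four fields in the table."""

    inputs: list[str]
    outputs: list[str]
    can_score: bool = False
    can_extract: bool = False


def check_request(caps: Capabilities, modality: str, output: str) -> None:
    """Reject a request the adapter has not declared support for."""
    if modality not in caps.inputs:
        raise ValueError(f"unsupported input modality: {modality!r}")
    if output not in caps.outputs:
        raise ValueError(f"unsupported output type: {output!r}")


dense_text = Capabilities(inputs=["text"], outputs=["dense"])
check_request(dense_text, "text", "dense")  # passes silently
```

Keeping the declaration immutable fits the idea that capabilities are static: they can be read before any weights are loaded.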

### Dimensions

Adapters report output dimensions for validation and client usage:

| Field | Description |
|-------|-------------|
| `dense` | Dense vector dimensionality (e.g., 1024) |
| `sparse` | Vocabulary size for sparse vectors |
| `multivector` | Per-token embedding dimension |
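
As an illustration of the validation use case, the sketch below checks a dense vector against a reported dimensionality. `Dimensions` and `validate_dense` are hypothetical names, and treating an unsupported output as `None` is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    """Hypothetical record of output sizes; None means 'not produced'."""

    dense: int | None = None
    sparse: int | None = None
    multivector: int | None = None


def validate_dense(vector: list[float], dims: Dimensions) -> None:
    """Raise if a dense vector does not match the reported dimensionality."""
    if dims.dense is None:
        raise ValueError("adapter does not produce dense vectors")
    if len(vector) != dims.dense:
        raise ValueError(f"expected {dims.dense} values, got {len(vector)}")


dims = Dimensions(dense=4)
validate_dense([0.1, 0.2, 0.3, 0.4], dims)  # passes silently
```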

## Compute Engines

Adapters use different compute backends depending on model architecture:

### Flash Attention 2

Source: [packages/sie_server/src/sie_server/adapters/bert_flash/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/bert_flash/__init__.py)

Flash Attention with variable-length sequences eliminates padding waste. These adapters use `flash_attn_varlen_func` to pack the sequences of a batch together and process them without padding tokens.

**Benefits:**
- Higher throughput (no wasted compute on padding)
- Lower memory usage (no padded tensors)
- 2-4x speedup on typical workloads

**Used by:** BertFlashAdapter, Qwen2FlashAdapter, SPLADEFlashAdapter, ColBERTAdapter
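
Variable-length packing concatenates all tokens into one flat sequence and records each sequence boundary as a cumulative length. The sketch below shows only that bookkeeping in plain Python, with no GPU or `flash_attn` dependency. `pack_sequences` is a hypothetical helper, not the adapters' code, and the token ids are made up.

```python
from itertools import accumulate


def pack_sequences(batch: list[list[int]]) -> tuple[list[int], list[int], int]:
    """Flatten token id lists and return (tokens, cu_seqlens, max_seqlen)."""
    lengths = [len(seq) for seq in batch]
    tokens = [tok for seq in batch for tok in seq]
    # cu_seqlens[i] .. cu_seqlens[i + 1] is the slice owned by sequence i.
    cu_seqlens = [0, *accumulate(lengths)]
    return tokens, cu_seqlens, max(lengths, default=0)


batch = [[101, 7, 102], [101, 102], [101, 5, 6, 9, 102]]
tokens, cu_seqlens, max_seqlen = pack_sequences(batch)

padded_slots = len(batch) * max_seqlen  # slots needed if padded to the longest
packed_slots = len(tokens)              # slots needed when packed
```

For this batch, padding to the longest sequence would occupy 15 token slots, while the packed layout uses 10. The cumulative-length array plays the role of the `cu_seqlens` arguments that varlen attention kernels take.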

### SGLang

Source: [packages/sie_server/src/sie_server/adapters/sglang/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/sglang/__init__.py)

SGLang provides memory-efficient inference for large LLM embedding models (4B+ parameters). It pre-allocates the KV cache to prevent out-of-memory (OOM) errors under concurrent load.

**Benefits:**
- Stable memory usage with concurrent requests
- Handles 4B-8B parameter models reliably
- LoRA adapter support via HTTP API

**Used by:** SGLangEmbeddingAdapter for Qwen3-Embedding-4B, GTE-Qwen2-7B, etc.

### PyTorch with SDPA

Standard PyTorch with Scaled Dot-Product Attention (SDPA). These adapters run models through standard transformer libraries such as sentence-transformers.

**Benefits:**
- Broadest compatibility
- Works on CPU, CUDA, and MPS
- Simple debugging

**Used by:** SentenceTransformerDenseAdapter, CrossEncoderAdapter, CLIPAdapter

## Adapter Catalog

### Dense Embedding Adapters

| Adapter | Compute | Models |
|---------|---------|--------|
| `SentenceTransformerDenseAdapter` | SDPA | all-MiniLM, BGE-base, GTE-multilingual |
| `BertFlashAdapter` | Flash | E5-v2 series, BERT-based models |
| `Qwen2FlashAdapter` | Flash | stella_en_1.5B_v5, GTE-Qwen2 series |
| `SGLangEmbeddingAdapter` | SGLang | Qwen3-Embedding-4B/8B, E5-Mistral-7B |
| `BGEM3Adapter` | SDPA | BAAI/bge-m3 (dense, sparse, multivector) |
| `NoMicFlashAdapter` | Flash | nomic-embed-text-v2-moe |
| `XLMRobertaFlashAdapter` | Flash | multilingual-e5-large, XLM-R models |
| `RoPEFlashAdapter` | Flash | Models with rotary position embeddings |

### Sparse Embedding Adapters

| Adapter | Compute | Models |
|---------|---------|--------|
| `SentenceTransformerSparseAdapter` | SDPA | sentence-transformers SparseEncoder models |
| `SPLADEFlashAdapter` | Flash | SPLADE-v3, OpenSearch Neural Sparse |
| `BGEM3Adapter` | SDPA | BAAI/bge-m3 sparse output |

### Multi-Vector Adapters (ColBERT)

Source: [packages/sie_server/src/sie_server/adapters/colbert/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/colbert/__init__.py)

| Adapter | Compute | Models |
|---------|---------|--------|
| `ColBERTAdapter` | Flash | jina-colbert-v2, colbertv2.0, answerai-colbert-small |
| `ColBERTModernBERTFlashAdapter` | Flash | GTE-ModernColBERT-v1, Reason-ModernColBERT |
| `ColBERTRotaryFlashAdapter` | Flash | ColBERT models with RoPE |

### Reranker Adapters

Source: [packages/sie_server/src/sie_server/adapters/cross_encoder/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/cross_encoder/__init__.py)

| Adapter | Compute | Models |
|---------|---------|--------|
| `CrossEncoderAdapter` | SDPA | BGE-reranker, Jina-reranker, MS-MARCO |
| `BertFlashCrossEncoderAdapter` | Flash | BERT-based rerankers |
| `JinaFlashCrossEncoderAdapter` | Flash | jina-reranker-v2-base-multilingual |
| `ModernBERTFlashCrossEncoderAdapter` | Flash | gte-reranker-modernbert-base |
| `Qwen2FlashCrossEncoderAdapter` | Flash | Qwen2-based rerankers |

### Vision Adapters

Source: [packages/sie_server/src/sie_server/adapters/clip/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/clip/__init__.py)

| Adapter | Modality | Models |
|---------|----------|--------|
| `CLIPAdapter` | Text + Image | openai/clip-vit-base-patch32, LAION CLIP |
| `SigLIPAdapter` | Text + Image | google/siglip-so400m-patch14 |
| `ColPaliAdapter` | Image | vidore/colpali-v1.3-hf |
| `ColQwen2Adapter` | Image | vidore/colqwen2.5-v0.2 |
| `NemoColEmbedAdapter` | Image | nvidia/llama-nemoretriever-colembed-3b |

### Extraction Adapters

Source: [packages/sie_server/src/sie_server/adapters/gliner/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/gliner/__init__.py)

| Adapter | Task | Models |
|---------|------|--------|
| `GLiNERAdapter` | Zero-shot NER | gliner_multi-v2.1, NuNER_Zero |
| `GLiRELAdapter` | Relation extraction | glirel-large-v0 |
| `GLiClassAdapter` | Classification | gliclass-base-v1.0 |
| `Florence2Adapter` | Document understanding | Florence-2-base, Florence-2-large |
| `DonutAdapter` | Document parsing | donut-base-finetuned-docvqa |
| `GroundingDinoAdapter` | Object detection | grounding-dino-tiny, grounding-dino-base |
| `OwlV2Adapter` | Zero-shot detection | owlv2-base-patch16-ensemble |
| `NLIClassificationAdapter` | Zero-shot classification | deberta-v3-large-zeroshot-v2.0 |

## Memory Management

Source: [packages/sie_server/src/sie_server/adapters/base.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/base.py)

Adapters must fully release GPU memory in `unload()` so LRU eviction is safe.

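```python
import gc

import torch


def unload(self) -> None:
    # Assumed: remember the device before clearing it, so the right cache is emptied.
    device = self._device

    # Drop the reference to the model so its tensors can be collected.
    if self._model is not None:
        del self._model
        self._model = None

    self._device = None

    # Release GPU memory
    gc.collect()
    if device and device.startswith("cuda"):
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
```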

The registry tracks memory usage via `memory_footprint()` for LRU eviction.
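
To show why a complete `unload()` matters, here is a minimal sketch of LRU eviction driven by `memory_footprint()`. `LRURegistry` and `FakeAdapter` are hypothetical, and the byte-based budget is an assumption; the real registry may differ.

```python
from collections import OrderedDict


class FakeAdapter:
    """Stand-in adapter; a real one would hold model weights."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self.loaded = True

    def memory_footprint(self) -> int:
        return self.size_bytes if self.loaded else 0

    def unload(self) -> None:
        self.loaded = False


class LRURegistry:
    """Hypothetical registry: evicts the least recently used adapter first."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self._adapters = OrderedDict()  # oldest entry first

    def used_bytes(self) -> int:
        return sum(a.memory_footprint() for a in self._adapters.values())

    def get(self, name: str) -> FakeAdapter:
        self._adapters.move_to_end(name)  # mark as most recently used
        return self._adapters[name]

    def add(self, name: str, adapter: FakeAdapter) -> None:
        needed = adapter.memory_footprint()
        # Evict from the cold end until the new adapter fits the budget.
        while self._adapters and self.used_bytes() + needed > self.budget_bytes:
            _, victim = self._adapters.popitem(last=False)
            victim.unload()
        self._adapters[name] = adapter


registry = LRURegistry(budget_bytes=100)
a, b, c = FakeAdapter(40), FakeAdapter(40), FakeAdapter(40)
registry.add("a", a)
registry.add("b", b)
registry.get("a")     # "a" is now more recent than "b"
registry.add("c", c)  # 120 bytes would exceed the budget, so "b" is evicted
```

If `unload()` left memory behind, the registry's accounting would drift from what the device actually holds, and an eviction would not free the space the next model needs.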

## LoRA Support

Some adapters can load and switch LoRA adapters at runtime:

```python
def supports_lora(self) -> bool:
    """Return True if this adapter supports LoRA."""
    ...

def load_lora(self, lora_path: str) -> int:
    """Load a LoRA adapter, return memory usage."""
    ...

def set_active_lora(self, lora_name: str | None) -> None:
    """Switch active LoRA before inference."""
    ...
```

SGLang adapters use the HTTP API for LoRA switching. PEFT-based adapters use the `PEFTLoRAMixin` for in-process loading.
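
The three methods could fit together as in the in-memory sketch below. `ToyLoRAAdapter` is hypothetical: deriving the LoRA name from the last path component and the fixed placeholder size are assumptions, and no weights are actually loaded.

```python
from __future__ import annotations

from pathlib import PurePosixPath


class ToyLoRAAdapter:
    """Hypothetical adapter that only tracks LoRA bookkeeping."""

    def __init__(self) -> None:
        self._loras: dict[str, int] = {}  # LoRA name -> memory usage
        self.active_lora: str | None = None

    def supports_lora(self) -> bool:
        return True

    def load_lora(self, lora_path: str) -> int:
        # Assumption: the LoRA is named after the last path component.
        name = PurePosixPath(lora_path).name
        memory_usage = 1024  # placeholder; a real adapter would measure this
        self._loras[name] = memory_usage
        return memory_usage

    def set_active_lora(self, lora_name: str | None) -> None:
        # None switches back to the base model.
        if lora_name is not None and lora_name not in self._loras:
            raise KeyError(f"LoRA not loaded: {lora_name!r}")
        self.active_lora = lora_name


toy = ToyLoRAAdapter()
if toy.supports_lora():
    toy.load_lora("loras/legal-domain")
    toy.set_active_lora("legal-domain")
```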

## Writing Custom Adapters

To add support for a new model architecture, see [Adding Models](/docs/engine/adding-models/).

The typical workflow:
1. Identify the model architecture (BERT, Qwen2, custom)
2. Choose a compute backend (SDPA, Flash, SGLang)
3. Implement the adapter protocol
4. Create a model config in `packages/sie_server/models/`

## What's Next

- [Adding Models](/docs/engine/adding-models/) - configure new models
- [Model Catalog](/models#task=encode) - supported encode models
