---
title: LoRA Adapters
description: Fine-tune embeddings with Low-Rank Adaptation without retraining base models.
canonical_url: https://superlinked.com/docs/engine/lora
last_updated: 2026-05-20
---

LoRA (Low-Rank Adaptation) lets you customize embedding models for specific domains. Instead of fine-tuning all model weights, LoRA trains small adapter layers. This reduces training cost and enables swapping adapters at inference time.

## What is LoRA

LoRA freezes the base model and injects trainable low-rank matrices into attention layers. A typical LoRA adapter is 1-5% of the base model size. Multiple LoRA adapters can share the same base model, so you can switch between domains without reloading the base weights.
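The low-rank idea can be sketched in a few lines of plain Python. This is an illustration only, not SIE's implementation: the helper names (`matmul`, `lora_param_counts`) and the matrix sizes are made up for the example, and the ratio shown is for a single adapted weight matrix rather than a whole model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d_out, d_in, rank):
    """Return (full, adapter) parameter counts for one weight matrix."""
    full = d_out * d_in
    adapter = rank * d_in + d_out * rank  # A is rank x d_in, B is d_out x rank
    return full, adapter


# Tiny example: a frozen 3x3 weight W plus a rank-1 update B @ A.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.0, 0.0]]        # 1 x 3, trainable
B = [[2.0], [0.0], [0.0]]    # 3 x 1, trainable
delta = matmul(B, A)         # 3 x 3 matrix of rank 1
W_eff = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Illustrative sizes: a 1024 x 1024 projection adapted with rank 8.
full, adapter = lora_param_counts(1024, 1024, 8)
ratio = adapter / full
print(f"adapter parameters: {adapter} of {full} ({ratio:.2%})")
```

Only `A` and `B` are trained and stored per adapter, which is why several adapters can sit next to one copy of `W`.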

**Benefits:**
- Train domain-specific embeddings with minimal data
- Share base model across multiple adapters
- Hot-swap adapters per request
- Use less GPU memory than separate fine-tuned models

## Quick Example

Source: [packages/sie_server/src/sie_server/adapters/peft_lora_mixin.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/peft_lora_mixin.py)

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Use a LoRA adapter for domain-specific embeddings
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of fiduciary duty"),
    options={"lora_id": "org/bge-m3-legal-lora"}
)
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

// Use a LoRA adapter for domain-specific embeddings.
// `options` passthrough is supported on the wire; cast until the TS
// SDK types add the field (see reference/typescript-sdk).
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "breach of fiduciary duty" },
  { options: { lora_id: "org/bge-m3-legal-lora" } } as never,
);
```

## PEFT LoRA (Dynamic Loading)

Source: [packages/sie_server/src/sie_server/adapters/peft_lora_mixin.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/peft_lora_mixin.py)

Most SIE model adapters use the PEFT (Parameter-Efficient Fine-Tuning) library for LoRA support. PEFT provides dynamic loading and hot reload.

**How it works:**
1. First request with a LoRA triggers async loading
2. PEFT wraps the base model with adapter layers
3. Subsequent requests use the already-loaded adapter with no load delay
4. Multiple LoRAs can be loaded simultaneously

#### Python

```python
# First request: triggers LoRA load (may take a few seconds)
result = client.encode("BAAI/bge-m3", Item(text="legal query"), options={"lora_id": "org/legal-lora"})

# Subsequent requests: instant (adapter already loaded)
result = client.encode("BAAI/bge-m3", Item(text="another query"), options={"lora_id": "org/legal-lora"})

# Switch to different LoRA
result = client.encode("BAAI/bge-m3", Item(text="medical query"), options={"lora_id": "org/medical-lora"})
```

#### TypeScript

```typescript
// First request: triggers LoRA load (may take a few seconds)
let result = await client.encode("BAAI/bge-m3", { text: "legal query" }, { options: { lora_id: "org/legal-lora" } } as never);

// Subsequent requests: instant (adapter already loaded)
result = await client.encode("BAAI/bge-m3", { text: "another query" }, { options: { lora_id: "org/legal-lora" } } as never);

// Switch to different LoRA
result = await client.encode("BAAI/bge-m3", { text: "medical query" }, { options: { lora_id: "org/medical-lora" } } as never);
```

**PEFT adapters support hot reload.** Loading a new LoRA does not block ongoing inference requests.
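The load-on-first-request pattern described above can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not the code in `peft_lora_mixin.py`: `LazyAdapterCache` and `fake_loader` are hypothetical names, and the loader only simulates the time spent fetching and wrapping weights.

```python
import asyncio


class LazyAdapterCache:
    """Load each adapter at most once without blocking other requests."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical coroutine: lora_id -> adapter
        self._tasks = {}       # lora_id -> asyncio.Task for the load

    async def get(self, lora_id):
        task = self._tasks.get(lora_id)
        if task is None:
            # First request for this adapter: start the load as a background task.
            task = asyncio.create_task(self._loader(lora_id))
            self._tasks[lora_id] = task
        # Later requests await the same task, which returns at once when done.
        # Note: this sketch also caches a failed load; real code would retry.
        return await task


load_calls = []


async def fake_loader(lora_id):
    load_calls.append(lora_id)
    await asyncio.sleep(0.01)  # stand-in for downloading and wrapping weights
    return f"adapter:{lora_id}"


async def main():
    cache = LazyAdapterCache(fake_loader)
    return await asyncio.gather(
        cache.get("org/legal-lora"),
        cache.get("org/legal-lora"),    # shares the in-flight load
        cache.get("org/medical-lora"),  # loads concurrently
    )


results = asyncio.run(main())
```

Two concurrent requests for the same adapter trigger a single load, and a request for a different adapter is not held up behind it.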

## SGLang LoRA (Pre-loaded)

Source: [packages/sie_server/src/sie_server/adapters/sglang/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/sglang/__init__.py)

For LLM-based embedding models (4B+ parameters), SIE uses SGLang. SGLang requires LoRA adapters to be pre-loaded at server startup.

**Configure in model config YAML:**

```yaml
name: Qwen/Qwen3-Embedding-4B
adapter: sglang
adapter_options_loadtime:
  lora_paths:
    legal: org/qwen3-legal-lora
    medical: /path/to/medical-adapter
  max_loras_per_batch: 8
```

**Use at request time:**

#### Python

```python
# Select pre-loaded LoRA by name
result = client.encode(
    "Qwen/Qwen3-Embedding-4B",
    Item(text="legal document"),
    options={"lora_id": "legal"}
)
```

#### TypeScript

```typescript
// Select pre-loaded LoRA by name
const result = await client.encode(
  "Qwen/Qwen3-Embedding-4B",
  { text: "legal document" },
  { options: { lora_id: "legal" } } as never,
);
```

SGLang handles mixed-LoRA batching internally via S-LoRA, so requests that use different LoRAs can be batched together.
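Because SGLang adapters are fixed at startup, a client may want to check a name before sending a request. The sketch below is a hypothetical client-side guard, not part of the SIE SDK: `LOADTIME_OPTIONS` simply mirrors the YAML example as a dict, and `validate_lora_name` is an illustrative helper.

```python
# Mirrors the YAML example above as a plain dict (values copied from it).
LOADTIME_OPTIONS = {
    "lora_paths": {
        "legal": "org/qwen3-legal-lora",
        "medical": "/path/to/medical-adapter",
    },
    "max_loras_per_batch": 8,
}


def validate_lora_name(lora_id, options=LOADTIME_OPTIONS):
    """Return lora_id if it names a pre-loaded adapter, else raise ValueError."""
    known = options["lora_paths"]
    if lora_id not in known:
        raise ValueError(f"unknown LoRA {lora_id!r}; pre-loaded: {sorted(known)}")
    return lora_id


options = {"lora_id": validate_lora_name("legal")}
```

The request uses the short name (`legal`), not the path or repository it was loaded from.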

## Configuring LoRA

### Via Request Options

Pass `lora_id` in the options parameter:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="query"),
    options={"lora_id": "org/my-lora-adapter"}
)
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "query" },
  { options: { lora_id: "org/my-lora-adapter" } } as never,
);
```

### Via Profiles

Source: [packages/sie_server/src/sie_server/config/model.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/config/model.py)

Define LoRA adapters as profiles in your model config. This simplifies client code and enables named presets.

```yaml
name: BAAI/bge-m3
profiles:
  legal:
    instruction: "Given a legal query, retrieve relevant case law"
    lora_id: org/bge-m3-legal-lora
  medical:
    instruction: "Retrieve medical research for this query"
    lora_id: org/bge-m3-medical-lora
```

Use the profile by name:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="breach of contract"),
    options={"profile": "legal"},
)
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "breach of contract" },
  { options: { profile: "legal" } } as never,
);
```
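Conceptually, a profile expands into the options it bundles. The sketch below shows one way that expansion could work. It is an assumption for illustration, not the logic in `config/model.py`: `resolve_options` is a hypothetical helper, and the rule that explicit request options override profile values is a guess, not documented behavior.

```python
# Mirrors the profiles YAML above as a plain dict.
PROFILES = {
    "legal": {
        "instruction": "Given a legal query, retrieve relevant case law",
        "lora_id": "org/bge-m3-legal-lora",
    },
    "medical": {
        "instruction": "Retrieve medical research for this query",
        "lora_id": "org/bge-m3-medical-lora",
    },
}


def resolve_options(options, profiles=PROFILES):
    """Expand {"profile": name} into the profile's fields (hypothetical logic)."""
    remaining = dict(options)          # do not mutate the caller's dict
    name = remaining.pop("profile", None)
    if name is None:
        return remaining
    if name not in profiles:
        raise KeyError(f"unknown profile {name!r}")
    # Assumption: explicit request options take precedence over profile values.
    return {**profiles[name], **remaining}


resolved = resolve_options({"profile": "legal"})
```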

### HTTP API

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"text": "legal query"}], "params": {"options": {"lora_id": "org/legal-lora"}}}'
```
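The same request can be built from Python with only the standard library. This sketch reuses the URL, headers, and body from the `curl` example; `build_encode_request` is an illustrative helper, and the response format is not assumed here.

```python
import json
import urllib.request


def build_encode_request(base_url, model, texts, lora_id):
    """Build a POST request equivalent to the curl example above."""
    payload = {
        "items": [{"text": text} for text in texts],
        "params": {"options": {"lora_id": lora_id}},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/encode/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_encode_request(
    "http://localhost:8080", "BAAI/bge-m3", ["legal query"], "org/legal-lora"
)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
```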

## LoRA Eviction

Source: [packages/sie_server/src/sie_server/core/registry.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/registry.py)

SIE limits the number of loaded LoRA adapters per model to manage GPU memory. When this limit is reached, the least recently used (LRU) adapter is evicted.

**Configuration:**

```yaml
# engine.yaml
max_loras_per_model: 10  # Default: 10 adapters per model
```

Or via environment variable:

```bash
SIE_MAX_LORAS_PER_MODEL=20
```

**Eviction behavior:**
- New LoRA request triggers eviction if limit reached
- The least recently used adapter is unloaded first
- Evicted adapters reload automatically on next request
- Base model remains loaded (only adapter weights evicted)
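The eviction rules above amount to a standard LRU cache. A minimal sketch using `collections.OrderedDict` follows. It is not the code in `registry.py`: `LoraLRU` and its default loader are hypothetical, and the loader stands in for reading adapter weights.

```python
from collections import OrderedDict


class LoraLRU:
    """Keep at most max_loras adapters loaded; evict the least recently used."""

    def __init__(self, max_loras=10, loader=None):
        self.max_loras = max_loras
        # Hypothetical loader: stands in for loading adapter weights.
        self._loader = loader or (lambda lora_id: f"weights:{lora_id}")
        self._loaded = OrderedDict()  # oldest use first, newest use last
        self.evicted = []             # record of evictions, for illustration

    def get(self, lora_id):
        if lora_id in self._loaded:
            self._loaded.move_to_end(lora_id)  # mark as most recently used
            return self._loaded[lora_id]
        if len(self._loaded) >= self.max_loras:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            self.evicted.append(victim)
        self._loaded[lora_id] = self._loader(lora_id)     # load or reload
        return self._loaded[lora_id]

    def loaded(self):
        return list(self._loaded)


cache = LoraLRU(max_loras=2)
cache.get("legal")
cache.get("medical")
cache.get("legal")    # refreshes "legal", so "medical" is now the oldest
cache.get("finance")  # limit reached: evicts "medical"
```

A later `cache.get("medical")` reloads the adapter and evicts whichever adapter is then least recently used.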

Each LoRA adds approximately 1-5% of base model memory. Monitor GPU memory if loading many adapters.
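For a rough budget, the 1-5% figure can be turned into a range. The numbers below are illustrative: the 2.0 GiB base model size is an assumed value, not a measurement of any specific model, and `adapter_memory_range_gib` is a hypothetical helper.

```python
def adapter_memory_range_gib(base_model_gib, num_adapters, low=0.01, high=0.05):
    """Estimate the extra memory for num_adapters at 1-5% of the base model each."""
    return (
        base_model_gib * low * num_adapters,
        base_model_gib * high * num_adapters,
    )


# Assumed 2.0 GiB base model with the default limit of 10 adapters.
low_gib, high_gib = adapter_memory_range_gib(base_model_gib=2.0, num_adapters=10)
print(f"extra adapter memory: {low_gib:.2f} to {high_gib:.2f} GiB")
```

At the top of the range, ten adapters add about half of the base model's footprint again, which is why the limit is worth tuning on smaller GPUs.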

## Supported Adapters

| Adapter Type | LoRA Support | Hot Reload | Notes |
|--------------|--------------|------------|-------|
| PEFT-based (sentence-transformers, BGE-M3, etc.) | Yes | Yes | Dynamic loading |
| SGLang (LLM embeddings) | Yes | No | Pre-loaded at startup |
| ColBERT | No | - | Not yet supported |
| CLIP/SigLIP | No | - | Not yet supported |

## What's Next

- [Model Catalog](/models) - see which models support LoRA
- [Profiles](/docs/engine/profiles/) - bundle LoRA with other options
