Model Adaptation

What is a LoRA Adapter?

A LoRA (Low-Rank Adaptation) adapter is a lightweight set of trainable weight matrices added to specific layers of a pre-trained neural network. During fine-tuning, only the LoRA weights are updated; the base model weights remain frozen. Across a whole model this typically cuts the trainable parameters to roughly 0.1-1% of the total (a 100-1000x reduction compared to full fine-tuning), making domain adaptation practical without large compute budgets.

Why does LoRA matter for inference?

LoRA solves a key problem in deploying embedding models: general-purpose models trained on broad data underperform on specialised domains (legal, medical, financial, code). Full fine-tuning is the obvious fix, but it is expensive: it requires updating hundreds of millions of parameters and storing a complete copy of the model for each domain.

LoRA adapters are small (typically 10-100MB vs 1-4GB for a full model) and can be swapped at runtime. This means a single base model can serve multiple domains by loading the appropriate adapter, without restarting the inference server.

SIE supports LoRA hot-loading: swap adapters between requests with zero downtime.

How does LoRA work?

A standard neural network weight matrix W has dimensions d × k. Full fine-tuning updates every element of W, which means d × k trainable parameters.

LoRA instead decomposes the weight update into two low-rank matrices:

W' = W + ΔW = W + BA

Where:

B has dimensions d × r
A has dimensions r × k
r is the rank (typically 4-64, much smaller than d or k)

During fine-tuning, only A and B are trained. The original W is frozen.
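The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not SIE's or any library's implementation: the tiny dimensions, the helper names (`matmul`, `add`, `random_matrix`, `lora_forward`) and the initialisation (small random A, zero B, which is the common convention so that BA starts at zero) are assumptions made for the example.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def random_matrix(rows, cols, rng):
    """Matrix of small Gaussian values (illustrative initialisation)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def lora_forward(W, B, A, x):
    """Compute W'x = Wx + B(Ax) without ever building the d x k update."""
    return add(matmul(W, x), matmul(B, matmul(A, x)))


d, k, r = 6, 4, 2                  # toy sizes; real layers are far larger
rng = random.Random(0)

W = random_matrix(d, k, rng)       # frozen base weight, d x k
A = random_matrix(r, k, rng)       # trainable, r x k
B = [[0.0] * r for _ in range(d)]  # trainable, d x r, zero so BA = 0 at the start
x = random_matrix(k, 1, rng)       # one input column vector, k x 1

base_out = matmul(W, x)
start_out = lora_forward(W, B, A, x)   # equals base_out while B is all zeros

# Stand-in for training: pretend the optimiser moved B away from zero.
B = random_matrix(d, r, rng)
W_merged = add(W, matmul(B, A))        # W' = W + BA, folded into one matrix
```

Because W' = W + BA is an ordinary d × k matrix, a trained adapter can either be kept separate (which allows swapping) or merged into the base weights (which removes the extra matrix multiplications).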

Parameters saved = d×k − (d×r + r×k) = d×k − r×(d+k)

For a weight matrix of 768×768 with rank r=16: full fine-tuning = 589,824 parameters; LoRA = 24,576 parameters, a 24× reduction.
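The arithmetic in this example is easy to check for other shapes and ranks. The helper below is a small sketch; the function name `lora_param_counts` is made up for illustration.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Trainable parameters for one d x k weight matrix, full vs LoRA."""
    full = d * k                # every element of W
    lora = r * (d + k)          # B is d x r, A is r x k
    return {
        "full": full,
        "lora": lora,
        "saved": full - lora,
        "reduction": full / lora,
    }


counts = lora_param_counts(768, 768, 16)
print(counts)
# {'full': 589824, 'lora': 24576, 'saved': 565248, 'reduction': 24.0}
```

The reduction factor grows as the matrix gets larger or the rank gets smaller, so the per-matrix figure here is more modest than whole-model figures that also count layers left without adapters.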

Which layers get LoRA adapters?

LoRA is typically applied to the attention weight matrices in transformer layers:

Query projection (Wq)
Key projection (Wk)
Value projection (Wv)
Output projection (Wo)

LoRA can optionally also be applied to the feed-forward layers. Adapting more layers adds trainable parameters and expressivity, at the cost of a larger adapter.
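To see how the choice of target layers affects adapter size, the sketch below counts adapter parameters for a hypothetical encoder. The hidden size (1024), feed-forward size (4096), layer count (24), module names and 4-byte parameters are all assumptions chosen for illustration, not properties of any specific model.

```python
# Assumed architecture, for illustration only.
HIDDEN, FFN, LAYERS = 1024, 4096, 24

# (d, k) shape of each weight matrix that could receive an adapter.
MODULE_SHAPES = {
    "query": (HIDDEN, HIDDEN),
    "key": (HIDDEN, HIDDEN),
    "value": (HIDDEN, HIDDEN),
    "output": (HIDDEN, HIDDEN),
    "ffn_up": (FFN, HIDDEN),
    "ffn_down": (HIDDEN, FFN),
}


def adapter_params(targets, r, layers=LAYERS):
    """Total LoRA parameters: r * (d + k) per targeted matrix, per layer."""
    per_layer = sum(r * (d + k) for d, k in (MODULE_SHAPES[t] for t in targets))
    return layers * per_layer


def adapter_megabytes(targets, r, bytes_per_param=4):
    """Approximate size on disk, assuming 32-bit parameters by default."""
    return adapter_params(targets, r) * bytes_per_param / 1e6


attention_only = ["query", "key", "value", "output"]
with_ffn = attention_only + ["ffn_up", "ffn_down"]

for name, targets in [("attention only", attention_only), ("attention + FFN", with_ffn)]:
    params = adapter_params(targets, r=16)
    size = adapter_megabytes(targets, r=16)
    print(f"{name}: {params:,} params, about {size:.1f} MB")
```

Under these assumptions, adding the feed-forward layers a little more than doubles the adapter size at the same rank.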

How do you use LoRA with SIE?

SIE supports LoRA hot-loading, so you can apply a domain-specific adapter at inference time:

from sie_sdk import SIEClient
from sie_sdk.types import Item

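client = SIEClient("http://localhost:8080")

# Placeholder input; replace with your own texts
documents = ["First example document.", "Second example document."]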

# General-purpose encoding
general_vectors = [r["dense"] for r in client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])]

# Legal domain encoding with LoRA adapter
legal_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        [Item(text=d) for d in documents],
        options={"lora_id": "org/bge-m3-legal-lora"},
    )
]

# Medical domain encoding with different adapter
medical_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        [Item(text=d) for d in documents],
        options={"lora_id": "org/bge-m3-medical-lora"},
    )
]

Multiple adapters can be loaded simultaneously and selected per request. The base model weights are shared across all of them; only the small adapter matrices differ.
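The idea of one shared base with several selectable adapters can be illustrated with a toy layer. This is a conceptual sketch only and says nothing about how SIE implements hot-loading internally; the class `SharedBaseLayer`, its methods and the adapter names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class SharedBaseLayer:
    """One frozen weight matrix plus any number of named (B, A) adapter pairs."""

    def __init__(self, W):
        self.W = W            # shared by every request
        self.adapters = {}    # name -> (B, A)

    def load_adapter(self, name, B, A):
        """Register an adapter without touching the base weights."""
        self.adapters[name] = (B, A)

    def forward(self, x, lora_id=None):
        """Apply the base weights, then the selected adapter's update if any."""
        out = matvec(self.W, x)
        if lora_id is None:
            return out
        B, A = self.adapters[lora_id]
        delta = matvec(B, matvec(A, x))   # B(Ax), the low-rank correction
        return [o + d for o, d in zip(out, delta)]


layer = SharedBaseLayer(W=[[1.0, 0.0], [0.0, 1.0]])
layer.load_adapter("legal", B=[[1.0], [0.0]], A=[[0.5, 0.5]])
layer.load_adapter("medical", B=[[0.0], [1.0]], A=[[1.0, -1.0]])

x = [2.0, 4.0]
general = layer.forward(x)                     # [2.0, 4.0]
legal = layer.forward(x, lora_id="legal")      # [5.0, 4.0]
medical = layer.forward(x, lora_id="medical")  # [2.0, 2.0]
```

Loading a new adapter only adds an entry to the dictionary, which is why swapping domains does not require reloading the base model.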

LoRA vs full fine-tuning vs prompt tuning

	Full fine-tuning	LoRA	Prompt tuning
Parameters updated	All (100%)	~0.1-1%	<0.01%
Storage per domain	Full model copy	Small adapter	Tiny prompt
Quality	Highest	Near-full	Lower
Training cost	High	Low	Lowest
Inference cost	Normal	Normal + tiny overhead	Normal
Hot-swap at runtime	✗	✓ (SIE)	✓

For most domain adaptation use cases, LoRA provides the best accuracy-cost trade-off.

Rank selection: how do you choose r?

The rank r controls the adapter’s capacity:

Rank	Parameters	When to use
4-8	Minimal	Simple style/tone adaptation
16	Low	Standard domain adaptation
32	Medium	Complex domain shift
64+	High	Approaching full fine-tune quality

Start with r=16 for most domain adaptation tasks. Increase r only if validation metrics plateau short of your target, since each step up in rank adds parameters and adapter size.
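One way to make this guidance systematic is a small rank sweep that stops once a larger rank no longer pays off. The sketch below assumes a hypothetical `evaluate(r)` callback that trains an adapter at rank r and returns a validation metric where higher is better; the candidate ranks and the `min_gain` threshold are illustrative choices, and the scores are made-up numbers used only to exercise the function.

```python
from typing import Callable, Iterable


def choose_rank(
    evaluate: Callable[[int], float],
    ranks: Iterable[int] = (16, 32, 64),
    min_gain: float = 0.005,
) -> int:
    """Return the last rank whose validation gain over the previous one
    was at least min_gain. Ranks must be given in ascending order."""
    best_rank, best_score = None, float("-inf")
    for r in ranks:
        score = evaluate(r)
        if best_rank is not None and score - best_score < min_gain:
            break  # the larger rank did not help enough; keep the smaller one
        best_rank, best_score = r, score
    return best_rank


# Made-up validation scores standing in for real training runs.
fake_scores = {16: 0.76, 32: 0.79, 64: 0.792}
selected = choose_rank(lambda r: fake_scores[r])
print(selected)
# 32
```

Each call to `evaluate` is a full training run in practice, so starting at r=16 and moving upwards keeps the number of runs small.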

Training a LoRA adapter for your domain

You need (query, positive document) pairs from your domain, the same training signal used for embedding model training:

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Load base model
base_model = AutoModel.from_pretrained("BAAI/bge-m3")

# Apply LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query", "key", "value"],
    lora_dropout=0.1,
)
peft_model = get_peft_model(base_model, lora_config)

# Train on domain-specific (query, positive) pairs
# ... training loop ...

# Save adapter only (~50MB vs ~2GB for full model)
peft_model.save_pretrained("legal-lora-adapter/")

The adapter can then be loaded into SIE for hot-swap deployment.
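The training loop itself is left out of the example above. For (query, positive document) pairs, one commonly used objective is a contrastive loss with in-batch negatives, sketched here in plain Python on toy vectors. This is an assumption about a reasonable objective, not necessarily the one intended by the example; the function names and the temperature value are illustrative, and a real loop would compute the same quantity on model outputs with an autograd framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean cross-entropy where doc i is the positive for query i and
    every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each query matches its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each query matches the other document

loss_aligned = in_batch_contrastive_loss(queries, aligned_docs)
loss_swapped = in_batch_contrastive_loss(queries, swapped_docs)
print(loss_aligned < loss_swapped)
# True
```

Minimising this loss pulls each query towards its own positive document and away from the other documents in the batch, which is the behaviour the adapter weights are trained to produce.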

Frequently asked questions

Does a LoRA adapter change model inference speed? Negligibly. The adapter matrices are small and the extra computation is minimal. SIE’s batching absorbs this overhead.

Can I combine LoRA with quantisation? Yes. QLoRA (Quantised LoRA) quantises the frozen base model to 4-bit precision and trains LoRA adapters on top of it in higher precision (typically 16-bit). This is a common approach for fine-tuning large models on consumer hardware.
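A rough estimate shows why quantising the base model matters for memory. The sketch below counts weight storage only and ignores quantisation constants, activations, gradients and optimiser state; the 7-billion-parameter base model and 20-million-parameter adapter are assumed sizes chosen for illustration.

```python
def weight_memory_gb(base_params, adapter_params, base_bits, adapter_bits=16):
    """Approximate memory for weights only, in gigabytes (1 GB = 1e9 bytes)."""
    base_bytes = base_params * base_bits / 8
    adapter_bytes = adapter_params * adapter_bits / 8
    return (base_bytes + adapter_bytes) / 1e9


BASE_PARAMS = 7_000_000_000   # assumed base model size
ADAPTER_PARAMS = 20_000_000   # assumed adapter size

fp16_gb = weight_memory_gb(BASE_PARAMS, ADAPTER_PARAMS, base_bits=16)
qlora_gb = weight_memory_gb(BASE_PARAMS, ADAPTER_PARAMS, base_bits=4)

print(f"16-bit base + adapter: {fp16_gb:.2f} GB")
print(f" 4-bit base + adapter: {qlora_gb:.2f} GB")
# 16-bit base + adapter: 14.04 GB
#  4-bit base + adapter: 3.54 GB
```

The adapter's contribution is small in both cases, so nearly all of the saving comes from storing the frozen base weights in 4 bits.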

How much domain-specific training data do I need? LoRA adapters can be effective with as few as hundreds of (query, document) pairs. More data helps, but the low parameter count means LoRA is significantly less data-hungry than full fine-tuning.