What is a LoRA Adapter?
A LoRA (Low-Rank Adaptation) adapter is a lightweight set of trainable weight matrices added to specific layers of a pre-trained neural network. During fine-tuning, only the LoRA weights are updated; the base model weights remain frozen. Across a whole model this typically cuts the number of trainable parameters by roughly 100–1000x compared to full fine-tuning, making domain adaptation practical without large compute budgets.
Why does LoRA matter for inference?
LoRA solves a key problem in deploying embedding models: general-purpose models trained on broad data underperform on specialised domains (legal, medical, financial, code). Full fine-tuning is expensive — it requires updating hundreds of millions of parameters and storing a complete copy of the model for each domain.
LoRA adapters are small (typically 10–100MB vs 1–4GB for a full model) and can be swapped at runtime. This means a single base model can serve multiple domains by loading the appropriate adapter — without restarting the inference server.
SIE supports LoRA hot-loading: swap adapters between requests with zero downtime.
How does LoRA work?
A standard neural network weight matrix W has dimensions d × k. Full fine-tuning updates every element of W — that’s d × k parameters.
LoRA instead decomposes the weight update into two low-rank matrices:
    W' = W + ΔW = W + BA

Where:
- B has dimensions d × r
- A has dimensions r × k
- r is the rank (typically 4–64, much smaller than d or k)
During fine-tuning, only A and B are trained. The original W is frozen.
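The decomposition can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and the tiny dimensions are made up for the example. It assumes, following common LoRA practice, that B starts at zero so the adapted model initially matches the base model, and it omits the scaling factor that LoRA implementations usually apply to BA.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply matrix P by matrix Q."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lora_forward(W, B, A, x):
    """Compute W x + B (A x) without ever forming the d x k update BA."""
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank path through r dimensions
    return [b + u for b, u in zip(base, update)]

def merge(W, B, A):
    """Fold the adapter into the base weights: W' = W + BA."""
    BA = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W, BA)]

d, k, r = 4, 3, 2                      # illustrative sizes only
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]   # frozen
A = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d)]                                # trainable, zero init
x = [rng.uniform(-1, 1) for _ in range(k)]

# With B = 0 the update BA is zero, so the output equals the base model's.
assert lora_forward(W, B, A, x) == matvec(W, x)
```

Once B has been trained, `merge(W, B, A)` yields a single matrix that gives the same outputs as `lora_forward`, which is one reason an adapter need not add inference cost when it is merged ahead of time.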
    Parameters saved = d×k − (d×r + r×k) = d×k − r×(d+k)

For a weight matrix of 768×768 with rank r=16: full fine-tuning = 589,824 parameters; LoRA = 24,576 parameters, a 24× reduction.
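The arithmetic above is easy to check for other shapes and ranks. The function names below are illustrative, not part of any library:

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when every element of a d x k matrix is updated."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

def reduction_factor(d: int, k: int, r: int) -> float:
    """How many times fewer parameters LoRA trains for one matrix."""
    return full_params(d, k) / lora_params(d, k, r)

print(full_params(768, 768))           # 589824
print(lora_params(768, 768, 16))       # 24576
print(reduction_factor(768, 768, 16))  # 24.0
```

The reduction factor for a single matrix is smaller than the whole-model figure because a full model also contains layers that receive no adapter at all.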
Which layers get LoRA adapters?
LoRA is typically applied to the attention weight matrices in transformer layers:
- Query projection (Wq)
- Key projection (Wk)
- Value projection (Wv)
- Output projection (Wo)
LoRA can optionally be applied to the feed-forward layers as well. Adapting more layers adds parameters and expressivity, at the cost of a larger adapter.
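To see how the choice of target layers affects adapter size, the sketch below counts adapter parameters for a hypothetical encoder. The layer count, hidden size, and the 4× feed-forward width are assumptions chosen for illustration, not the configuration of any particular model.

```python
def adapter_params(n_layers: int, shapes: dict, r: int) -> int:
    """Total LoRA parameters when each listed matrix in every layer gets an adapter.

    shapes maps a module name to its (d, k) weight dimensions.
    """
    per_layer = sum(r * (d + k) for d, k in shapes.values())
    return n_layers * per_layer

def megabytes(n_params: int, bytes_per_param: int = 4) -> float:
    """Storage in MB, assuming 32-bit floats by default."""
    return n_params * bytes_per_param / 1e6

HIDDEN = 1024        # hypothetical hidden size
FFN = 4 * HIDDEN     # hypothetical feed-forward width
N_LAYERS = 24        # hypothetical number of transformer layers

attention = {name: (HIDDEN, HIDDEN) for name in ("Wq", "Wk", "Wv", "Wo")}
attention_and_ffn = {**attention, "ffn_up": (FFN, HIDDEN), "ffn_down": (HIDDEN, FFN)}

attn_only = adapter_params(N_LAYERS, attention, r=16)
with_ffn = adapter_params(N_LAYERS, attention_and_ffn, r=16)
print(attn_only, round(megabytes(attn_only), 1))
print(with_ffn, round(megabytes(with_ffn), 1))
```

Under these assumptions, adding the feed-forward matrices slightly more than doubles the adapter size.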
How do you use LoRA with SIE?
SIE supports LoRA hot-loading — apply a domain-specific adapter at inference time:
    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    # documents is assumed to be a list of strings defined elsewhere

    # General-purpose encoding
    general_vectors = [
        r["dense"]
        for r in client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])
    ]

    # Legal domain encoding with LoRA adapter
    legal_vectors = [
        r["dense"]
        for r in client.encode(
            "BAAI/bge-m3",
            [Item(text=d) for d in documents],
            options={"lora_id": "org/bge-m3-legal-lora"},
        )
    ]

    # Medical domain encoding with different adapter
    medical_vectors = [
        r["dense"]
        for r in client.encode(
            "BAAI/bge-m3",
            [Item(text=d) for d in documents],
            options={"lora_id": "org/bge-m3-medical-lora"},
        )
    ]

Multiple adapters can be loaded simultaneously and selected per-request. The base model weights are shared; only the small adapter matrices differ.
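Per-request selection can be wrapped in a small routing helper. The sketch below is hypothetical: `ADAPTERS` and `encode_kwargs` are names invented here, not part of the SIE SDK, and the helper only builds the keyword arguments for the `client.encode` calls shown above.

```python
# Hypothetical mapping from a domain label to the adapter IDs used above.
ADAPTERS = {
    "legal": "org/bge-m3-legal-lora",
    "medical": "org/bge-m3-medical-lora",
}

def encode_kwargs(domain=None):
    """Build extra keyword arguments for an encode call.

    Returns {} for general-purpose encoding, so the call matches the
    adapter-free example, and an options dict carrying lora_id otherwise.
    """
    if domain is None:
        return {}
    if domain not in ADAPTERS:
        raise ValueError(f"no adapter registered for domain {domain!r}")
    return {"options": {"lora_id": ADAPTERS[domain]}}

print(encode_kwargs("legal"))
```

A call would then look like `client.encode("BAAI/bge-m3", items, **encode_kwargs(domain))`, where `items` is the list of `Item` objects built as in the examples above.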
LoRA vs full fine-tuning vs prompt tuning
| | Full fine-tuning | LoRA | Prompt tuning |
|---|---|---|---|
| Parameters updated | All (100%) | ~0.1–1% | <0.01% |
| Storage per domain | Full model copy | Small adapter | Tiny prompt |
| Quality | Highest | Near-full | Lower |
| Training cost | High | Low | Lowest |
| Inference cost | Normal | Normal + tiny overhead | Normal |
| Hot-swap at runtime | ✗ | ✓ (SIE) | ✓ |
For most domain adaptation use cases, LoRA provides the best accuracy-cost trade-off.
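The storage row of the table can be made concrete with a short calculation. The sizes used here (a 2GB base model, 50MB adapters, five domains) are round illustrative figures in line with the ranges quoted earlier, not measurements.

```python
def full_copies_gb(n_domains: int, model_gb: float) -> float:
    """Storage when every domain keeps its own fully fine-tuned model."""
    return n_domains * model_gb

def shared_base_gb(n_domains: int, model_gb: float, adapter_mb: float) -> float:
    """Storage for one shared base model plus one adapter per domain."""
    return model_gb + n_domains * adapter_mb / 1000

print(full_copies_gb(5, 2.0))        # 10.0
print(shared_base_gb(5, 2.0, 50))    # 2.25
```

The gap widens with every added domain, because each new domain costs a full model copy in one case and only an adapter in the other.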
Rank selection: how do you choose r?
The rank r controls the adapter’s capacity:
| Rank | Parameters | When to use |
|---|---|---|
| 4–8 | Minimal | Simple style/tone adaptation |
| 16 | Low | Standard domain adaptation |
| 32 | Medium | Complex domain shift |
| 64+ | High | Approaching full fine-tune quality |
Start with r=16 for most domain adaptation tasks. Increase r if validation metrics plateau below the quality you need; if a higher rank brings no measurable gain, keep the smaller adapter.
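One way to turn this advice into a procedure is to train adapters at a few ranks and keep the smallest one whose validation score is close to the best. The sketch below assumes that approach; `pick_rank`, the tolerance, and the scores are placeholders invented for illustration, not measured results.

```python
def pick_rank(scores: dict, tolerance: float = 0.005) -> int:
    """Return the smallest rank whose score is within tolerance of the best score.

    scores maps rank -> validation metric, where higher is better.
    """
    best = max(scores.values())
    return min(rank for rank, score in scores.items() if best - score <= tolerance)

# Made-up placeholder values, only to show the selection logic.
example_scores = {8: 0.710, 16: 0.740, 32: 0.743, 64: 0.744}
chosen = pick_rank(example_scores)
```

With these placeholder numbers the gains beyond r=16 fall inside the tolerance, so the helper settles on 16; a tolerance of zero would always pick the best-scoring rank regardless of size.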
Training a LoRA adapter for your domain
You need (query, positive document) pairs from your domain — the same training signal used for embedding model training:
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModel

    # Load base model
    base_model = AutoModel.from_pretrained("BAAI/bge-m3")

    # Apply LoRA configuration
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["query", "key", "value"],
        lora_dropout=0.1,
    )
    peft_model = get_peft_model(base_model, lora_config)

    # Train on domain-specific (query, positive) pairs
    # ... training loop ...

    # Save adapter only (~50MB vs ~2GB for full model)
    peft_model.save_pretrained("legal-lora-adapter/")

The adapter can then be loaded into SIE for hot-swap deployment.
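The training loop itself is not specified here. One common objective for (query, positive document) pairs is a contrastive loss with in-batch negatives; the sketch below assumes that choice and shows only the loss computation in plain Python, on a precomputed similarity matrix. The function name and the temperature value are illustrative assumptions.

```python
import math

def in_batch_contrastive_loss(sim, temperature=0.05):
    """Mean cross-entropy over the rows of a square similarity matrix.

    sim[i][j] is the similarity between query i and document j.
    Document i is the positive for query i; every other document in the
    batch acts as a negative.
    """
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]  # negative log-probability of the positive
    return total / n

# Toy similarity matrices, invented for illustration.
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
uninformative = [[0.5] * 3 for _ in range(3)]

assert in_batch_contrastive_loss(well_separated) < in_batch_contrastive_loss(uninformative)
```

In an actual training run the similarities would come from the adapted model's embeddings, and gradients of this loss would flow only into the LoRA matrices because the base weights are frozen.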
Frequently asked questions
Does a LoRA adapter change model inference speed? Negligibly. The adapter matrices are small and the extra computation is minimal. SIE’s batching absorbs this overhead.
Can I combine LoRA with quantisation? Yes. QLoRA (Quantised LoRA) quantises the frozen base model to 4-bit precision and trains LoRA adapters on top of it in higher precision (16-bit in the original QLoRA recipe). This is a common approach for fine-tuning large models on consumer hardware.
How much domain-specific training data do I need? LoRA adapters can be effective with as few as hundreds of (query, document) pairs. More data helps, but the low parameter count means LoRA is significantly less data-hungry than full fine-tuning.