
What is a LoRA Adapter?

A LoRA (Low-Rank Adaptation) adapter is a lightweight set of trainable weight matrices added to specific layers of a pre-trained neural network. During fine-tuning, only the LoRA weights are updated — the base model weights remain frozen. This reduces the number of trainable parameters by 100–1000x compared to full fine-tuning, making domain adaptation practical without large compute budgets.


Why does LoRA matter for inference?

LoRA solves a key problem in deploying embedding models: general-purpose models trained on broad data underperform on specialised domains (legal, medical, financial, code). Full fine-tuning is expensive — it requires updating hundreds of millions of parameters and storing a complete copy of the model for each domain.

LoRA adapters are small (typically 10–100MB vs 1–4GB for a full model) and can be swapped at runtime. This means a single base model can serve multiple domains by loading the appropriate adapter — without restarting the inference server.

SIE supports LoRA hot-loading: swap adapters between requests with zero downtime.


How does LoRA work?

A standard neural network weight matrix W has dimensions d × k. Full fine-tuning updates every element of W — that’s d × k parameters.

LoRA instead decomposes the weight update into two low-rank matrices:

W' = W + ΔW = W + BA

Where:

  • B has dimensions d × r
  • A has dimensions r × k
  • r is the rank (typically 4–64, much smaller than d or k)

During fine-tuning, only A and B are trained. The original W is frozen.
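The update rule above can be checked on a toy example. The sketch below uses plain Python lists and made-up values for W, B, A and x (assumptions chosen for illustration, not taken from any real model). It shows that merging the update into the weight (W' = W + BA) and keeping a separate low-rank path (Wx + B(Ax)) give the same output.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy sizes: d = 3, k = 4, r = 1 (illustrative values only)
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [3.0, 0.0, 1.0, 0.0]]          # frozen base weight, d x k
B = [[1.0], [0.0], [2.0]]           # trainable, d x r
A = [[0.5, 0.0, 1.0, 0.0]]          # trainable, r x k
x = [[1.0], [2.0], [3.0], [4.0]]    # input column vector, k x 1

# Option 1: merge the update into the weight, W' = W + BA
W_merged = matadd(W, matmul(B, A))
y_merged = matmul(W_merged, x)

# Option 2: leave W untouched and add the low-rank path, Wx + B(Ax)
y_split = matadd(matmul(W, x), matmul(B, matmul(A, x)))

print(y_merged, y_split)
```

The second form is what makes swapping adapters cheap: W never changes, so replacing B and A is enough to change the layer's behaviour.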

Parameters saved = d×k − (d×r + r×k) = d×k − r×(d+k)

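For a 768×768 weight matrix with rank r=16, full fine-tuning updates 768 × 768 = 589,824 parameters, while LoRA trains 16 × (768 + 768) = 24,576: a 24× reduction for that matrix. The reduction across the whole model is larger, because weights that receive no adapter (embeddings and any non-adapted layers) contribute no trainable parameters at all.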
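The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `lora_param_counts` is made up for illustration.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare trainable parameters for full fine-tuning vs LoRA on one d x k matrix."""
    full = d * k          # every element of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return {
        "full": full,
        "lora": lora,
        "saved": full - lora,
        "reduction": full / lora,
    }


counts = lora_param_counts(d=768, k=768, r=16)
print(counts)
```

For d = k = 768 and r = 16 this gives 589,824 full parameters, 24,576 LoRA parameters, and a reduction factor of 24.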


Which layers get LoRA adapters?

LoRA is typically applied to the attention weight matrices in transformer layers:

  • Query projection (Wq)
  • Key projection (Wk)
  • Value projection (Wv)
  • Output projection (Wo)

LoRA can optionally also be applied to the feed-forward layers. Adapting more layers adds trainable parameters and expressivity, at the cost of a larger adapter.
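To make the size trade-off concrete, the sketch below estimates the adapter parameter count for a transformer encoder. The dimensions are assumptions for illustration: 12 layers, hidden size 768, square attention projections, and feed-forward matrices with a 4× inner width. The helper names are hypothetical.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one d x k matrix: B (d x r) plus A (r x k)."""
    return r * (d + k)


def adapter_params(hidden: int, layers: int, r: int,
                   include_ffn: bool = False, ffn_mult: int = 4) -> int:
    """Total LoRA parameters across all layers of an assumed encoder shape."""
    # Four attention projections (Wq, Wk, Wv, Wo), assumed hidden x hidden
    per_layer = 4 * lora_params(hidden, hidden, r)
    if include_ffn:
        # Two feed-forward matrices, assumed hidden -> inner and inner -> hidden
        inner = ffn_mult * hidden
        per_layer += lora_params(inner, hidden, r) + lora_params(hidden, inner, r)
    return layers * per_layer


attention_only = adapter_params(hidden=768, layers=12, r=16)
with_ffn = adapter_params(hidden=768, layers=12, r=16, include_ffn=True)
print(attention_only, with_ffn)
```

Under these assumptions, adapting only attention gives 1,179,648 trainable parameters, and adding the feed-forward layers raises that to 2,654,208.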


How do you use LoRA with SIE?

SIE supports LoRA hot-loading — apply a domain-specific adapter at inference time:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Example input; replace with your own texts
documents = ["First document text.", "Second document text."]
items = [Item(text=d) for d in documents]

# General-purpose encoding
general_vectors = [r["dense"] for r in client.encode("BAAI/bge-m3", items)]

# Legal domain encoding with LoRA adapter
legal_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        items,
        options={"lora_id": "org/bge-m3-legal-lora"},
    )
]

# Medical domain encoding with a different adapter
medical_vectors = [
    r["dense"]
    for r in client.encode(
        "BAAI/bge-m3",
        items,
        options={"lora_id": "org/bge-m3-medical-lora"},
    )
]

Multiple adapters can be loaded simultaneously and selected per-request. The base model weights are shared — only the small adapter matrices differ.
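The idea of one shared base with per-request adapters can be illustrated with a toy linear layer. This is a conceptual sketch only, not SIE's implementation: the class `MultiAdapterLinear`, the adapter names, and all values are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (flat list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class MultiAdapterLinear:
    """One frozen base matrix shared by any number of named LoRA adapters."""

    def __init__(self, W):
        self.W = W            # shared base weight, never modified
        self.adapters = {}    # lora_id -> (B, A)

    def load_adapter(self, lora_id, B, A):
        """Register an adapter; the base weight is left untouched."""
        self.adapters[lora_id] = (B, A)

    def forward(self, x, lora_id=None):
        """Compute Wx, plus B(Ax) when an adapter is selected for this request."""
        y = matvec(self.W, x)
        if lora_id is None:
            return y
        B, A = self.adapters[lora_id]
        delta = matvec(B, matvec(A, x))
        return [yi + di for yi, di in zip(y, delta)]


layer = MultiAdapterLinear(W=[[1.0, 0.0], [0.0, 1.0]])
layer.load_adapter("legal", B=[[1.0], [0.0]], A=[[1.0, 1.0]])
layer.load_adapter("medical", B=[[0.0], [2.0]], A=[[1.0, 0.0]])

x = [1.0, 2.0]
base_out = layer.forward(x)
legal_out = layer.forward(x, lora_id="legal")
medical_out = layer.forward(x, lora_id="medical")
print(base_out, legal_out, medical_out)
```

Because the adapters are stored separately from W, loading a new one costs only the memory of its B and A matrices, and requests for different domains can reuse the same base weights.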


LoRA vs full fine-tuning vs prompt tuning

| | Full fine-tuning | LoRA | Prompt tuning |
|---|---|---|---|
| Parameters updated | All (100%) | ~0.1–1% | <0.01% |
| Storage per domain | Full model copy | Small adapter | Tiny prompt |
| Quality | Highest | Near-full | Lower |
| Training cost | High | Low | Lowest |
| Inference cost | Normal | Normal + tiny overhead | Normal |
| Hot-swap at runtime | | ✓ (SIE) | |

For most domain adaptation use cases, LoRA provides the best accuracy-cost trade-off.


Rank selection: how do you choose r?

The rank r controls the adapter’s capacity:

| Rank | Parameters | When to use |
|---|---|---|
| 4–8 | Minimal | Simple style/tone adaptation |
| 16 | Low | Standard domain adaptation |
| 32 | Medium | Complex domain shift |
| 64+ | High | Approaching full fine-tune quality |

Start with r=16 for most domain adaptation tasks. If validation metrics plateau below your target, increase r.
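To put numbers on the table's parameter column, the sketch below sweeps the rank for a single weight matrix. The 768×768 shape is an assumption for illustration; real models have many such matrices, of varying shapes.

```python
d = k = 768
full = d * k  # parameters in the full weight matrix

# One row per rank: (rank, LoRA parameters, share of the full matrix)
rows = [(r, r * (d + k), r * (d + k) / full) for r in (4, 8, 16, 32, 64)]

for r, lora, share in rows:
    print(f"r={r:<3} params={lora:>7,} share_of_full={share:.2%}")
```

The parameter count grows linearly with r, so doubling the rank doubles the adapter size for the same set of target layers.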


Training a LoRA adapter for your domain

You need (query, positive document) pairs from your domain — the same training signal used for embedding model training:

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Load base model
base_model = AutoModel.from_pretrained("BAAI/bge-m3")

# Apply LoRA configuration
lora_config = LoraConfig(
    r=16,                                       # rank of the update matrices
    lora_alpha=32,                              # scaling factor for the update
    target_modules=["query", "key", "value"],  # attention projections to adapt
    lora_dropout=0.1,
)
peft_model = get_peft_model(base_model, lora_config)

# Train on domain-specific (query, positive) pairs
# ... training loop ...

# Save the adapter only (roughly 10MB at these settings, vs ~2GB for the full model)
peft_model.save_pretrained("legal-lora-adapter/")

The adapter can then be loaded into SIE for hot-swap deployment.


Frequently asked questions

Does a LoRA adapter change model inference speed? Negligibly. The adapter matrices are small and the extra computation is minimal. SIE’s batching absorbs this overhead.

Can I combine LoRA with quantisation? Yes. QLoRA (Quantised LoRA) quantises the frozen base model to 4-bit precision and trains LoRA adapters on top of it in higher precision (typically 16-bit). This is a common approach for fine-tuning large models on consumer hardware.

How much domain-specific training data do I need? LoRA adapters can be effective with as few as hundreds of (query, document) pairs. More data helps, but the low parameter count means LoRA is significantly less data-hungry than full fine-tuning.

