
Private fine-tuned compliance RAG

A regulatory-intelligence RAG stack that does two things stock embedding servers can’t.

Hot-loaded LoRA encoder. The base answerdotai/ModernBERT-base model lives on the GPU once. A named profile (us-regulatory) flips on the sugiv/modernbert-us-stablecoin-encoder LoRA adapter (8.77 MB, r=16, α=32) at request time and produces a domain-adapted 768-dim embedding. No separate deployment, no separate container, no model swap. Add another domain by adding another profile block to a YAML.

Custom cross-encoder that reranks and prunes in one pass. A PruningHead MLP (525K params) sits on top of a frozen BAAI/bge-reranker-v2-m3. The classifier output is the rerank score for sie.score(); the per-token hidden states become keep / drop probabilities exposed through sie.extract(). So one forward pass does both, and the surviving spans become the LLM context. Average compression: 74% character-count reduction on the reranked passages.

Everything needed to host this lives in server-plugin/ as a thin Docker-baked extension on top of the public sie-server image.

Regulatory RAG pipeline: custom LoRA encoder, cross-encoder reranker, and token-level pruner, all from one SIE cluster

Detail per stage:

  • Encode: ModernBERT-base with the us-regulatory profile, which hot-loads the sugiv/modernbert-us-stablecoin-encoder LoRA weights at request time. Produces a 768-dim domain-adapted embedding.
  • Dense retrieval: in-memory cosine similarity over the corpus, top-5 kept.
  • Score / Rerank: sugiv/stablebridge-pruner-highlighter cross-encoder, keeps top-3.
  • Extract / Prune: same Stablebridge model, second primitive. Returns token-level keep/drop probabilities aggregated into highlight / kept / pruned spans.
  • LLM context: the surviving highlight and kept spans become the compressed context passed downstream. Averages 74% compression vs. the raw reranked passages.
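The dense-retrieval stage above can be sketched with the standard library alone. This is a minimal illustration of in-memory cosine similarity with a top-k cut, not the code in rag_pipeline.py; the function names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=5):
    """Return (corpus_index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dim vectors stand in for the 768-dim embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], corpus, k=2)
print(hits)
```

With a corpus this small (12 passages in the sample), a linear scan like this is enough; a vector index only pays off at much larger scale.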

Most OSS inference servers assume you are running off-the-shelf models. Real teams rarely are. They fine-tune encoders on their domain, train pruner heads to cut LLM context costs, and mix and match. This example shows the full path (extend the server, register new models, hit them from the SDK) using a regulatory-intelligence use case built on public-good data.

Two model additions drive the pipeline:

| Model | Base | Task | What it adds |
| --- | --- | --- | --- |
| sugiv/modernbert-us-stablecoin-encoder | answerdotai/ModernBERT-base | encode | LoRA adapter (r=16, α=32, 8.77 MB) fine-tuned on US stablecoin regulations. Hot-loaded via the us-regulatory profile on the base model, no separate deployment. |
| sugiv/stablebridge-pruner-highlighter | BAAI/bge-reranker-v2-m3 | score, extract | PruningHead MLP (525K params) on top of the frozen reranker. Produces rerank scores and token-level keep/drop probabilities in one forward pass. |

| Feature | How it’s used here |
| --- | --- |
| encode | Domain-adapted dense embeddings (ModernBERT + LoRA) |
| score | Cross-encoder reranking of retrieved candidates |
| extract | Token-level pruning + sentence-level highlight spans |
| profiles | us-regulatory profile activates LoRA weights at request time |
| custom adapter | StablebridgePrunerAdapter extends sie_server.adapters.ModelAdapter to add pruning under the extract primitive |
| cost-based batching | SIE batches by token count, handling variable-length regulatory docs |
| model sharing | Encoder + pruner share one GPU via SIE’s LRU memory management |

Everything the pipeline needs on the server side is packaged in server-plugin/: the patch, the adapter, the YAMLs.

# From this directory
docker build -t sie-regulatory --build-arg SIE_TAG=latest-cuda12-default ./server-plugin
docker run --gpus all -p 8080:8080 sie-regulatory
# CPU-only also works for the tiny sample corpus:
# docker build -t sie-regulatory --build-arg SIE_TAG=latest-cpu-default ./server-plugin
# docker run -p 8080:8080 sie-regulatory

See the server-plugin README for what is in the image and how to extend it further.

# No dependencies to install; the client uses stdlib urllib.
python rag_pipeline.py

Options:

--url URL       SIE server URL (default: http://localhost:8080)
--query TEXT    Custom query (default: runs all sample regulatory queries)
--top-k N       Candidates from dense retrieval (default: 5)
--output PATH   Save results as JSON
--quiet         Minimal output
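For orientation, here is how a stdlib-only script could declare the flags listed above with argparse. This is a sketch under the assumption that the defaults match the help text; the actual parser in rag_pipeline.py may be structured differently, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Declare the documented flags; defaults mirror the option list above."""
    parser = argparse.ArgumentParser(description="Regulatory RAG pipeline client")
    parser.add_argument("--url", default="http://localhost:8080",
                        help="SIE server URL")
    parser.add_argument("--query", default=None,
                        help="Custom query (omit to run all sample queries)")
    parser.add_argument("--top-k", type=int, default=5,
                        help="Candidates from dense retrieval")
    parser.add_argument("--output", default=None,
                        help="Path for saving results as JSON")
    parser.add_argument("--quiet", action="store_true",
                        help="Minimal output")
    return parser

# argparse exposes --top-k as args.top_k.
args = build_parser().parse_args(["--top-k", "3", "--quiet"])
print(args.url, args.top_k, args.quiet)
```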

Benchmark results (RTX PRO 6000 Blackwell, 98GB VRAM)

| Operation | Mean latency | p95 latency | Notes |
| --- | --- | --- | --- |
| Encode (base) | 20 ms | 25 ms | 768-dim dense embedding |
| Encode (LoRA) | 23 ms | 27 ms | +3 ms for LoRA adapter switch |
| Score (2 candidates) | 15 ms | 17 ms | Cross-encoder reranking |
| Score (10 candidates) | 19 ms | 21 ms | Sub-linear scaling |
| Extract (single doc) | 17 ms | 19 ms | Pruning + highlighting |
| E2E pipeline | 61 ms | 66 ms | Encode, then Score(5), then Extract(1) |

Ranking was 100% correct on relevant vs. irrelevant passages. The final context averaged a 74% character-count reduction relative to the raw reranked passages. On smaller GPUs (A10G, L4), expect latencies 3 to 4 times higher than these.
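The compression figure is a character-count ratio. A minimal sketch of that arithmetic, assuming compression is defined as one minus kept characters over original characters (the helper name `compression_ratio` is hypothetical, and the pipeline may count separators differently):

```python
def compression_ratio(original_passages, kept_spans):
    """Fraction of characters removed: 1 - kept / original."""
    original_chars = sum(len(p) for p in original_passages)
    kept_chars = sum(len(s) for s in kept_spans)
    if original_chars == 0:
        return 0.0  # nothing to compress
    return 1.0 - kept_chars / original_chars

# 1,000 characters of reranked passages reduced to 260 characters of context.
ratio = compression_ratio(["x" * 600, "y" * 400], ["x" * 200, "y" * 60])
print(f"{ratio:.0%}")
```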

regulatory-rag/
├── rag_pipeline.py                 # 3-stage RAG pipeline (stdlib only)
├── sample_corpus.json              # 12 US regulatory passages
├── README.md
└── server-plugin/
    ├── README.md                   # What the plugin does, how to extend it
    ├── Dockerfile                  # Builds sie-server with the extensions baked in
    ├── encode_lora_routing.patch
    ├── adapters/
    │   └── stablebridge_pruner/
    │       └── __init__.py         # Custom ModelAdapter, 659 lines
    └── models/
        ├── answerdotai__ModernBERT-base.yaml
        └── sugiv__stablebridge-pruner-highlighter.yaml

LoRA as a profile. SIE serves LoRA adapters by loading the base model once and activating LoRA weights via named profiles. When the pipeline calls encode("answerdotai/ModernBERT-base", profile="us-regulatory"), SIE applies the sugiv/modernbert-us-stablecoin-encoder weights to the shared base, with no separate deployment or rebuild. Swap in another LoRA by adding another profile block to the model YAML.
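To make "applies the weights to the shared base" concrete, here is the standard LoRA update on a toy matrix: the effective weight is W + (α / r) · B · A. This is a sketch of the general technique only. Whether SIE merges the delta into the base weights or evaluates the adapter as a side path is not stated here, and the helper names are hypothetical.

```python
def matmul(left, right):
    """Plain-Python matrix product of two nested lists."""
    inner = len(right)
    cols = len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]

def lora_effective_weight(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the standard LoRA formulation.

    w:      out x in base weight (shared, never modified)
    lora_a: r x in down-projection
    lora_b: out x r up-projection
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# With the adapter's r=16 and alpha=32, the scale factor is 32 / 16 = 2.
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[0.0, 1.0]]      # rank-1 toy adapter for readability
b = [[1.0], [0.0]]
print(lora_effective_weight(base, a, b, alpha=2, r=1))
```

Because the base matrix is left untouched, several profiles can share it and differ only in their small A and B matrices, which is why the adapter is megabytes rather than the size of a full model.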

The pruner as an adapter. StablebridgePrunerAdapter wraps a frozen BAAI/bge-reranker-v2-m3 with a trained PruningHead MLP (1024 to 512 to 1). It exposes both score() and extract() from the same forward pass: the classifier output becomes the rerank score, and the per-token hidden states become keep/drop probabilities. That is a new kind of primitive that would not exist in a stock embedding server.
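The 525K parameter figure follows from the 1024 to 512 to 1 layer sizes. A quick check, assuming two fully connected layers with bias terms and no other trainable parameters (that layout is an assumption; the adapter source is the authority):

```python
def mlp_param_count(layer_sizes, bias=True):
    """Count weights (and optional biases) of a fully connected MLP."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out      # weight matrix
        if bias:
            total += fan_out           # one bias per output unit
    return total

# 1024*512 + 512 for the first layer, 512*1 + 1 for the second.
print(mlp_param_count([1024, 512, 1]))
```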

Entities are just SIE entities. The pruner returns semantic labels you can filter on downstream:

  • highlight (score ≥ 0.9): directly answers the query
  • kept (0.6 ≤ score < 0.9): supporting context worth preserving
  • pruned (score < 0.6): can be safely removed
  • summary: compression statistics
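The thresholds above map a span score to a label, and the highlight and kept spans are what survive into the LLM context. A minimal client-side sketch, assuming each span arrives as a (text, score) pair; the real entity shape returned by extract may differ, and both function names are hypothetical.

```python
HIGHLIGHT_THRESHOLD = 0.9
KEEP_THRESHOLD = 0.6

def label_for(score):
    """Map a keep probability to the label names used by the pruner."""
    if score >= HIGHLIGHT_THRESHOLD:
        return "highlight"
    if score >= KEEP_THRESHOLD:
        return "kept"
    return "pruned"

def build_context(spans):
    """Join the text of every span that is not pruned, preserving order."""
    return " ".join(text for text, score in spans if label_for(score) != "pruned")

spans = [
    ("Issuers must hold reserves.", 0.95),
    ("See also the appendix.", 0.30),
    ("Reserves are audited monthly.", 0.70),
]
print(build_context(spans))
```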

Built by @sugix as part of the SIE alpha tester program. The LoRA adapter was trained on top of answerdotai/ModernBERT-base using a curated corpus of US stablecoin regulation; the pruner head was trained on BEIR-style relevance judgments.

Apache 2.0, same as SIE.

By Sugi Venugeethan.
