---
title: Private fine-tuned compliance RAG
description: A regulatory RAG pipeline that hot-loads a domain LoRA at request time and reranks plus prunes context in one forward pass.
canonical_url: https://superlinked.com/docs/examples/regulatory-intelligence-rag
last_updated: 2026-05-20
---

<LinkCard title="View on GitHub" description="examples/regulatory-rag" href="https://github.com/superlinked/sie/tree/main/examples/regulatory-rag" />

## What this is

A regulatory-intelligence RAG stack that does two things stock embedding servers can't.

**Hot-loaded LoRA encoder.** The base `answerdotai/ModernBERT-base` model lives on the GPU once. A named profile (`us-regulatory`) flips on the `sugiv/modernbert-us-stablecoin-encoder` LoRA adapter (8.77 MB, r=16, α=32) at request time and produces a domain-adapted 768-dim embedding. No separate deployment, no separate container, no model swap. Add another domain by adding another profile block to a YAML.

**Custom cross-encoder that reranks and prunes in one pass.** A `PruningHead` MLP (525K params) sits on top of a frozen `BAAI/bge-reranker-v2-m3`. The classifier output is the rerank score returned by `sie.score()`; the per-token hidden states become keep / drop probabilities exposed through `sie.extract()`. A single forward pass therefore yields both the ranking and the pruning mask, and the surviving spans become the LLM context. Average compression: 74% character-count reduction on the reranked passages.

Everything needed to host this lives in [`server-plugin/`](https://github.com/superlinked/sie/tree/main/examples/regulatory-rag/server-plugin) as a thin Docker-baked extension on top of the public `sie-server` image.

## The pipeline

<div style="background:#ffffff;padding:12px;border-radius:8px;margin:1rem 0;">
  ![Regulatory RAG pipeline: custom LoRA encoder, cross-encoder reranker, and token-level pruner, all from one SIE cluster](https://raw.githubusercontent.com/superlinked/sie/main/examples/regulatory-rag/assets/pipeline.svg)
</div>

Detail per stage:

- **Encode**: `ModernBERT-base` with the `us-regulatory` profile, which hot-loads the `sugiv/modernbert-us-stablecoin-encoder` LoRA weights at request time. Produces a 768-dim domain-adapted embedding.
- **Dense retrieval**: in-memory cosine similarity over the corpus, top-5 kept.
- **Score / Rerank**: `sugiv/stablebridge-pruner-highlighter` cross-encoder, keeps top-3.
- **Extract / Prune**: same Stablebridge model, second primitive. Returns token-level keep/drop probabilities aggregated into `highlight` / `kept` / `pruned` spans.
- **LLM context**: the surviving `highlight` and `kept` spans become the compressed context passed downstream. Averages 74% compression vs. the raw reranked passages.
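
The dense-retrieval stage above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the code in `rag_pipeline.py`: the function names `cosine` and `top_k` are hypothetical, and the toy 2-dim vectors stand in for the 768-dim embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=5):
    """Return (corpus_index, similarity) pairs for the k best matches."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy corpus; real vectors would come from the encode stage.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], corpus, k=2)
```

A brute-force scan like this is reasonable for a 12-passage sample corpus; a larger corpus would normally call for a vector index instead.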

## Why this example exists

Most OSS inference servers assume you are running off-the-shelf models. Real teams rarely stop there: they fine-tune encoders on their domain, train pruner heads to cut LLM context costs, and mix and match. This example shows the full path (**extend the server, register new models, hit them from the SDK**) using a regulatory-intelligence use case built on public-good data.

Two model additions drive the pipeline:

| Model | Base | Task | What it adds |
|-------|------|------|--------------|
| [`sugiv/modernbert-us-stablecoin-encoder`](https://huggingface.co/sugiv/modernbert-us-stablecoin-encoder) | `answerdotai/ModernBERT-base` | encode | LoRA adapter (r=16, α=32, 8.77 MB) fine-tuned on US stablecoin regulations. Hot-loaded via the `us-regulatory` profile on the base model, no separate deployment. |
| [`sugiv/stablebridge-pruner-highlighter`](https://huggingface.co/sugiv/stablebridge-pruner-highlighter) | `BAAI/bge-reranker-v2-m3` | score, extract | `PruningHead` MLP (525K params) on top of the frozen reranker. Produces rerank scores *and* token-level keep/drop probabilities in one forward pass. |

## SIE features demonstrated

| Feature | How it's used here |
|---------|-------------------|
| **encode** | Domain-adapted dense embeddings (ModernBERT + LoRA) |
| **score** | Cross-encoder reranking of retrieved candidates |
| **extract** | Token-level pruning + sentence-level highlight spans |
| **profiles** | `us-regulatory` profile activates LoRA weights at request time |
| **custom adapter** | `StablebridgePrunerAdapter` extends `sie_server.adapters.ModelAdapter` to add pruning under the `extract` primitive |
| **cost-based batching** | SIE batches by token count, handling variable-length regulatory docs |
| **model sharing** | Encoder + pruner share one GPU via SIE's LRU memory management |
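
To make the cost-based batching row concrete, here is a generic greedy sketch of batching by token count. SIE's actual scheduler is not shown in this document, so treat this as an illustration of the idea only; `batch_by_token_budget` and the budget of 1000 tokens are hypothetical.

```python
def batch_by_token_budget(token_counts, max_tokens_per_batch):
    """Greedily group item indices so each batch stays within a token budget.

    An item larger than the budget still gets a batch of its own.
    """
    batches, current, current_cost = [], [], 0
    for idx, n_tokens in enumerate(token_counts):
        # Flush the open batch if adding this item would exceed the budget.
        if current and current_cost + n_tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_cost = [], 0
        current.append(idx)
        current_cost += n_tokens
    if current:
        batches.append(current)
    return batches

# Variable-length documents: two short clauses, two mid-size, one very long.
batches = batch_by_token_budget([100, 900, 50, 50, 2000], max_tokens_per_batch=1000)
```

Batching on token count rather than item count keeps one long regulation from sharing a batch with dozens of short clauses and inflating its cost.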

## Quick start

### 1. Build a custom sie-server image

Everything the pipeline needs on the server side is packaged in [`server-plugin/`](https://github.com/superlinked/sie/tree/main/examples/regulatory-rag/server-plugin): the patch, the adapter, the YAMLs.

```bash
# From examples/regulatory-rag
docker build -t sie-regulatory --build-arg SIE_TAG=latest-cuda12-default ./server-plugin
docker run --gpus all -p 8080:8080 sie-regulatory
# CPU-only also works for the tiny sample corpus:
# docker build -t sie-regulatory --build-arg SIE_TAG=latest-cpu-default ./server-plugin
# docker run -p 8080:8080 sie-regulatory
```

See the [server-plugin README](https://github.com/superlinked/sie/blob/main/examples/regulatory-rag/server-plugin/README.md) for what is in the image and how to extend it further.

### 2. Run the pipeline

```bash
# No dependencies to install; the client uses stdlib urllib.
python rag_pipeline.py
```

Options:

```
--url URL      SIE server URL (default: http://localhost:8080)
--query TEXT   Custom query (default: runs all sample regulatory queries)
--top-k N      Candidates from dense retrieval (default: 5)
--output PATH  Save results as JSON
--quiet        Minimal output
```
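
For readers adapting the client, the options above map naturally onto `argparse`. The sketch below mirrors the listed flags and defaults; it is an assumption about how such a parser could look, not a copy of the one in `rag_pipeline.py`, and the sample query string is hypothetical.

```python
import argparse

def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(description="Regulatory RAG pipeline client")
    parser.add_argument("--url", default="http://localhost:8080", help="SIE server URL")
    parser.add_argument("--query", default=None, help="Custom query; omit to run all sample queries")
    parser.add_argument("--top-k", type=int, default=5, help="Candidates from dense retrieval")
    parser.add_argument("--output", default=None, help="Save results as JSON to this path")
    parser.add_argument("--quiet", action="store_true", help="Minimal output")
    return parser

# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--query", "What reserves must a stablecoin issuer hold?", "--top-k", "3"]
)
```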

## Benchmark results (RTX PRO 6000 Blackwell, 98 GB VRAM)

| Operation | Mean Latency | p95 Latency | Notes |
|-----------|-------------|-------------|-------|
| Encode (base)          | 20 ms | 25 ms | 768-dim dense embedding |
| Encode (LoRA)          | 23 ms | 27 ms | +3 ms for LoRA adapter switch |
| Score (2 candidates)   | 15 ms | 17 ms | Cross-encoder reranking |
| Score (10 candidates)  | 19 ms | 21 ms | Sub-linear scaling |
| Extract (single doc)   | 17 ms | 19 ms | Pruning + highlighting |
| **E2E Pipeline**       | **61 ms** | **66 ms** | Encode, then Score(5), then Extract(1) |

Ranking was 100% correct on relevant vs. irrelevant passages, and the final context averaged 74% character-count compression relative to the raw reranked passages. On smaller GPUs (A10G, L4), expect latencies roughly 3 to 4 times those shown.

## Files

```
regulatory-rag/
├── rag_pipeline.py          # 3-stage RAG pipeline (stdlib only)
├── sample_corpus.json       # 12 US regulatory passages
├── README.md
└── server-plugin/
    ├── README.md            # What the plugin does, how to extend it
    ├── Dockerfile           # Builds sie-server with the extensions baked in
    ├── encode_lora_routing.patch
    ├── adapters/
    │   └── stablebridge_pruner/
    │       └── __init__.py  # Custom ModelAdapter, 659 lines
    └── models/
        ├── answerdotai__ModernBERT-base.yaml
        └── sugiv__stablebridge-pruner-highlighter.yaml
```

## Architecture notes

**LoRA as a profile.** SIE serves LoRA adapters by loading the base model once and activating LoRA weights via named profiles. When the pipeline calls `encode("answerdotai/ModernBERT-base", profile="us-regulatory")`, SIE applies the `sugiv/modernbert-us-stablecoin-encoder` weights to the shared base, with no separate deployment or rebuild. Swap in another LoRA by adding another profile block to the model YAML.
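
As background on why the base model can stay resident while adapters come and go, the standard LoRA formulation adds a low-rank update to each adapted weight matrix: `W_eff = W + (α / r) · B · A`. The sketch below works through that arithmetic on toy matrices. It illustrates the general technique, not SIE's loading code, and whether SIE merges the weights or applies them on the fly is not stated here. The helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA update."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# The adapter in this example uses r=16 and alpha=32, so the update is scaled by 2.
adapter_scale = 32 / 16

# Toy rank-1 case: a 2x2 base weight with B of shape (2, 1) and A of shape (1, 2).
w_eff = lora_effective_weight(
    w=[[1.0, 0.0], [0.0, 1.0]],
    lora_b=[[1.0], [0.0]],
    lora_a=[[0.0, 3.0]],
    alpha=2,
    rank=1,
)
```

Because only the small `A` and `B` matrices differ between domains, an adapter of a few megabytes is enough to specialize the shared base.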

**The pruner as an adapter.** `StablebridgePrunerAdapter` wraps a frozen `BAAI/bge-reranker-v2-m3` with a trained `PruningHead` MLP (1024 to 512 to 1). It exposes both `score()` and `extract()` from the same forward pass: the classifier output becomes the rerank score, and the per-token hidden states become keep/drop probabilities. A stock embedding server has no primitive like this.
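
The 525K parameter figure follows from the 1024 to 512 to 1 layer sizes. A quick check, assuming the head is two plain linear layers with biases and no other trainable parameters (the helper `mlp_param_count` is hypothetical):

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )

# 1024*512 + 512 for the hidden layer, then 512*1 + 1 for the output layer.
pruning_head_params = mlp_param_count([1024, 512, 1])
print(pruning_head_params)  # 525313
```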

**Spans are just SIE entities.** The pruner returns its spans as ordinary SIE entities, with semantic labels you can filter on downstream:

- **`highlight`** (score ≥ 0.9): directly answers the query
- **`kept`** (0.6 ≤ score < 0.9): supporting context worth preserving
- **`pruned`** (score < 0.6): can be safely removed
- **`summary`**: compression statistics
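
The thresholds above translate directly into a labelling rule, and the surviving spans give the compressed context. The sketch below assumes each span is a `(text, score)` tuple, which is a hypothetical shape chosen for illustration; the actual `extract()` response format is not shown in this document.

```python
def label_for(score, highlight_at=0.9, keep_at=0.6):
    """Map a keep probability to the labels listed above."""
    if score >= highlight_at:
        return "highlight"
    if score >= keep_at:
        return "kept"
    return "pruned"

def build_context(spans):
    """Join non-pruned spans and report character-count compression.

    spans: list of (text, score) tuples in document order (hypothetical shape).
    """
    original = " ".join(text for text, _ in spans)
    context = " ".join(text for text, score in spans if label_for(score) != "pruned")
    compression = 1 - len(context) / len(original) if original else 0.0
    return context, compression

context, compression = build_context(
    [("aaaa", 0.95), ("bbbb", 0.70), ("cccccccccc", 0.10)]
)
```

Here the 20-character original shrinks to a 9-character context, so `compression` is 0.55; the 74% figure reported for this example is the same ratio averaged over the reranked passages.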

## Credits

Built by [@sugix](https://github.com/sugix) as part of the SIE alpha tester program. LoRA training used `answerdotai/ModernBERT-base` on a curated corpus of US stablecoin regulation; pruner head trained on BEIR-style relevance judgments.

## License

Apache 2.0, same as SIE.

By Sugi Venugeethan.