---
title: "Building production inference: routing, batching, model configs, and LoRA in one cluster"
description: SIE handles routing in a stateless gateway, batching in worker pods, model configuration in a single-writer control plane, and LoRA adapters as a per-request option.
canonical_url: https://superlinked.com/blog/routing-batching-configs-lora-for-production-inference
last_updated: 2026-06-16
---

**One system handles all four.**

The Superlinked Inference Engine (SIE) puts routing in a stateless gateway, batching in the worker pods, model configuration in a single-writer control plane, and LoRA adapters in a per-request option.

Your application keeps calling `encode`, `score`, and `extract`, and the cluster does the production work underneath.

*It is open source: [github.com/superlinked/sie](https://github.com/superlinked/sie)*.

Each of the four gets its own section below.

<BlogSieCta />

## How do I manage routing, batching, model configs, and LoRA adapters for production inference?

SIE assigns each concern to one component: routing to a stateless gateway, batching to the worker pods, model configuration to a single-writer control plane, and LoRA to a per-request option. You operate one cluster and your code keeps calling three functions.

## Routing

A stateless Rust gateway sits between clients and GPU worker pods. Per request it resolves the model, bundle, machine profile, and resource pool from an in-memory registry, then publishes the work to a NATS JetStream queue. It also tracks worker health from heartbeats and isolates capacity with resource pools.

You never configure routes by hand. You name a model and the gateway places it:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://your-gateway:8080")
client.encode("BAAI/bge-m3", Item(text="routed by the gateway"))
```

If the target pool is scaled to zero, the gateway returns `202 Accepted` with a `Retry-After` header, and the SDK waits for capacity when you set `wait_for_capacity=True`. That is how scale-from-zero stays invisible to your code.
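
The wait can be pictured as a small retry loop. The sketch below is not the SDK's implementation: `call_until_capacity`, `send`, `max_wait_s`, and the `(status, headers, body)` tuple are hypothetical names chosen for illustration, and it assumes `Retry-After` carries a delay in seconds.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status code, headers, body).
Response = Tuple[int, Mapping[str, str], object]


def call_until_capacity(
    send: Callable[[], Response],
    max_wait_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Re-issue a request for as long as the server answers 202 Accepted.

    Assumes Retry-After is a delay in seconds; the HTTP-date form is not handled.
    """
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            return body  # any status other than 202 ends the wait
        delay = float(headers.get("Retry-After", "1"))
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Capping the total wait keeps a pool that never scales up from blocking the caller forever; the 300-second default is an arbitrary choice for the sketch.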

## Batching

Batching happens next to the GPU. Each worker pod runs a sidecar that pulls work from its queue and groups requests into batches by model, operation, and LoRA key, then hands fully formed batches to the model-execution process over local IPC. Keying on those three fields is what keeps batches correct: only requests for the same model, same operation, and same adapter combine.
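
To make the keying concrete, here is a minimal sketch of grouping by those three fields. It is an illustration, not the sidecar's code: the `Request` record, `group_into_batches`, and the `max_batch` limit are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BatchKey = Tuple[str, str, Optional[str]]  # (model, operation, LoRA key)


@dataclass(frozen=True)
class Request:
    """Hypothetical request record, not SIE's wire format."""
    model: str
    operation: str          # "encode", "score", or "extract"
    lora: Optional[str]     # None means the base model without an adapter
    payload: str


def group_into_batches(
    requests: List[Request], max_batch: int = 32
) -> List[Tuple[BatchKey, List[Request]]]:
    """Group requests by (model, operation, lora), splitting oversized groups."""
    groups: Dict[BatchKey, List[Request]] = defaultdict(list)
    for req in requests:
        groups[(req.model, req.operation, req.lora)].append(req)

    batches = []
    for key, members in groups.items():
        # Slice each group so no batch exceeds max_batch requests.
        for start in range(0, len(members), max_batch):
            batches.append((key, members[start:start + max_batch]))
    return batches
```

Two `encode` calls against `BAAI/bge-m3` land in the same batch only if they also name the same adapter (or both name none).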

Your lever is request shape. Send items in lists so the server can batch them:

```python
# chunks is a list of strings; client and Item come from the routing example above
client.encode("NovaSearch/stella_en_400M_v5", [Item(text=c) for c in chunks])
```

Batch-size and concurrency tuning live in [Performance Tuning](/docs/deployment/tuning).

## Model configs

Configuration is owned by `sie-config`, the authoritative control plane. It runs as a single writer, persists model configs, and publishes runtime deltas; gateways and worker pods converge on them asynchronously. Keeping writes off the hot path is what lets the inference edge stay stateless. Writes go through its HTTP API:

```bash
curl -X POST http://your-cluster/v1/configs/models \
  -H "Content-Type: application/json" \
  -d '{ "model_id": "your-org/your-encoder", "...": "..." }'
```

Gateways bootstrap from a full snapshot via `GET /v1/configs/export`, subscribe to live deltas, and poll `GET /v1/configs/epoch` to recover anything missed. To version-control this, see the [Config GitOps workflow](/docs/deployment/config-gitops).
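
The snapshot, delta, and epoch steps form a common convergence pattern, sketched below. This shows the idea only, not the gateway's actual logic. It assumes the epoch is an integer that increases by one per write and that each delta names the epoch it produces; the `ConfigRegistry` class and the payload field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ConfigRegistry:
    """In-memory view of model configs held by one gateway (illustrative only)."""
    epoch: int = 0
    models: Dict[str, Any] = field(default_factory=dict)

    def load_snapshot(self, snapshot: Dict[str, Any]) -> None:
        # Replace local state with the full export.
        self.epoch = snapshot["epoch"]
        self.models = dict(snapshot["models"])

    def apply_delta(self, delta: Dict[str, Any]) -> bool:
        """Apply a delta if it is the next epoch; return False when a gap is seen."""
        if delta["epoch"] <= self.epoch:
            return True   # duplicate or stale delta, already reflected locally
        if delta["epoch"] != self.epoch + 1:
            return False  # at least one delta was missed; caller should re-sync
        self.models[delta["model_id"]] = delta["config"]
        self.epoch = delta["epoch"]
        return True

    def is_behind(self, server_epoch: int) -> bool:
        # Compare local state with the value polled from the epoch endpoint.
        return server_epoch > self.epoch
```

When `apply_delta` returns `False` or `is_behind` returns `True`, the caller would fetch a fresh snapshot and call `load_snapshot` again. No gateway ever needs to write, which is what keeps it stateless.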

## LoRA adapters

Adapters are a per-request option, never a separate deployment. You pass the adapter name on the call; the base model loads once and is shared, the adapter applies on top, and the batching layer keys on the adapter so different adapters batch separately.

```python
client.encode("BAAI/bge-m3", Item(text="indemnification"), options={"lora": "legal"})
```

The model-execution process owns model loading, LoRA loading, and memory-pressure eviction. Full details: [LoRA Adapters](/docs/engine/lora).
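
One way to picture adapter eviction is a small cache in front of the loader. The sketch below assumes a least-recently-used policy with a fixed adapter count, which is an assumption, not a description of SIE's eviction logic. A real implementation would more likely track memory in bytes. `AdapterCache` and its `load` callback are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

A = TypeVar("A")


class AdapterCache(Generic[A]):
    """Keeps at most `capacity` adapters resident; evicts the least recently used."""

    def __init__(self, load: Callable[[str], A], capacity: int = 4) -> None:
        self._load = load
        self._capacity = capacity
        self._resident: "OrderedDict[str, A]" = OrderedDict()

    def get(self, name: str) -> A:
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        adapter = self._load(name)  # cold path: load the adapter weights
        self._resident[name] = adapter
        if len(self._resident) > self._capacity:
            self._resident.popitem(last=False)  # drop the least recently used
        return adapter

    def resident(self) -> List[str]:
        # Adapter names ordered from least to most recently used.
        return list(self._resident)
```

The base model sits outside this cache: it loads once and stays resident, while only the adapters on top of it come and go.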

## How the four fit together

```
Client SDK
  -> sie-gateway      routing: resolves model, bundle, profile, pool
  -> NATS JetStream   queue
  -> worker sidecar   batching: groups by model, operation, LoRA key
  -> sie-server       execution: model + LoRA loading, GPU inference

Admin -> sie-config   model configs: single-writer control plane
```

Four responsibilities, one cluster to operate. Stand it up with the same Helm chart used everywhere else:

```bash
helm upgrade --install sie-cluster oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie --create-namespace \
  --set hfToken.create=true \
  --set hfToken.value="${HF_TOKEN}" \
  -f deploy/helm/sie-cluster/values-gke.yaml
```

Further reading, in order of depth:

- [Engine architecture](/docs/engine/architecture)
- [Gateway internals](/docs/engine/router)
- [LoRA adapters](/docs/engine/lora)
- [Repository](https://github.com/superlinked/sie)
