Building production inference: routing, batching, model configs, and LoRA in one cluster
One system handles all four.
The Superlinked Inference Engine (SIE) puts routing in a stateless gateway, batching in the worker pods, model configuration in a single-writer control plane, and LoRA adapters in a per-request option.
Your application keeps calling encode, score, and extract, and the cluster does the production work underneath.
It is open source: github.com/superlinked/sie.
Each of the four is one section below.
How do I manage routing, batching, model configs, and LoRA adapters for production inference?
SIE assigns each concern to one component: routing to a stateless gateway, batching to the worker pods, model configuration to a single-writer control plane, and LoRA to a per-request option. You operate one cluster and your code keeps calling three functions.
Routing
A stateless Rust gateway sits between clients and GPU worker pods. Per request it resolves the model, bundle, machine profile, and resource pool from an in-memory registry, then publishes the work to a NATS JetStream queue. It also tracks worker health from heartbeats and isolates capacity with resource pools.
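To make the resolution step concrete, here is a minimal Python sketch of a lookup against an in-memory registry. It is an illustration only: the real gateway is written in Rust, and the `Route` and `Registry` names, their fields, and the example values ("default", "l4", "shared") are all hypothetical rather than SIE's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical result of resolving one request."""
    model: str
    bundle: str
    profile: str
    pool: str


class Registry:
    """Toy in-memory registry mapping a model name to its route."""

    def __init__(self) -> None:
        self._routes = {}

    def register(self, route: Route) -> None:
        self._routes[route.model] = route

    def resolve(self, model: str) -> Route:
        # A pure dictionary lookup: no I/O and no per-request state to persist.
        try:
            return self._routes[model]
        except KeyError:
            raise LookupError(f"unknown model: {model}") from None


registry = Registry()
registry.register(Route("BAAI/bge-m3", bundle="default", profile="l4", pool="shared"))
route = registry.resolve("BAAI/bge-m3")
```

Because resolution only reads local memory, any gateway replica holding the same registry can serve any request, which is the sense in which the gateway is stateless.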
You never configure routes by hand. You name a model and the gateway places it:
    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://your-gateway:8080")
    client.encode("BAAI/bge-m3", Item(text="routed by the gateway"))

If the target pool is scaled to zero, the gateway returns 202 Accepted with a Retry-After header, and the SDK waits for capacity when you set wait_for_capacity=True. That is how scale-from-zero stays invisible to your code.
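For intuition about what waiting for capacity involves, the following sketch shows a generic retry loop around a 202 response. It is not the SDK's implementation: `call_until_ready`, its parameters, and the `(status, headers, body)` tuple shape are assumptions made for illustration, and it only handles a Retry-After value expressed in seconds.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status code, headers, body).
Response = Tuple[int, Mapping[str, str], object]


def call_until_ready(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call send() until it stops answering 202, honoring Retry-After."""
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 202:
            # Error handling for non-2xx codes is omitted in this sketch.
            return body
        # Retry-After may also be an HTTP date; only seconds are handled here.
        try:
            delay = float(headers.get("Retry-After", default_delay))
        except ValueError:
            delay = default_delay
        sleep(delay)
    raise TimeoutError(f"no capacity after {max_attempts} attempts")


# Simulated gateway: one 202 while the pool scales up, then a normal reply.
replies = iter([(202, {"Retry-After": "2"}, None), (200, {}, {"ok": True})])
waits = []
result = call_until_ready(lambda: next(replies), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in the simulated run the caller waits once, for the two seconds the header asked for.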
Batching
Batching happens next to the GPU. Each worker pod runs a sidecar that pulls work from its queue and groups requests into batches by model, operation, and LoRA key, then hands fully formed batches to the model-execution process over local IPC. Keying on those three fields is what keeps batches correct: only requests for the same model, same operation, and same adapter combine.
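The grouping rule can be sketched in a few lines of Python. This is a simplified illustration, not the sidecar's code: the `Request` class, the `group_into_batches` function, and the `max_batch_size` parameter are hypothetical names, and real batching would also account for timing and token budgets.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    model: str
    operation: str        # "encode", "score", or "extract"
    lora: Optional[str]   # adapter name, or None for the base model
    payload: str


def group_into_batches(requests, max_batch_size=32):
    """Group requests by (model, operation, lora), then split into batches."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[(req.model, req.operation, req.lora)].append(req)

    batches = []
    for key, items in buckets.items():
        # Requests sharing a key may run together; cap each batch at max size.
        for start in range(0, len(items), max_batch_size):
            batches.append((key, items[start:start + max_batch_size]))
    return batches


pending = [
    Request("BAAI/bge-m3", "encode", "legal", "indemnification"),
    Request("BAAI/bge-m3", "encode", None, "plain query"),
    Request("BAAI/bge-m3", "encode", "legal", "force majeure"),
]
batches = group_into_batches(pending)
```

Here the two "legal" requests land in one batch and the base-model request in another, even though all three target the same model and operation.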
Your lever is request shape. Send items in lists so the server can batch them:
    client.encode("NovaSearch/stella_en_400M_v5", [Item(text=c) for c in chunks])

Batch-size and concurrency tuning live in Performance Tuning.
Model configs
Configuration is owned by sie-config, the authoritative control plane that runs as a single writer, persists model configs, and publishes runtime deltas while gateways and worker pods converge asynchronously. Keeping writes off the hot path is the reason the inference edge stays stateless.
    curl -X POST http://your-cluster/v1/configs/models \
      -H "Content-Type: application/json" \
      -d '{ "model_id": "your-org/your-encoder", "...": "..." }'

Gateways bootstrap from a full snapshot via GET /v1/configs/export, subscribe to live deltas, and poll GET /v1/configs/epoch to recover anything missed. To version-control this, see the Config GitOps workflow.
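One way to picture the snapshot, delta, and epoch-poll cycle is the toy model below. It rests on an assumption the text does not spell out: that every delta carries a monotonically increasing integer epoch. The `ConfigView` class and its method names are hypothetical and do not reflect SIE's wire format.

```python
class ConfigView:
    """Toy local view of model configs held by a gateway or worker."""

    def __init__(self, snapshot: dict, epoch: int) -> None:
        # Bootstrap from a full snapshot taken at a known epoch.
        self.models = dict(snapshot)
        self.epoch = epoch

    def apply_delta(self, delta_epoch: int, model_id: str, config: dict) -> bool:
        """Apply a delta only if it is the next one in sequence."""
        if delta_epoch != self.epoch + 1:
            # Stale or out of order: skip it and let the epoch poll repair it.
            return False
        self.models[model_id] = config
        self.epoch = delta_epoch
        return True

    def needs_resync(self, server_epoch: int) -> bool:
        """True when the control plane is ahead, meaning a delta was missed."""
        return server_epoch > self.epoch


view = ConfigView({"your-org/your-encoder": {"dim": 768}}, epoch=3)
applied = view.apply_delta(4, "your-org/other-encoder", {"dim": 1024})
skipped = view.apply_delta(6, "your-org/third-encoder", {"dim": 384})  # epoch 5 was lost
stale = view.needs_resync(server_epoch=6)
```

In this sketch a missed delta never corrupts the local view: the gap is detected on the next epoch poll, and the holder would reload a full snapshot instead of guessing at the missing change.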
LoRA adapters
Adapters are a per-request option, never a separate deployment. You pass the adapter name on the call; the base model loads once and is shared, the adapter applies on top, and the batching layer keys on the adapter so different adapters batch separately.
    client.encode("BAAI/bge-m3", Item(text="indemnification"), options={"lora": "legal"})

The model-execution process owns model loading, LoRA loading, and memory-pressure eviction. Full details: LoRA Adapters.
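To illustrate how adapters can share one process while staying within memory limits, here is a small cache sketch. The eviction policy is an assumption: the text says only that eviction responds to memory pressure, so least-recently-used with a fixed adapter count stands in for it here. `AdapterCache`, `max_adapters`, and the loader callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """Keeps a bounded set of loaded adapters, evicting the least recently used."""

    def __init__(self, load: Callable[[str], object], max_adapters: int = 2) -> None:
        self._load = load
        self._max = max_adapters
        self._adapters = OrderedDict()

    def get(self, name: str) -> object:
        if name in self._adapters:
            # Cache hit: mark as most recently used.
            self._adapters.move_to_end(name)
            return self._adapters[name]
        adapter = self._load(name)
        self._adapters[name] = adapter
        if len(self._adapters) > self._max:
            # Over budget: drop the least recently used adapter.
            self._adapters.popitem(last=False)
        return adapter

    def resident(self) -> list:
        return list(self._adapters)


loads = []  # records every time an adapter is actually loaded


def fake_load(name: str) -> str:
    loads.append(name)
    return f"weights:{name}"


cache = AdapterCache(load=fake_load, max_adapters=2)
for name in ["legal", "medical", "legal", "finance"]:
    cache.get(name)
```

After this sequence "medical" has been evicted, "legal" and "finance" remain resident, and "legal" was loaded only once despite being requested twice.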
How the four fit together
    Client SDK
      -> sie-gateway            routing: resolves model, bundle, profile, pool
      -> NATS JetStream queue
      -> worker sidecar         batching: groups by model, operation, LoRA key
      -> sie-server             execution: model + LoRA loading, GPU inference

    Admin
      -> sie-config             model configs: single-writer control plane

Four responsibilities, one cluster to operate. Stand it up with the same Helm chart used everywhere else:
    helm upgrade --install sie-cluster oci://ghcr.io/superlinked/charts/sie-cluster \
      --namespace sie --create-namespace \
      --set hfToken.create=true \
      --set hfToken.value=<TOKEN> \
      -f deploy/helm/sie-cluster/values-gke.yaml

Further reading, in order of depth: