How SIE processes inference requests

SIE processes every request through a multi-stage pipeline: preprocessing, cost-based batching, GPU inference, and postprocessing. In Kubernetes, the SIE server sidecar inside each worker pod owns queue intake and batch formation before it calls the Python sie-server adapter process. In Docker, one sie-server process runs the same broad stages in-process.

The pipeline handles tokenisation, dynamic batching across concurrent requests, model-specific inference via the best available compute engine, and output transforms like MUVERA and quantisation.

For a visual overview, see the Architecture page.

How Does SIE Preprocess Requests?

Before any model inference, SIE prepares inputs into the format each model expects.

Tokenisation converts raw text into token IDs using the model’s tokenizer. Sequence lengths are calculated at this stage and feed directly into the batching cost model.
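As a rough illustration of why sequence lengths fall out of tokenisation, the sketch below uses a toy whitespace tokenizer. It is a stand-in for the model's real tokenizer, and the names `toy_tokenize` and `PreparedItem` are hypothetical, not part of SIE.

```python
from dataclasses import dataclass


@dataclass
class PreparedItem:
    text: str
    token_ids: list[int]

    @property
    def seq_len(self) -> int:
        # The length that a cost model would later read.
        return len(self.token_ids)


def toy_tokenize(text: str, vocab: dict[str, int]) -> PreparedItem:
    """Map each whitespace-separated word to an integer ID, growing the vocab as needed."""
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return PreparedItem(text=text, token_ids=ids)


vocab: dict[str, int] = {}
items = [toy_tokenize(t, vocab) for t in ["hello world", "hello again dear world"]]
lengths = [item.seq_len for item in items]
```

A real tokenizer splits text into subword units, so actual lengths will differ, but the flow is the same: tokenise once, then reuse the length downstream.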

Image processing resizes and normalises images for vision models. The specific transform depends on the model’s expected input format.

How Does Cost-Based Batching Work?

Rather than batching by item count, SIE batches by estimated compute cost. This avoids wasting GPU cycles on padding short sequences to the length of one long outlier.

In queue-mode clusters, the SIE server sidecar forms batches by model, operation, and LoRA key, then sends a RunBatch IPC call to the sie-server adapter. Standalone sie-server uses the Python batcher inside the same process.
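The grouping step can be pictured as bucketing work items by a composite key. The sketch below is a minimal illustration, assuming a hypothetical `WorkItem` shape; the model and operation names are made up, and the real sidecar's data structures are not described here.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class WorkItem(NamedTuple):
    model: str            # hypothetical model identifier
    operation: str        # hypothetical operation name
    lora: Optional[str]   # LoRA key, or None for the base model
    payload: str


def group_by_batch_key(items):
    """Bucket items so each bucket shares model, operation, and LoRA key."""
    groups = defaultdict(list)
    for item in items:
        groups[(item.model, item.operation, item.lora)].append(item)
    return dict(groups)


pending = [
    WorkItem("model-a", "encode", None, "first"),
    WorkItem("model-a", "encode", "lora-x", "second"),
    WorkItem("model-a", "encode", None, "third"),
    WorkItem("model-b", "score", None, "fourth"),
]
groups = group_by_batch_key(pending)
```

Only items that land in the same bucket can share a forward pass, which is why the LoRA key is part of the batch key.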

How it works:

1. Each item’s token sequence length is used to estimate its compute cost.
2. Items are grouped into batches up to a configurable cost ceiling.
3. Sequences within a batch are padded to the longest item in that batch only, not to the global maximum.

Cost semantics: Cost is proportional to sequence length squared, because attention compares every token with every other token. A 512-token sequence therefore costs roughly four times as much as a 256-token sequence ((512 / 256)² = 4).
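The following is a minimal sketch of batching under a cost ceiling, assuming the cost of an item is exactly its sequence length squared. The function names, the greedy strategy, the sort by length, and the ceiling value are all assumptions for illustration and may not match SIE's batcher.

```python
def estimate_cost(seq_len: int) -> int:
    """Assumed cost model: quadratic in sequence length."""
    return seq_len * seq_len


def form_batches(seq_lens, max_cost):
    """Greedily fill batches in length order until the cost ceiling would be exceeded."""
    batches = []
    current = []
    current_cost = 0
    for n in sorted(seq_lens):
        cost = estimate_cost(n)
        # Close the current batch if adding this item would break the ceiling.
        if current and current_cost + cost > max_cost:
            batches.append(current)
            current, current_cost = [], 0
        # An item that exceeds the ceiling on its own still gets a batch to itself.
        current.append(n)
        current_cost += cost
    if current:
        batches.append(current)
    return batches


batches = form_batches([512, 32, 64, 256, 48, 500], max_cost=300_000)
```

With these inputs the four short sequences share one batch, while each long sequence gets its own, because one 500-token item already uses most of the ceiling.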

Padding optimisation: By grouping items of similar length, SIE minimises padding overhead. This is especially valuable for workloads with variable-length documents.
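To make the padding saving concrete, the sketch below counts wasted padding tokens for the same four sequences batched two ways. The sequence lengths and helper names are illustrative assumptions.

```python
def padded_tokens(batches):
    """Tokens processed when every item is padded to its batch's longest item."""
    return sum(len(batch) * max(batch) for batch in batches)


def real_tokens(batches):
    """Tokens that carry actual content."""
    return sum(sum(batch) for batch in batches)


mixed = [[16, 512], [24, 480]]    # short and long items share batches
grouped = [[16, 24], [480, 512]]  # similar lengths share batches

mixed_waste = padded_tokens(mixed) - real_tokens(mixed)
grouped_waste = padded_tokens(grouped) - real_tokens(grouped)
```

Here the mixed layout wastes 952 padding tokens against 40 for the grouped layout, for the same 1,032 real tokens.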

How Does GPU Inference Work?

SIE abstracts over multiple compute engines and selects the best one per model automatically.

Cross-request batching: The server collects items from concurrent API requests and batches them together into a single GPU forward pass. This significantly improves GPU utilisation under load compared to processing each request in isolation.
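One common way to implement cross-request batching is a shared queue with a short collection window. The sketch below shows that pattern with `asyncio`; it is not SIE's implementation, and `fake_forward`, `max_batch`, and `max_wait` are hypothetical stand-ins.

```python
import asyncio

forward_calls = []  # batch size of each simulated forward pass


def fake_forward(texts):
    """Stand-in for a GPU forward pass: returns one result per input."""
    forward_calls.append(len(texts))
    return [len(t) for t in texts]


async def batch_worker(queue, max_batch, max_wait):
    loop = asyncio.get_running_loop()
    while True:
        # Block until one item arrives, then briefly collect more.
        batch = [await queue.get()]
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # One forward pass serves items from several requests.
        outputs = fake_forward([text for text, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def handle_request(queue, text):
    """Each API request enqueues its item and waits for its own result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((text, future))
    return await future


async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue, max_batch=8, max_wait=0.05))
    results = await asyncio.gather(
        *(handle_request(queue, t) for t in ["a", "bb", "ccc"])
    )
    worker.cancel()
    return results


results = asyncio.run(main())
```

The collection window trades a little latency for larger batches, which is the same trade-off that shows up as batch wait time in the timing breakdown.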

Worker execution: In Kubernetes, each GPU worker pod has the SIE server sidecar and a Python sie-server adapter container. The gateway publishes work to the pool’s JetStream stream; the sidecar pulls and batches it; the sie-server adapter loads the model on first use and runs inference. Models stay warm in memory until evicted by the LRU policy. See Gateway and Model Adapters.

What Postprocessing Does SIE Apply?

After GPU inference, SIE applies output transforms before returning results.

MUVERA transform converts multi-vector (ColBERT) outputs into a fixed-size representation compatible with standard ANN indexes. This makes multi-vector search practical without specialised infrastructure.

Quantisation optionally compresses float32 embeddings to int8 or binary format to reduce storage and speed up ANN search. See Quantisation.
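As an illustration of what int8 and binary compression involve, the sketch below uses symmetric per-vector scaling for int8 and sign bits for binary. These are common schemes chosen as assumptions; the scaling, calibration, and bit order that SIE actually uses are not specified here.

```python
def quantize_int8(vec):
    """Scale so the largest magnitude maps to 127, then round to integers."""
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0:
        return [0] * len(vec), 1.0
    codes = [round(x * 127 / max_abs) for x in vec]
    scale = max_abs / 127  # multiply a code by scale to approximate the original
    return codes, scale


def quantize_binary(vec):
    """Keep one sign bit per dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)


embedding = [0.25, -1.0, 0.75, 0.0, 1.0, -0.25, 0.1, -0.6]
codes, scale = quantize_int8(embedding)
bits = quantize_binary(embedding)
```

For this 8-dimensional vector, float32 storage takes 32 bytes, int8 takes 8 bytes, and binary takes 1 byte.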

How Does SIE Manage GPU Memory?

SIE uses LRU (least recently used) eviction to share one GPU across many models without requiring them all to be loaded simultaneously.

Pressure threshold: When GPU memory usage crosses a configurable threshold, SIE begins evicting models.

Eviction strategy: The least recently used model is evicted first. Models with large VRAM footprints are prioritised for eviction when memory pressure is high.

LRU tracking: Every inference call updates the model’s last-used timestamp. Frequently used models stay resident. Infrequently used models are evicted and reloaded on demand.
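The LRU behaviour described above can be sketched with an ordered dictionary. The `ModelCache` class, the megabyte figures, and the 0.9 threshold are hypothetical. The sketch implements plain LRU only and does not model the preference for evicting large-footprint models under high pressure.

```python
from collections import OrderedDict


class ModelCache:
    def __init__(self, capacity_mb, pressure_threshold=0.9):
        self.capacity_mb = capacity_mb
        self.pressure_threshold = pressure_threshold
        self._models = OrderedDict()  # name -> size in MB, least recently used first

    @property
    def used_mb(self):
        return sum(self._models.values())

    def resident(self):
        return list(self._models)

    def use(self, name, size_mb):
        """Record an inference call, loading the model if needed. Returns evicted names."""
        evicted = []
        if name in self._models:
            # Already loaded: just refresh its position in the LRU order.
            self._models.move_to_end(name)
            return evicted
        limit = self.capacity_mb * self.pressure_threshold
        # Evict least recently used models until the new one fits under the threshold.
        while self._models and self.used_mb + size_mb > limit:
            victim, _ = self._models.popitem(last=False)
            evicted.append(victim)
        self._models[name] = size_mb
        return evicted


cache = ModelCache(capacity_mb=10_000, pressure_threshold=0.9)
cache.use("a", 4_000)
cache.use("b", 4_000)
cache.use("a", 4_000)              # touching "a" makes "b" the oldest
evicted = cache.use("c", 3_000)    # loading "c" would cross the threshold
```

In this run, loading the third model evicts `b` rather than `a`, because `a` was used more recently.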

How Do I Read the Timing Breakdown?

Every SIE response includes a timing field that breaks down where time was spent:

Stage	What it measures
`queue_time`	Time waiting for a GPU worker slot
`preprocess_time`	Tokenisation and image processing
`batch_wait_time`	Time waiting for other requests to join the batch
`inference_time`	Actual GPU forward pass
`postprocess_time`	MUVERA, quantisation, response serialisation
`total_time`	End-to-end wall time

Use `inference_time` to assess model performance. Use `queue_time` and `batch_wait_time` to assess server load and concurrency. See Performance Tuning for guidance on using these metrics.
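A quick way to read the breakdown is to express each stage as a share of `total_time`. The sketch below assumes the timing field arrives as a flat mapping of the stage names in the table to numbers in a single unit; that shape, the unit, and the sample values are assumptions.

```python
STAGES = [
    "queue_time",
    "preprocess_time",
    "batch_wait_time",
    "inference_time",
    "postprocess_time",
]


def summarise(timing):
    """Return each stage's share of total_time, plus the combined waiting share."""
    total = timing["total_time"]
    shares = {stage: timing[stage] / total for stage in STAGES}
    waiting = shares["queue_time"] + shares["batch_wait_time"]
    return shares, waiting


# Hypothetical timing values for one response.
timing = {
    "queue_time": 30.0,
    "preprocess_time": 2.0,
    "batch_wait_time": 8.0,
    "inference_time": 50.0,
    "postprocess_time": 10.0,
    "total_time": 100.0,
}
shares, waiting = summarise(timing)
```

A high waiting share points at load or concurrency settings, while a high inference share points at the model itself.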

Frequently Asked Questions

What compute engines does SIE support? SIE wraps PyTorch, SGLang, and Flash Attention. The server selects the best engine for each model automatically based on model type and hardware. You do not configure this directly.

Can I add my own models to SIE? Yes. SIE supports custom model adapters. See Adding Models and Model Adapters.

How does SIE handle LoRA adapters? SIE supports LoRA adapters for fine-tuned model variants. See LoRA Adapters.

What is a model bundle in SIE? A bundle is a packaged combination of a base model and its configuration, making it easy to deploy a specific model variant consistently. See Bundles.