
How SIE processes inference requests

SIE processes every request through a multi-stage pipeline: preprocessing, cost-based batching, GPU inference, and postprocessing. Understanding this flow helps you tune throughput, debug latency issues, and get the most out of your GPU.

The pipeline handles tokenisation, dynamic batching across concurrent requests, model-specific inference via the best available compute engine, and output transforms like MUVERA and quantisation.

For a visual overview, see the Architecture page.


Before any model inference, SIE converts inputs into the format each model expects.

Tokenisation converts raw text into token IDs using the model’s tokenizer. Sequence lengths are calculated at this stage and feed directly into the batching cost model.

Image processing resizes and normalises images for vision models. The specific transform depends on the model’s expected input format.


Rather than batching by item count, SIE batches by estimated compute cost. This avoids wasting GPU cycles padding short sequences to the length of one long outlier.

How it works:

  1. Each item’s token sequence length is used to estimate its compute cost.
  2. Items are grouped into batches up to a configurable cost ceiling.
  3. Sequences within a batch are padded to the longest item in that batch only, not to the global maximum.

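Cost semantics: Cost is proportional to the square of the sequence length, because attention compares every token with every other token. Doubling the length therefore quadruples the cost: a 512-token sequence costs roughly four times as much as a 256-token sequence.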

Padding optimisation: By grouping items of similar length, SIE minimises padding overhead. This is especially valuable for workloads with variable-length documents.
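The grouping steps above can be sketched as follows. This is a minimal illustration, not SIE's implementation: the names `estimate_cost`, `batch_by_cost`, and `padded_tokens`, the sort-by-length step, and the ceiling value are all assumptions made for the example.

```python
from typing import List


def estimate_cost(seq_len: int) -> int:
    """Hypothetical cost model: cost grows with the square of sequence length."""
    return seq_len * seq_len


def batch_by_cost(seq_lens: List[int], cost_ceiling: int) -> List[List[int]]:
    """Group sequence lengths into batches whose summed cost stays under a ceiling.

    Sorting first keeps similar lengths together, which limits padding.
    An item that exceeds the ceiling on its own still gets a batch to itself.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_cost = 0
    for n in sorted(seq_lens):
        cost = estimate_cost(n)
        if current and current_cost + cost > cost_ceiling:
            batches.append(current)
            current, current_cost = [], 0
        current.append(n)
        current_cost += cost
    if current:
        batches.append(current)
    return batches


def padded_tokens(batches: List[List[int]]) -> int:
    """Total tokens processed when each batch is padded to its own longest item."""
    return sum(max(b) * len(b) for b in batches)


lengths = [128, 512, 120, 130, 125, 500]
batches = batch_by_cost(lengths, cost_ceiling=300_000)
print(batches)
print(padded_tokens(batches), "padded tokens, versus",
      max(lengths) * len(lengths), "when padding everything to the global maximum")
```

In this example the four short sequences share one batch and are padded only to 130 tokens, while the two long sequences each get their own batch.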


SIE abstracts over multiple compute engines and selects the best one per model automatically.

Cross-request batching: The server collects items from concurrent API requests and batches them together into a single GPU forward pass. This significantly improves GPU utilisation under load compared to processing each request in isolation.
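One way to picture cross-request batching is a shared queue drained by a single batching loop. The sketch below uses `asyncio` and a stand-in `fake_forward` function in place of a real GPU call. The names, the `max_items` and `max_wait` parameters, and the queue-based design are assumptions for illustration and do not describe SIE's internals.

```python
import asyncio
from typing import List, Tuple


def fake_forward(texts: List[str]) -> List[int]:
    """Stand-in for a GPU forward pass: returns the length of each text."""
    return [len(t) for t in texts]


async def batcher(queue: asyncio.Queue, max_items: int, max_wait: float,
                  calls: List[int]) -> None:
    """Collect items from many requests, then run them in one forward pass."""
    loop = asyncio.get_running_loop()
    while True:
        pending: List[Tuple[str, asyncio.Future]] = [await queue.get()]
        deadline = loop.time() + max_wait
        while len(pending) < max_items:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                pending.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = fake_forward([text for text, _ in pending])
        calls.append(len(pending))  # record the batch size for inspection
        for (_, fut), out in zip(pending, outputs):
            fut.set_result(out)


async def handle_request(queue: asyncio.Queue, text: str) -> int:
    """One API request: enqueue the item and wait for its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut


async def main() -> Tuple[List[int], List[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    calls: List[int] = []
    task = asyncio.create_task(batcher(queue, max_items=8, max_wait=0.05, calls=calls))
    results = await asyncio.gather(
        *(handle_request(queue, t) for t in ["a", "bb", "ccc"])
    )
    task.cancel()
    return list(results), calls


results, batch_sizes = asyncio.run(main())
print(results, batch_sizes)
```

Each caller still receives only its own result, even though the three concurrent requests were served together. The `max_wait` window is the trade-off: a longer wait fills larger batches but adds latency to the first request in each batch.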

Worker execution: Each model runs in a dedicated worker process. Workers load the model on first request and stay warm in memory until evicted by the LRU policy. The router directs requests to the correct worker by model name. See Router and Model Adapters.


After GPU inference, SIE applies output transforms before returning results.

MUVERA transform converts multi-vector (ColBERT) outputs into a fixed-size representation compatible with standard ANN indexes. This makes multi-vector search practical without specialised infrastructure.

Quantisation optionally compresses float32 embeddings to int8 or binary format to reduce storage and speed up ANN search. See Quantisation.
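To make the storage saving concrete, here is a small pure-Python sketch of the two target formats. The scaling scheme (symmetric, by the largest absolute value) and the sign-bit rule for binary output are common choices assumed for this example. The source does not say which scheme SIE uses, and the function names are hypothetical.

```python
from typing import List


def quantize_int8(vec: List[float]) -> List[int]:
    """Scale by the largest magnitude so values map into [-127, 127]."""
    peak = max(abs(x) for x in vec) or 1.0  # avoid dividing by zero
    return [round(x / peak * 127) for x in vec]


def quantize_binary(vec: List[float]) -> bytes:
    """Keep one sign bit per dimension, packed eight dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)


embedding = [0.25, -1.0, 0.75, 0.0, 1.0, -0.25, 0.1, -0.1]
print(quantize_int8(embedding))
print(quantize_binary(embedding).hex())
```

For this 8-dimensional vector, float32 storage takes 32 bytes, int8 takes 8 bytes, and the binary form takes 1 byte.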


SIE uses LRU (least recently used) eviction to share one GPU across many models without requiring them all to be loaded simultaneously.

Pressure threshold: When GPU memory usage crosses a configurable threshold, SIE begins evicting models.

Eviction strategy: The least recently used model is evicted first. Models with large VRAM footprints are prioritised for eviction when memory pressure is high.

LRU tracking: Every inference call updates the model’s last-used timestamp. Frequently used models stay resident. Infrequently used models are evicted and reloaded on demand.
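The interaction between the pressure threshold and LRU tracking can be sketched with an `OrderedDict`. This is a toy model under stated assumptions: the `ModelCache` class, its parameters, and the model names and sizes are hypothetical, and it does not model the prioritisation of large VRAM footprints mentioned above.

```python
from collections import OrderedDict
from typing import List


class ModelCache:
    """Toy LRU cache that tracks model VRAM usage (sizes in MB)."""

    def __init__(self, capacity_mb: int, pressure_threshold: float) -> None:
        self.limit_mb = capacity_mb * pressure_threshold
        # Maps model name to size; the least recently used entry comes first.
        self.loaded: "OrderedDict[str, int]" = OrderedDict()

    def used_mb(self) -> int:
        return sum(self.loaded.values())

    def use(self, name: str, size_mb: int) -> List[str]:
        """Record an inference call, loading the model if needed.

        Returns the names of any models evicted to make room.
        """
        evicted: List[str] = []
        if name in self.loaded:
            self.loaded.move_to_end(name)  # refresh the last-used position
            return evicted
        # Evict least recently used models until the new one fits under the threshold.
        while self.loaded and self.used_mb() + size_mb > self.limit_mb:
            old, _ = self.loaded.popitem(last=False)
            evicted.append(old)
        self.loaded[name] = size_mb
        return evicted


cache = ModelCache(capacity_mb=10_000, pressure_threshold=0.8)
cache.use("model-a", 3_000)
cache.use("model-b", 3_000)
cache.use("model-a", 3_000)               # model-a becomes most recently used
evicted = cache.use("model-c", 4_000)     # model-b is now the oldest entry
print(evicted, list(cache.loaded))
```

Because `model-a` was used again before `model-c` arrived, `model-b` is the one evicted, and it would be reloaded on its next request.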


Every SIE response includes a timing field that breaks down where time was spent:

Stage              What it measures
queue_time         Time waiting for a GPU worker slot
preprocess_time    Tokenisation and image processing
batch_wait_time    Time waiting for other requests to join the batch
inference_time     Actual GPU forward pass
postprocess_time   MUVERA, quantisation, response serialisation
total_time         End-to-end wall time

Use inference_time to assess model performance. Use queue_time and batch_wait_time to assess server load and concurrency. See Performance Tuning for guidance on using these metrics.
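As a rough illustration of that guidance, the helper below compares time spent waiting with time spent in the forward pass. The stage names come from the table above, but everything else is an assumption for the example: the flat dictionary shape, the millisecond units, the sample values, and the `load-bound` / `model-bound` labels.

```python
from typing import Dict


def stage_shares(timing: Dict[str, float]) -> Dict[str, float]:
    """Return each stage's share of total_time as a fraction between 0 and 1."""
    total = timing["total_time"]
    if total <= 0:
        return {}
    return {k: v / total for k, v in timing.items() if k != "total_time"}


def diagnose(timing: Dict[str, float]) -> str:
    """Rough heuristic: is the request dominated by waiting or by the model?"""
    shares = stage_shares(timing)
    waiting = shares.get("queue_time", 0.0) + shares.get("batch_wait_time", 0.0)
    if waiting > shares.get("inference_time", 0.0):
        return "load-bound"
    return "model-bound"


# Made-up example values, assumed to be in milliseconds.
timing = {
    "queue_time": 12.0,
    "preprocess_time": 1.5,
    "batch_wait_time": 4.0,
    "inference_time": 9.0,
    "postprocess_time": 0.5,
    "total_time": 27.0,
}
print(diagnose(timing))
```

Here waiting accounts for 16 of 27 time units against 9 for inference, so the heuristic points at server load and concurrency, not model speed.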


What compute engines does SIE support? SIE wraps PyTorch, SGLang, and Flash Attention. The server selects the best engine for each model automatically based on model type and hardware. You do not configure this directly.

Can I add my own models to SIE? Yes. SIE supports custom model adapters. See Adding Models and Model Adapters.

How does SIE handle LoRA adapters? SIE supports LoRA adapters for fine-tuned model variants. See LoRA Adapters.

What is a model bundle in SIE? A bundle is a packaged combination of a base model and its configuration, making it easy to deploy a specific model variant consistently. See Bundles.
