---
title: How SIE processes inference requests
description: SIE routes every request through a multi-stage pipeline covering preprocessing, cost-based batching, GPU inference, and postprocessing. Understanding this flow helps you tune throughput, debug latency, and optimise GPU utilisation.
canonical_url: https://superlinked.com/docs/engine
last_updated: 2026-05-20
---

**SIE processes every request through a multi-stage pipeline: preprocessing, cost-based batching, GPU inference, and postprocessing.** Understanding this flow helps you tune throughput, debug latency issues, and get the most out of your GPU.

The pipeline handles tokenisation, dynamic batching across concurrent requests, model-specific inference via the best available compute engine, and output transforms like MUVERA and quantisation.

For a visual overview, see the [Architecture page](https://superlinked.com/docs/engine/architecture/).

---

## How Does SIE Preprocess Requests?

Before any model inference, SIE prepares inputs into the format each model expects.

**Tokenisation** converts raw text into token IDs using the model's tokenizer. Sequence lengths are calculated at this stage and feed directly into the batching cost model.
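
As a minimal sketch of how sequence length falls out of tokenisation, the snippet below uses a hypothetical whitespace tokenizer (`toy_tokenise`) and a made-up three-word vocabulary. Real models use their own subword tokenizers, so treat the names and IDs here as illustrative assumptions rather than SIE's API.

```python
from dataclasses import dataclass


@dataclass
class TokenisedItem:
    token_ids: list[int]

    @property
    def seq_len(self) -> int:
        # This length is what a batching cost model would consume later.
        return len(self.token_ids)


def toy_tokenise(text: str, vocab: dict[str, int]) -> TokenisedItem:
    """Hypothetical stand-in for a model tokenizer: whitespace split plus lookup."""
    unk_id = 0  # assumed ID for out-of-vocabulary words
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    return TokenisedItem(token_ids=ids)


vocab = {"cost": 1, "based": 2, "batching": 3}  # made-up vocabulary
item = toy_tokenise("Cost based batching rocks", vocab)
print(item.token_ids, item.seq_len)  # [1, 2, 3, 0] 4
```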

**Image processing** resizes and normalises images for vision models. The specific transform depends on the model's expected input format.
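
The following is a rough sketch of the resize-then-normalise idea on a tiny greyscale image stored as nested lists. The nearest-neighbour resize, the function names, and the mean and standard deviation of 0.5 are all assumptions chosen for illustration; the transform a given vision model needs may differ.

```python
def resize_nearest(pixels: list[list[int]], out_h: int, out_w: int) -> list[list[int]]:
    """Nearest-neighbour resize of a 2D greyscale image (assumed method)."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalise(pixels: list[list[int]], mean: float, std: float) -> list[list[float]]:
    """Scale 0-255 values to [0, 1], then shift and scale by mean and std."""
    return [[(p / 255.0 - mean) / std for p in row] for row in pixels]


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
small = resize_nearest(image, out_h=2, out_w=2)
model_input = normalise(small, mean=0.5, std=0.5)  # assumed statistics
print(small)        # [[0, 255], [255, 0]]
print(model_input)  # [[-1.0, 1.0], [1.0, -1.0]]
```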

---

## How Does Cost-Based Batching Work?

Rather than batching by item count, SIE batches by estimated compute cost. This avoids wasting GPU cycles padding short sequences to the length of one long outlier.

**How it works:**
1. Each item's token sequence length is used to estimate its compute cost.
2. Items are grouped into batches up to a configurable cost ceiling.
3. Sequences within a batch are padded to the longest item in that batch only, not to the global maximum.
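
The three steps above can be sketched as a simple greedy grouping. This is an illustration under stated assumptions, not SIE's scheduler: `estimate_cost`, `build_batches`, and the `cost_ceiling` value are hypothetical, the cost model is assumed to be quadratic in sequence length, and a batch's cost is assumed to be the sum of its items' costs.

```python
def estimate_cost(seq_len: int) -> int:
    # Assumed cost model: quadratic in sequence length.
    return seq_len * seq_len


def build_batches(seq_lens: list[int], cost_ceiling: int) -> list[list[int]]:
    """Greedily group sequence lengths so each batch's summed cost stays under the ceiling."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_cost = 0
    # Sorting keeps similar lengths together, which limits per-batch padding.
    for length in sorted(seq_lens):
        cost = estimate_cost(length)
        if current and current_cost + cost > cost_ceiling:
            batches.append(current)
            current, current_cost = [], 0
        # An item whose own cost exceeds the ceiling still gets a batch to itself.
        current.append(length)
        current_cost += cost
    if current:
        batches.append(current)
    return batches


def pad_lengths(batches: list[list[int]]) -> list[int]:
    # Each batch pads to its own longest item, not to a global maximum.
    return [max(batch) for batch in batches]


batches = build_batches([512, 30, 28, 32, 480], cost_ceiling=200_000)
print(batches)               # [[28, 30, 32], [480], [512]]
print(pad_lengths(batches))  # [32, 480, 512]
```

In this example the three short sequences are padded to 32 tokens rather than 512, which is the saving that cost-based grouping is meant to deliver.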

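**Cost semantics:** Cost is modelled as proportional to the square of the sequence length, because self-attention scales quadratically with the number of tokens. A 512-token sequence therefore costs roughly four times as much as a 256-token sequence: (512 / 256)² = 4.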

**Padding optimisation:** By grouping items of similar length, SIE minimises padding overhead. This is especially valuable for workloads with variable-length documents.
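
To make the padding saving concrete, the sketch below counts pad tokens for the same six sequences batched in arrival order and batched after sorting by length. The lengths and the fixed batch size of three are invented for the example, and `padding_tokens` is a hypothetical helper.

```python
def padding_tokens(batches: list[list[int]]) -> int:
    """Count pad tokens when each batch pads to its own longest sequence."""
    return sum(max(batch) - length for batch in batches for length in batch)


lengths = [500, 20, 480, 25, 510, 30]  # made-up sequence lengths

# Batches formed in arrival order mix short and long sequences.
arrival_order = [lengths[0:3], lengths[3:6]]

# Batches formed after sorting keep similar lengths together.
by_length = sorted(lengths)
length_grouped = [by_length[0:3], by_length[3:6]]

print(padding_tokens(arrival_order))   # 1465
print(padding_tokens(length_grouped))  # 55
```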

---

## How Does GPU Inference Work?

SIE abstracts over multiple compute engines and selects the best one per model automatically.

**Cross-request batching:** The server collects items from concurrent API requests and batches them together into a single GPU forward pass. This significantly improves GPU utilisation under load compared to processing each request in isolation.
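
One way to picture cross-request batching is as a gather step followed by a scatter step. In the sketch below, items from several requests are flattened into one batch, a placeholder `fake_forward_pass` stands in for the model, and the outputs are routed back to the request that owns them. All function names and request IDs are hypothetical.

```python
def batch_across_requests(requests: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Flatten items from several requests, remembering which request owns each one."""
    flat_items: list[str] = []
    owners: list[str] = []
    for request_id, items in requests.items():
        for item in items:
            flat_items.append(item)
            owners.append(request_id)
    return flat_items, owners


def fake_forward_pass(items: list[str]) -> list[int]:
    # Placeholder for a single batched model call: returns each item's length.
    return [len(item) for item in items]


def scatter_results(owners: list[str], outputs: list[int]) -> dict[str, list[int]]:
    """Hand each output back to its originating request, preserving item order."""
    results: dict[str, list[int]] = {}
    for request_id, output in zip(owners, outputs):
        results.setdefault(request_id, []).append(output)
    return results


requests = {"req-a": ["hello", "hi"], "req-b": ["good morning"]}
flat, owners = batch_across_requests(requests)
outputs = fake_forward_pass(flat)  # one call covers both requests
results = scatter_results(owners, outputs)
print(results)  # {'req-a': [5, 2], 'req-b': [12]}
```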

**Worker execution:** Each model runs in a dedicated worker process. Workers load the model on first request and stay warm in memory until evicted by the LRU policy. The router directs requests to the correct worker by model name. See [Router](https://superlinked.com/docs/engine/router/) and [Model Adapters](https://superlinked.com/docs/engine/adapters/).
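
The routing and lazy-loading behaviour described above can be modelled in a few lines. This is a simplified in-process sketch: the `Worker` and `Router` classes and the model names are hypothetical, real workers are separate processes, and eviction is left out here.

```python
class Worker:
    """Stand-in for a per-model worker; loads its model on first use."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        self.loaded = False
        self.load_count = 0

    def handle(self, item: str) -> str:
        if not self.loaded:
            # First request pays the load cost; later requests find the model warm.
            self.load_count += 1
            self.loaded = True
        return f"{self.model_name}:{item}"


class Router:
    """Directs each request to the worker registered under its model name."""

    def __init__(self, model_names: list[str]) -> None:
        self.workers = {name: Worker(name) for name in model_names}

    def route(self, model_name: str, item: str) -> str:
        worker = self.workers.get(model_name)
        if worker is None:
            raise KeyError(f"no worker registered for model {model_name!r}")
        return worker.handle(item)


router = Router(["model-a", "model-b"])  # made-up model names
router.route("model-a", "first")
router.route("model-a", "second")
print(router.workers["model-a"].load_count)  # 1
print(router.workers["model-b"].loaded)      # False
```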

---

## What Postprocessing Does SIE Apply?

After GPU inference, SIE applies output transforms before returning results.

**MUVERA transform** converts multi-vector (ColBERT) outputs into a fixed-size representation compatible with standard ANN indexes. This makes multi-vector search practical without specialised infrastructure.

**Quantisation** optionally compresses float32 embeddings to int8 or binary format to reduce storage and speed up ANN search. See [Quantisation](https://superlinked.com/docs/encode/quantization/).
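
As a rough illustration of the two target formats, the sketch below applies symmetric linear quantisation for int8 and sign-based quantisation for binary. Both are common schemes, but they are assumptions here: the source does not state which scheme SIE uses, and the function names are hypothetical.

```python
def quantise_int8(vector: list[float]) -> tuple[list[int], float]:
    """Symmetric linear quantisation: the largest magnitude maps to 127."""
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0:
        return [0 for _ in vector], 1.0
    codes = [round(x / max_abs * 127) for x in vector]
    scale = max_abs / 127  # multiply a code by this to approximate the original value
    return codes, scale


def quantise_binary(vector: list[float]) -> list[int]:
    """Keep only the sign of each dimension: 1 for positive, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vector]


embedding = [0.9, -0.3, 0.1, 0.0]  # made-up float embedding
codes, scale = quantise_int8(embedding)
bits = quantise_binary(embedding)
print(codes)  # [127, -42, 14, 0]
print(bits)   # [1, 0, 1, 0]
```

A float32 value takes 4 bytes per dimension, an int8 code takes 1 byte, and a binary code takes 1 bit, which is where the storage reduction comes from.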

---

## How Does SIE Manage GPU Memory?

SIE uses LRU (least recently used) eviction to share one GPU across many models without requiring them all to be loaded simultaneously.

**Pressure threshold:** When GPU memory usage crosses a configurable threshold, SIE begins evicting models.

**Eviction strategy:** By default, the least recently used model is evicted first. When memory pressure is high, models with large VRAM footprints are prioritised for eviction.

**LRU tracking:** Every inference call updates the model's last-used timestamp. Frequently used models stay resident. Infrequently used models are evicted and reloaded on demand.
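
The threshold and LRU behaviour can be sketched with an ordered dictionary. This is a minimal model, not SIE's implementation: the `ModelCache` class, its parameters, and the sizes are hypothetical, and the sketch covers plain LRU order only, leaving out the extra priority given to large models under high pressure.

```python
from collections import OrderedDict


class ModelCache:
    """Tracks resident models in LRU order and evicts past a pressure threshold."""

    def __init__(self, capacity_mb: int, pressure_threshold: float) -> None:
        self.capacity_mb = capacity_mb
        self.pressure_threshold = pressure_threshold  # fraction of capacity
        self.models = OrderedDict()  # model name -> VRAM footprint in MB

    def used_mb(self) -> int:
        return sum(self.models.values())

    def touch(self, name: str, size_mb: int) -> list[str]:
        """Record a use of `name`, loading it if needed; return evicted names."""
        if name in self.models:
            self.models.move_to_end(name)  # most recently used moves to the back
            return []
        self.models[name] = size_mb
        evicted: list[str] = []
        limit = self.capacity_mb * self.pressure_threshold
        # Evict from the front (least recently used), never the model just loaded.
        while self.used_mb() > limit and len(self.models) > 1:
            victim, _ = self.models.popitem(last=False)
            evicted.append(victim)
        return evicted


cache = ModelCache(capacity_mb=1000, pressure_threshold=0.8)
cache.touch("a", 300)
cache.touch("b", 300)
cache.touch("a", 300)            # "a" becomes the most recently used
evicted = cache.touch("c", 300)  # 900 MB exceeds the 800 MB limit
print(evicted)             # ['b']
print(list(cache.models))  # ['a', 'c']
```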

---

## How Do I Read the Timing Breakdown?

Every SIE response includes a `timing` field that breaks down where time was spent:

| Stage | What it measures |
| --- | --- |
| `queue_time` | Time waiting for a GPU worker slot |
| `preprocess_time` | Tokenisation and image processing |
| `batch_wait_time` | Time waiting for other requests to join the batch |
| `inference_time` | Actual GPU forward pass |
| `postprocess_time` | MUVERA, quantisation, response serialisation |
| `total_time` | End-to-end wall time |

Use `inference_time` to assess model performance. Use `queue_time` and `batch_wait_time` to assess server load and concurrency. See [Performance Tuning](https://superlinked.com/docs/deployment/tuning/) for guidance on using these metrics.
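
A quick way to read the breakdown is to express each stage as a share of `total_time`. The sketch below assumes `timing` is a flat mapping from the stage names in the table to numbers in a single unit. The sample values are invented, and the helper names are hypothetical.

```python
def stage_shares(timing: dict[str, float]) -> dict[str, float]:
    """Return each stage's fraction of the end-to-end time."""
    total = timing["total_time"]
    return {
        stage: value / total
        for stage, value in timing.items()
        if stage != "total_time"
    }


def dominant_stage(timing: dict[str, float]) -> str:
    """Name the stage that accounts for the largest share of total time."""
    shares = stage_shares(timing)
    return max(shares, key=shares.get)


# Invented sample values, all in the same (assumed) unit.
timing = {
    "queue_time": 2.0,
    "preprocess_time": 1.0,
    "batch_wait_time": 12.0,
    "inference_time": 4.0,
    "postprocess_time": 1.0,
    "total_time": 20.0,
}
print(dominant_stage(timing))                  # batch_wait_time
print(stage_shares(timing)["inference_time"])  # 0.2
```

In this invented sample, most of the time is spent waiting for the batch to fill rather than in the forward pass, which would point to batching or load settings rather than the model itself.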

---

## Frequently Asked Questions

**What compute engines does SIE support?**
SIE wraps PyTorch, SGLang, and Flash Attention. The server selects the best engine for each model automatically based on model type and hardware. You do not configure this directly.

**Can I add my own models to SIE?**
Yes. SIE supports custom model adapters. See [Adding Models](https://superlinked.com/docs/engine/adding-models/) and [Model Adapters](https://superlinked.com/docs/engine/adapters/).

**How does SIE handle LoRA adapters?**
SIE supports LoRA adapters for fine-tuned model variants. See [LoRA Adapters](https://superlinked.com/docs/engine/lora/).

**What is a model bundle in SIE?**
A bundle is a packaged combination of a base model and its configuration, making it easy to deploy a specific model variant consistently. See [Bundles](https://superlinked.com/docs/engine/bundles/).