---
title: Performance Tuning
description: Optimize SIE for your workload with batching, memory, and inference settings.
canonical_url: https://superlinked.com/docs/deployment/tuning
last_updated: 2026-05-20
---

SIE provides several tuning parameters that affect throughput, latency, and resource usage. This guide covers the main configuration options.

## Batching Parameters

Source: [packages/sie_server/src/sie_server/core/batcher.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/batcher.py)

Batching groups requests to maximize GPU utilization. Three parameters control batch formation:

### max_batch_cost

Maximum total cost per batch. For text, cost equals token count. Default: 16384 tokens.

`max_batch_cost` has no environment variable. Its default lives in `BatchConfig`, and it is configured per model.

### max_batch_wait_ms

Maximum time to wait for more requests before processing a batch. Default: 10ms.

```bash
# Environment variable
export SIE_MAX_BATCH_WAIT_MS=20
```

Lower values reduce latency for sparse traffic. Higher values improve batching efficiency under load.

### max_batch_requests

Maximum number of requests per batch. Default: 64.

```bash
# Environment variable
export SIE_MAX_BATCH_REQUESTS=128
```

This is a secondary limit. Cost-based batching typically triggers first for text workloads.
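
The way the three limits interact can be sketched as follows. This is a minimal illustration, not the code in `batcher.py`: `BatchLimits` and `form_batch` are hypothetical names, and the rule that the first request is always admitted is an assumption made so that an oversized request cannot stall the queue.

```python
from dataclasses import dataclass


@dataclass
class BatchLimits:
    # Defaults mirror the values documented in this section.
    max_batch_cost: int = 16384      # total tokens per batch
    max_batch_requests: int = 64     # requests per batch
    max_batch_wait_ms: float = 10.0  # time to wait for more requests


def form_batch(pending, limits, waited_ms):
    """Return (batch, ready) for a queue of (request_id, token_count) pairs.

    Hypothetical helper for illustration; not the SIE batcher API.
    """
    batch, cost = [], 0
    for request_id, tokens in pending:
        # Assumption: the first request is always admitted, whatever its cost.
        over_cost = bool(batch) and cost + tokens > limits.max_batch_cost
        if len(batch) >= limits.max_batch_requests or over_cost:
            break
        batch.append(request_id)
        cost += tokens
    # A batch is dispatched once a limit is hit or the wait budget is spent.
    hit_limit = (
        len(batch) >= limits.max_batch_requests
        or cost >= limits.max_batch_cost
        or len(batch) < len(pending)
    )
    ready = bool(batch) and (hit_limit or waited_ms >= limits.max_batch_wait_ms)
    return batch, ready
```

With the defaults, forty pending requests of 512 tokens each fill a batch at 32 requests (32 × 512 = 16384 tokens), so the cost limit triggers well before the 64-request limit.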

### Tuning Strategy

For low-latency workloads, reduce `max_batch_wait_ms` to 5ms or less. For high-throughput batch processing, increase `max_batch_wait_ms` and `max_batch_requests`.

## Memory Thresholds

Source: [packages/sie_server/src/sie_server/core/memory.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/memory.py)

SIE uses reactive LRU eviction to manage GPU memory. No static VRAM budget is required.

### Pressure Threshold

When memory usage exceeds this percentage, the least-recently-used model is evicted. Default: 85%.

```bash
# Environment variable
export SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT=85
```

Lower values keep more headroom for inference spikes. Higher values allow more models to stay loaded.

### How Eviction Works

The memory manager checks pressure at two points:

1. **Before loading**: If above threshold, evict LRU model first
2. **After each batch**: Background check for gradual memory growth

Models are tracked by last-use time, and the least recently used model is evicted first.

```python
# From memory.py - LRU tracking
def touch(self, model_name: str) -> None:
    if model_name in self._models:
        self._models[model_name].touch()
        self._models.move_to_end(model_name)
```
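
The "before loading" check can be sketched with the same `OrderedDict` ordering. This is a toy model under stated assumptions: `LruModels` is a hypothetical class, memory readings are passed in as plain numbers rather than queried from the device, and evicting repeatedly until usage drops below the threshold is an assumption (the real manager may evict one model per check).

```python
from collections import OrderedDict


class LruModels:
    """Toy registry: the first key is the least recently used model."""

    def __init__(self, threshold_percent=85.0):
        self.threshold_percent = threshold_percent
        self._models = OrderedDict()  # model name -> memory footprint in MB

    def touch(self, name):
        # Mark a model as most recently used.
        if name in self._models:
            self._models.move_to_end(name)

    def load(self, name, size_mb, used_mb, total_mb):
        """Evict LRU models while usage is above the threshold, then load.

        Returns the names of the evicted models, oldest first.
        """
        evicted = []
        while self._models and 100.0 * used_mb / total_mb > self.threshold_percent:
            victim, victim_mb = self._models.popitem(last=False)
            used_mb -= victim_mb
            evicted.append(victim)
        self._models[name] = size_mb
        return evicted
```

For example, with models `a` and `b` loaded and `a` touched most recently, a load request arriving at 90% usage evicts `b` and leaves `a` in place.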

### Device-Specific Behavior

Memory tracking adapts to your hardware:

| Device | Memory Source |
|--------|---------------|
| CUDA | NVML device memory query |
| MPS | PyTorch allocated memory |
| CPU | System RAM via psutil |

## Attention Backend

Source: [packages/sie_server/src/sie_server/core/inference.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/inference.py)

The attention implementation affects inference speed significantly.

### Available Backends

| Backend | Requirements | Speedup vs. `eager` |
|---------|-------------|---------|
| `flash_attention_2` | Ampere+ GPU, flash-attn package | 2-4x |
| `sdpa` | PyTorch 2.0+ | 1.5-2x |
| `eager` | Any | Baseline |

### Configuration

```bash
# Auto-select best available (default)
export SIE_ATTENTION_BACKEND=auto

# Force specific backend
export SIE_ATTENTION_BACKEND=flash_attention_2
export SIE_ATTENTION_BACKEND=sdpa
```

Auto mode selects Flash Attention 2 if available, then SDPA, then eager.

### Flash Attention Requirements

Flash Attention 2 requires:

- CUDA compute capability 8.0+ (Ampere or newer, e.g. A100, RTX 30xx, RTX 40xx)
- The `flash-attn` package installed
- FP16 or BF16 compute precision (not FP32)

If requirements are not met, the server falls back to SDPA automatically.
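
The selection and fallback order described above can be sketched as a pure function. The function name and its parameters are hypothetical; the actual logic in `inference.py` may probe the hardware and installed packages differently.

```python
def select_attention_backend(
    requested="auto",
    *,
    compute_capability=(0, 0),
    flash_attn_installed=False,
    precision="float16",
    sdpa_available=True,
):
    """Resolve an SIE_ATTENTION_BACKEND value to a concrete backend name.

    Hypothetical helper for illustration; all inputs are supplied by the caller.
    """
    if requested not in ("auto", "flash_attention_2", "sdpa", "eager"):
        raise ValueError(f"unknown attention backend: {requested!r}")

    # The three Flash Attention 2 requirements listed above.
    flash_ok = (
        compute_capability >= (8, 0)
        and flash_attn_installed
        and precision in ("float16", "bfloat16")
    )
    if requested in ("auto", "flash_attention_2"):
        if flash_ok:
            return "flash_attention_2"
        # Requirements not met: fall back to SDPA, then eager.
        return "sdpa" if sdpa_available else "eager"
    if requested == "sdpa":
        return "sdpa" if sdpa_available else "eager"
    return "eager"
```

Under these assumptions, requesting `flash_attention_2` with `float32` precision resolves to `sdpa`, matching the fallback rule above.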

## Compute Precision

Control the precision used for model inference:

```bash
# Options: float16, bfloat16, float32
export SIE_DEFAULT_COMPUTE_PRECISION=float16
```

| Precision | Memory | Speed | Compatibility |
|-----------|--------|-------|---------------|
| `float16` | Low | Fast | All CUDA GPUs |
| `bfloat16` | Low | Fast | Ampere+, MPS, CPU |
| `float32` | High | Slow | All devices |

BF16 has the same exponent range as FP32, so it is less prone to overflow than FP16 and is more numerically stable for some models. FP32 is mainly for debugging.
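
The "Memory" column follows from bytes per parameter. A rough estimate for weights only (activations and other runtime buffers are not counted), using a hypothetical helper and a hypothetical 500M-parameter model:

```python
# Storage size of one parameter at each precision, in bytes.
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "float32": 4}


def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (10**9 bytes), excluding activations."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


half = weight_memory_gb(500_000_000, "float16")  # 1.0
full = weight_memory_gb(500_000_000, "float32")  # 2.0
```

Halving the precision halves the weight footprint, which is why `float16` and `bfloat16` let more models stay loaded under the same pressure threshold.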

## Preprocessing Workers

Source: [packages/sie_server/src/sie_server/core/preprocessor_registry.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/preprocessor_registry.py)

Tokenization and image processing run in a CPU thread pool.

```bash
# Environment variable
export SIE_PREPROCESSOR_WORKERS=8
```

Default: `4`. Increase for high request rates. Decrease on memory-constrained systems.

The thread pool is shared across all models. Both tokenization and image preprocessing use the same pool.
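
A minimal sketch of a shared preprocessing pool, using only the standard library. The `tokenize` and `resize` functions are stand-ins invented for this example, not SIE functions; only the environment variable name and its default come from this guide.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Read the worker count from the environment (documented default: 4).
workers = int(os.environ.get("SIE_PREPROCESSOR_WORKERS", "4"))

# One pool shared by every model and by both kinds of preprocessing.
pool = ThreadPoolExecutor(max_workers=workers, thread_name_prefix="preprocess")


def tokenize(text):
    # Stand-in for a real tokenizer call.
    return text.lower().split()


def resize(image_size):
    # Stand-in for image preprocessing: cap the longest side at 224 pixels.
    width, height = image_size
    scale = 224 / max(width, height)
    return (round(width * scale), round(height * scale))


# Text and image work are submitted to the same pool.
text_future = pool.submit(tokenize, "Hello SIE world")
image_future = pool.submit(resize, (448, 224))
tokens = text_future.result()
resized = image_future.result()
pool.shutdown()
```

Because both kinds of work compete for the same threads, a burst of image requests can delay tokenization, which is one reason to raise the worker count at high request rates.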

## Environment Variables

Most tuning parameters can be set via environment variables with the `SIE_` prefix. The exception is `max_batch_cost`, which is configured per model:

| Variable | Default | Description |
|----------|---------|-------------|
| `SIE_MAX_BATCH_REQUESTS` | 64 | Max requests per batch |
| `SIE_MAX_BATCH_WAIT_MS` | 10 | Max wait time (ms) |
| `SIE_MAX_CONCURRENT_REQUESTS` | 512 | Request queue size |
| `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT` | 85 | Eviction trigger (%) |
| `SIE_PREPROCESSOR_WORKERS` | 4 | CPU thread pool size |
| `SIE_ATTENTION_BACKEND` | auto | Attention implementation |
| `SIE_DEFAULT_COMPUTE_PRECISION` | float16 | Model precision |
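
To show how these variables map to typed settings, here is a sketch of a loader that applies the defaults from the table. `TuningSettings` and `load_settings` are hypothetical names, and the field types are assumptions; the server's own settings class may differ.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TuningSettings:
    # Defaults copied from the table above; types are assumed.
    max_batch_requests: int = 64
    max_batch_wait_ms: int = 10
    max_concurrent_requests: int = 512
    memory_pressure_threshold_percent: int = 85
    preprocessor_workers: int = 4
    attention_backend: str = "auto"
    default_compute_precision: str = "float16"


def load_settings(env=None):
    """Build settings from SIE_-prefixed variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(TuningSettings):
        raw = env.get("SIE_" + field.name.upper())
        if raw is not None:
            # Cast the raw string to the type of the field's default value.
            overrides[field.name] = type(field.default)(raw)
    return TuningSettings(**overrides)
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise, for example `load_settings({"SIE_MAX_BATCH_WAIT_MS": "20"})`.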

## Benchmarking Changes

Use the eval harness to measure the impact of tuning changes:

```bash
# Performance benchmark
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie

# Compare before/after
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,targets
```

The perf eval reports throughput (items/sec), latency percentiles, and GPU utilization.

See the [Evals documentation](/docs/evals/) for the full benchmarking workflow.

## What's Next

- [Request Lifecycle](/docs/engine/) - how batching and memory work together
- [Evals](/docs/evals/) - benchmark your configuration changes