---
title: Performance Evaluation
description: Measure latency, throughput, and scalability of embedding models.
canonical_url: https://superlinked.com/docs/evals/performance
last_updated: 2026-05-20
---

Performance evaluation measures how quickly a model serves requests. The benchmark harness tracks latency percentiles, throughput in tokens per second, and behavior under sustained load.

## Performance Metrics

Source: [packages/sie_bench/src/sie_bench/report.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/report.py)

SIE captures the following metrics during performance benchmarks:

| Metric | Description |
|--------|-------------|
| **p50 latency** | Median response time in milliseconds |
| **p90/p95/p99 latency** | Tail latency percentiles for SLA planning |
| **min/max latency** | Range of observed response times |
| **tokens/sec** | Processing throughput for corpus and query workloads |
| **items/sec** | Item throughput (tokens/sec divided by average sequence length in tokens) |

Corpus throughput measures document encoding speed. Query throughput measures short-text encoding with `is_query=True`.
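To make the table concrete, the following is a minimal sketch of how these metrics can be derived from raw per-request latencies and token counts. It is not the code in `report.py`: the `percentile` and `summarize` helpers are hypothetical, and the nearest-rank percentile method is an assumption (the harness may interpolate instead).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, total_tokens, total_items, wall_time_s):
    """Summarize one run (hypothetical helper, not part of sie_bench)."""
    tokens_per_sec = total_tokens / wall_time_s
    avg_seq_len = total_tokens / total_items
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        "tokens_per_sec": tokens_per_sec,
        # items/sec = tokens/sec divided by average sequence length
        "items_per_sec": tokens_per_sec / avg_seq_len,
    }
```

For example, 100 items totalling 12,800 tokens processed in 10 seconds give 1,280 tokens/sec and, at an average of 128 tokens per item, 10 items/sec.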

## Running Performance Evals

Source: [packages/sie_bench/src/sie_bench/cli.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/cli.py)

Use `--type perf` to run performance benchmarks instead of quality evaluations:

```bash
# Performance benchmark on SIE
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE vs TEI performance
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie,tei

# Save results as baseline measurements
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf -s sie --save-measurements sie
```

The eval harness automatically starts and stops servers. Do not manually start Docker containers.

### Performance Options

| Option | Default | Description |
|--------|---------|-------------|
| `--batch-size` | 1 | Items per request |
| `--concurrency` | 16 | Number of parallel clients |
| `--device` | cuda:0 | Inference device |
| `--timeout` | 120.0 | Request timeout in seconds |

## Load Testing

Source: [packages/sie_bench/src/sie_bench/loadtest.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/loadtest.py)

For sustained load testing against a cluster, use the `loadtest` command with a YAML scenario file:

```bash
mise run sie-bench -- loadtest scenario.yaml --cluster http://gateway:8080
```

The load test harness provides a live progress display showing:
- Current and target requests per second
- Rolling p50 and p99 latency
- Success and error counts
- Per-model request distribution

### Scenario Configuration

```yaml
# scenario.yaml
models:
  - BAAI/bge-m3
  - NovaSearch/stella_en_400M_v5
gpu_types:
  - l4
load_profile:
  pattern: constant
  target_rps: 100
concurrency: 32
duration_s: 300
warmup_s: 30
batch_size: 1
timeout_s: 30.0
```
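When choosing `concurrency` and `target_rps`, a quick sanity check with Little's law helps. The sketch below is a back-of-the-envelope calculation, not part of `sie_bench`. It assumes each client keeps at most one request in flight, and it ignores `warmup_s` because this page does not say whether warmup counts toward `duration_s`.

```python
def max_sustainable_latency_s(concurrency, target_rps):
    """Little's law: in-flight requests = rate * latency.

    With at most `concurrency` requests in flight (assumption: one per client),
    the target rate is only reachable while mean latency stays at or below
    concurrency / target_rps.
    """
    return concurrency / target_rps


def expected_requests(target_rps, duration_s):
    """Requests a constant-rate run would send if it holds the target rate."""
    return target_rps * duration_s


# Values taken from the scenario above.
scenario = {"concurrency": 32, "target_rps": 100, "duration_s": 300}

budget = max_sustainable_latency_s(scenario["concurrency"], scenario["target_rps"])
total = expected_requests(scenario["target_rps"], scenario["duration_s"])
print(f"mean latency budget: {budget:.2f}s, expected requests: {total}")
```

For the scenario above, 32 clients can sustain 100 RPS only while mean latency stays under about 0.32 seconds. If observed latency is higher, raise `concurrency` or lower `target_rps`.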

## Load Patterns

Source: [packages/sie_bench/src/sie_bench/loadtest.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/loadtest.py)

The load test harness supports four traffic patterns:

| Pattern | Behavior |
|---------|----------|
| **constant** | Fixed RPS throughout the test duration |
| **ramp** | Gradually increase from 0 to target RPS over `ramp_duration_s` |
| **step** | Step-wise increase at 25%, 50%, 75%, and 100% of target RPS |
| **spike** | Normal traffic with periodic spikes at `spike_multiplier` intensity |

### Pattern Examples

```yaml
# Constant load at 100 RPS
load_profile:
  pattern: constant
  target_rps: 100

# Ramp from 0 to 200 RPS over 60 seconds
load_profile:
  pattern: ramp
  target_rps: 200
  ramp_duration_s: 60

# Step through 25/50/75/100 RPS
load_profile:
  pattern: step
  target_rps: 100
  step_levels: [0.25, 0.5, 0.75, 1.0]

# Normal at 50 RPS with 10-second 3x spikes every 60 seconds
load_profile:
  pattern: spike
  target_rps: 50
  spike_multiplier: 3.0
  spike_duration_s: 10
  spike_interval_s: 60
```
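The four patterns can be read as functions from elapsed time to an intended request rate. The sketch below illustrates that reading and is not the implementation in `loadtest.py`. Two details are assumptions: step levels each receive an equal share of the test duration, and a spike starts at every multiple of `spike_interval_s` after the first interval.

```python
def target_rps_at(t, profile, duration_s):
    """Intended request rate at t seconds into the test (illustrative only)."""
    pattern = profile["pattern"]
    target = profile["target_rps"]

    if pattern == "constant":
        return float(target)

    if pattern == "ramp":
        # Linear ramp from 0 to target, then hold at target.
        return target * min(t / profile["ramp_duration_s"], 1.0)

    if pattern == "step":
        levels = profile["step_levels"]
        # Assumption: each level occupies an equal slice of the duration.
        idx = min(int(t / duration_s * len(levels)), len(levels) - 1)
        return target * levels[idx]

    if pattern == "spike":
        interval = profile["spike_interval_s"]
        # Assumption: spikes begin at t = interval, 2 * interval, ...
        in_spike = t >= interval and (t % interval) < profile["spike_duration_s"]
        return target * profile["spike_multiplier"] if in_spike else float(target)

    raise ValueError(f"unknown pattern: {pattern}")
```

With the spike profile above, the rate under these assumptions is 50 RPS at `t=30`, 150 RPS at `t=65`, and back to 50 RPS at `t=75`.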

## Matrix Evaluation

Source: [packages/sie_bench/src/sie_bench/matrix/config.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/matrix/config.py)

Matrix evaluation runs benchmarks across multiple models, tasks, and GPU types in parallel:

```bash
mise run sie-bench -- matrix config.yaml --cluster http://gateway:8080 --workers 2
```

### Matrix Configuration

```yaml
# config.yaml
models:
  - BAAI/bge-m3
  - model: NovaSearch/stella_en_400M_v5
    profiles: all
  - bundle: default
tasks:
  - mteb/NFCorpus
  - mteb/SciFact
gpus:
  - l4
  - a100-80gb
type:
  - quality
  - perf
perf:
  batch_size: 1
  concurrency: 16
  timeout: 120.0
```

Matrix mode creates isolated resource pools per GPU type and runs evaluations concurrently.
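To estimate how many runs a matrix produces, it can help to enumerate the combinations by hand. The sketch below is an illustration, not the scheduler in `sie_bench`: `expand_matrix` and `group_by_gpu` are hypothetical helpers, and it treats every model as a plain string, ignoring profiles and bundle expansion.

```python
from collections import defaultdict
from itertools import product


def expand_matrix(models, tasks, gpus, types):
    """Cartesian product of the matrix axes, one dict per evaluation run."""
    return [
        {"model": m, "task": t, "gpu": g, "type": k}
        for m, t, g, k in product(models, tasks, gpus, types)
    ]


def group_by_gpu(runs):
    """Group runs per GPU type, mirroring the idea of one pool per GPU type."""
    pools = defaultdict(list)
    for run in runs:
        pools[run["gpu"]].append(run)
    return dict(pools)


runs = expand_matrix(
    models=["BAAI/bge-m3", "NovaSearch/stella_en_400M_v5"],
    tasks=["mteb/NFCorpus", "mteb/SciFact"],
    gpus=["l4", "a100-80gb"],
    types=["quality", "perf"],
)
pools = group_by_gpu(runs)
print(len(runs), {gpu: len(items) for gpu, items in pools.items()})
```

Two models, two tasks, two GPU types, and two evaluation types yield 16 runs, 8 per GPU type. Bundles and `profiles: all` would multiply this further.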

### Model Specifications

Models can be specified in three ways:

| Format | Example | Description |
|--------|---------|-------------|
| String | `BAAI/bge-m3` | Single model with default profile |
| Dict with profiles | `{model: BAAI/bge-m3, profiles: all}` | Model with specific or all profiles |
| Bundle | `{bundle: default}` | All models in a bundle |
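One way to think about the three formats is that they all normalize to a list of model and profile pairs. The sketch below shows that idea under stated assumptions and is not the parser in `matrix/config.py`. The `normalize_model_spec` helper, the `bundles` lookup table, the `"default"` profile name, and keeping `"all"` as an unexpanded marker are all hypothetical.

```python
def normalize_model_spec(spec, bundles):
    """Turn one entry of `models:` into a list of {"model", "profiles"} dicts.

    `bundles` maps a bundle name to its model names (hypothetical lookup table).
    """
    if isinstance(spec, str):
        # Plain string: a single model with the default profile.
        return [{"model": spec, "profiles": ["default"]}]

    if "bundle" in spec:
        # Bundle: every model in the bundle, each with the default profile.
        return [
            {"model": name, "profiles": ["default"]}
            for name in bundles[spec["bundle"]]
        ]

    if "model" in spec:
        profiles = spec.get("profiles", ["default"])
        if isinstance(profiles, str):
            # A bare string such as "all" is kept as a one-element list.
            profiles = [profiles]
        return [{"model": spec["model"], "profiles": list(profiles)}]

    raise ValueError(f"unrecognized model spec: {spec!r}")
```

A caller could flatten a whole `models:` list by concatenating the results for each entry, then pass the flattened list to the run scheduler.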

## Load Test Reports

Source: [packages/sie_bench/src/sie_bench/loadtest_report.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/loadtest_report.py)

After a load test completes, the harness generates Markdown and JSON reports:

```bash
mise run sie-bench -- loadtest scenario.yaml --cluster http://gateway:8080 --output ./reports
```

Reports include:
- Configuration summary
- Overall request counts and success rate
- Latency percentiles (p50, p90, p95, p99, min, max, mean)
- Throughput in requests/sec and items/sec
- Per-model breakdown for multi-model tests
- ASCII time-series graphs for throughput and p99 latency
- Error breakdown by type

## What's Next

- [Quality Evaluation](/docs/evals/quality/) - Measure retrieval accuracy with NDCG and MAP metrics
- [SDK Reference](/docs/reference/sdk/) - Client options for timeout and batch configuration
