---
title: Quality Evaluation
description: Run MTEB benchmarks to measure and verify embedding quality.
canonical_url: https://superlinked.com/docs/evals/quality
last_updated: 2026-05-20
---

Quality evaluation runs MTEB tasks against your SIE server and reports standard retrieval metrics such as NDCG@10, MAP@10, and MRR@10.

## MTEB/BEIR Tasks

Source: [packages/sie_bench/src/sie_bench/eval/runner.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/eval/runner.py)

SIE supports all MTEB retrieval tasks. Task names follow the format `namespace/task`, with an optional `/subset` suffix to select a subset such as a language.

```bash
# Standard MTEB retrieval tasks
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality
mise run eval BAAI/bge-m3 -t mteb/NanoFiQA2018Retrieval --type quality

# BEIR namespace
mise run eval BAAI/bge-m3 -t beir/SciFact --type quality

# Multilingual tasks with language subset
mise run eval BAAI/bge-m3 -t mteb/Vidore3HrRetrieval/english --type quality
```
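The task names in these commands can be split into their parts with a few lines of Python. This is a minimal sketch of the naming convention only; `TaskRef` and `parse_task` are hypothetical names for illustration and are not taken from `sie_bench`.

```python
from typing import NamedTuple, Optional


class TaskRef(NamedTuple):
    namespace: str          # e.g. "mteb" or "beir"
    task: str               # e.g. "NFCorpus"
    subset: Optional[str]   # e.g. "english", or None when omitted


def parse_task(name: str) -> TaskRef:
    """Split 'namespace/task[/subset]' into its parts (hypothetical helper)."""
    parts = name.split("/")
    # Accept exactly two or three non-empty segments.
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected namespace/task[/subset], got {name!r}")
    subset = parts[2] if len(parts) == 3 else None
    return TaskRef(parts[0], parts[1], subset)


print(parse_task("mteb/NFCorpus"))
print(parse_task("mteb/Vidore3HrRetrieval/english"))
```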

Common retrieval tasks:

| Task | Domain | Size | Description |
|------|--------|------|-------------|
| mteb/NFCorpus | Medical | 3.6K docs | Biomedical literature retrieval |
| mteb/NanoFiQA2018Retrieval | Finance | 57K docs | Financial question answering |
| beir/SciFact | Scientific | 5K docs | Claim verification |
| mteb/MSMARCO | Web | 8.8M docs | Web search queries |

## Running Quality Evals

Source: [packages/sie_bench/src/sie_bench/cli.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/cli.py)

Run quality evaluation with the `--type quality` flag. The eval harness starts and stops servers automatically.

```bash
# Basic quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Evaluate with a specific profile
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --profile sparse
```

The output reports one row per source and one column per metric:

```
## Evaluating BAAI/bge-m3 on mteb/NFCorpus (quality)
Sources: sie

Source   ndcg_at_10  map_at_10   mrr_at_10
sie      0.3144      0.1174      0.5243
```
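To make the three columns concrete, here is a minimal sketch of how each metric is commonly computed for a single query. It assumes binary relevance and hypothetical document IDs; the eval harness may use graded relevance or a different implementation. The reported score is normally the mean over all queries in the task.

```python
import math


def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mrr_at_k(ranked_ids, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked_ids, relevant, k=10):
    """Average precision at k; MAP@k is the mean of this value over queries."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical ranking for one query: relevant documents sit at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(ndcg_at_k(ranked, relevant), ap_at_k(ranked, relevant), mrr_at_k(ranked, relevant))
```

In this example the first relevant document is at rank 2, so the reciprocal rank is 0.5, and the average precision is (1/2 + 2/4) / 2 = 0.5.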

## Comparing Sources

Compare SIE against other inference backends or published benchmarks using the `-s` flag.

```bash
# Compare SIE vs TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE vs published MTEB leaderboard scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark

# Compare SIE vs stored targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets
```

Available sources:

| Source | Description |
|--------|-------------|
| `sie` | SIE server (started automatically) |
| `tei` | HuggingFace Text Embeddings Inference |
| `infinity` | Infinity inference server |
| `fastembed` | FastEmbed library |
| `benchmark` | Published MTEB leaderboard scores |
| `targets` | Stored targets from model config |
| `measurements` | Past SIE measurements from model config |

## Targets in Configs

Source: [packages/sie_bench/src/sie_bench/targets/capture.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/targets/capture.py)

Each model config stores quality targets under the `targets.quality` section. Targets come from authoritative sources like the MTEB leaderboard or comparison runs.

```yaml
# packages/sie_server/models/BAAI__bge-m3.yaml
targets:
  quality:
    mteb-leaderboard/mteb/NFCorpus:
      ndcg_at_10: 0.3141
      map_at_10: 0.1172
      mrr_at_10: 0.5232
```

Keys use the format `source/namespace/task`, where `source` identifies the origin of the scores (e.g., `mteb-leaderboard`, `tei@1.8.3`).
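A target key can be taken apart as follows. This is a sketch of the key convention only: `parse_target_key` is a hypothetical helper, and treating the part after `@` as a version is an assumption based on the `tei@1.8.3` example.

```python
def parse_target_key(key: str) -> dict:
    """Split 'source/namespace/task' into parts (hypothetical helper)."""
    # Split at most twice so a task that contains a subset stays intact.
    source, namespace, task = key.split("/", 2)
    # Assumption: an optional '@' suffix on the source carries a version.
    name, _, version = source.partition("@")
    return {
        "source": name,
        "version": version or None,
        "namespace": namespace,
        "task": task,
    }


print(parse_target_key("mteb-leaderboard/mteb/NFCorpus"))
print(parse_target_key("tei@1.8.3/mteb/NFCorpus"))
```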

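Measurements from SIE runs are stored separately under `measurements.quality`. Their keys include the git commit hash in the source (`sie@11a9c5d`) and an extra segment (`default` in this example) ahead of the namespace: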

```yaml
measurements:
  quality:
    sie@11a9c5d/default/mteb/NFCorpus:
      ndcg_at_10: 0.31437
      map_at_10: 0.11743
      mrr_at_10: 0.5243
```
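Because targets and measurements use the same metric names, they are easy to compare. The sketch below copies the values from the two YAML snippets above into plain dicts and prints each measurement as a ratio of its target; `relative_to_target` is a hypothetical helper, not part of `sie_bench`.

```python
# Values copied from the YAML examples above.
targets = {"ndcg_at_10": 0.3141, "map_at_10": 0.1172, "mrr_at_10": 0.5232}
measurements = {"ndcg_at_10": 0.31437, "map_at_10": 0.11743, "mrr_at_10": 0.5243}


def relative_to_target(measured: dict, target: dict) -> dict:
    """Return measured/target for every metric present in both dicts."""
    return {m: measured[m] / target[m] for m in target if m in measured}


ratios = relative_to_target(measurements, targets)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.4f}x target")
```

A ratio above 1.0 means the SIE measurement is at or above the stored target for that metric, which is the case for all three metrics here.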

## Saving Targets

Capture results from any source and save them as targets using `--save-targets`.

```bash
# Save TEI results as quality targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets tei

# Save MTEB benchmark scores as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets benchmark
```

Save SIE results as measurements (for tracking your own baselines):

```bash
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie
```

Saved metrics include `ndcg_at_10`, `map_at_10`, and `mrr_at_10`. The source identifier and git commit hash are recorded for traceability.

## CI Integration

Use `--check-targets` in CI to catch quality regressions. The command exits non-zero if SIE scores fall below the target thresholds.

```bash
# CI command: fails if quality regresses
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets
```

SIE must achieve at least 99% of the target score (configurable via `quality_margin`). For example, a target of 0.3141 gives a threshold of 0.3141 × 0.99 ≈ 0.3110. Example output:

```
  PASS: ndcg_at_10: 0.3144 >= 0.3110 (target: 0.3141)
  PASS: map_at_10: 0.1174 >= 0.1160 (target: 0.1172)

Target check PASSED
```
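The threshold arithmetic behind this output can be reproduced in a few lines. This is a sketch of the margin check as described here, not the harness code: `check_targets` is a hypothetical function, its `margin` argument stands in for `quality_margin`, and the `FAIL` line format is an assumption.

```python
def check_targets(scores: dict, targets: dict, margin: float = 0.99):
    """Compare each score against target * margin (hypothetical helper)."""
    lines, all_passed = [], True
    for metric, target in targets.items():
        threshold = target * margin
        score = scores[metric]
        passed = score >= threshold
        all_passed = all_passed and passed
        status = "PASS" if passed else "FAIL"
        op = ">=" if passed else "<"
        lines.append(
            f"{status}: {metric}: {score:.4f} {op} {threshold:.4f} (target: {target:.4f})"
        )
    return all_passed, lines


ok, report = check_targets(
    scores={"ndcg_at_10": 0.3144, "map_at_10": 0.1174},
    targets={"ndcg_at_10": 0.3141, "map_at_10": 0.1172},
)
print("\n".join(report))
print("Target check PASSED" if ok else "Target check FAILED")
```

A CI wrapper would turn the boolean into the process exit code, for example with `sys.exit(0 if ok else 1)`.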

To detect regressions against past SIE runs rather than external targets, use `--check-measurements`:

```bash
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```

This check uses a 98% margin: SIE must reach at least 98% of the stored measurement, which catches regressions in your own implementation.

## What's Next

- [Performance Evaluation](/docs/evals/performance/) - Measure throughput and latency
- [Model Catalog](/models) - Supported models and their targets
