---
title: How to evaluate model quality and performance in SIE
description: "SIE's eval system measures model quality (NDCG, F1, AP) and inference performance (latency, throughput) against saved targets. CI fails automatically when results drift below those targets."
canonical_url: https://superlinked.com/docs/evals
last_updated: 2026-05-20
---

**SIE's eval system measures two things: whether models produce correct outputs, and whether they meet latency and throughput targets.** Every supported model has baseline targets saved in its config. CI checks current results against those targets and fails when they drift, catching regressions before they reach production.

---

## Why Does SIE Include Evals?

Models break silently. A dependency update, a driver change, or a code refactor can degrade embedding quality without triggering any errors. SIE solves this with benchmark-driven development:

1. **Capture targets.** Run evals on a trusted source and save results as baseline targets in model configs.
2. **Check in CI.** Automated pipelines compare current results against saved targets on every change.
3. **Fail on drift.** If quality drops below 99% of target, or latency exceeds 250% of target, CI fails.

This approach catches regressions before they affect your search quality in production.
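
The drift rule in step 3 comes down to two comparisons. The sketch below is illustrative only: `quality_ok` and `latency_ok` are hypothetical helper names, not part of `sie-bench`, and treating a result exactly on the boundary as a pass is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that mirror the drift thresholds described above.
# They are NOT part of sie-bench; they only show the arithmetic.

# quality_ok CURRENT TARGET -> exit 0 when CURRENT is at least 99% of TARGET
quality_ok() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN { exit !(cur >= 0.99 * tgt) }'
}

# latency_ok CURRENT_MS TARGET_MS -> exit 0 when CURRENT is at most 250% of TARGET
latency_ok() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN { exit !(cur <= 2.5 * tgt) }'
}

# ndcg@10 of 0.358 against a target of 0.360: the floor is 0.3564, so this passes
quality_ok 0.358 0.360 && echo "quality: pass"

# p50 of 30 ms against a target of 10 ms: the ceiling is 25 ms, so this fails
latency_ok 30 10 || echo "latency: fail"
```

The asymmetry is deliberate in the rule itself: quality must stay within 1% of the target, while latency is allowed to vary far more before CI fails.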

---

## What Is the Difference Between Quality and Performance Evals?

| Type | Metrics | When to run |
| --- | --- | --- |
| `quality` | ndcg@10, map@10, f1, precision, recall | After model changes or dependency updates |
| `perf` | p50/p99 latency (ms), throughput (tok/s) | After infrastructure changes or config updates |

Quality evals verify that model outputs match expected retrieval or extraction results. Performance evals verify that latency SLAs and throughput targets are being met.

---

## How Do I Run Evals With the CLI?

SIE includes `sie-bench`, invoked through `mise run eval`:

```bash
# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE against TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE against published MTEB benchmark scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark
```

### Common CLI Options

| Option | Description |
| --- | --- |
| `-t, --task` | Namespaced task (for example, `mteb/NFCorpus` or `beir/SciFact`) |
| `--type` | Evaluation type: `quality` or `perf` |
| `-s, --sources` | Comma-separated sources to compare (default: `sie`) |
| `-b, --batch-size` | Batch size for performance evaluation (default: 1) |
| `-c, --concurrency` | Concurrency level (default: 16) |
| `-p, --profile` | Named profile from model config (for example, `sparse` or `muvera`) |
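| `--save-targets` | Save results from a source as baseline targets |
| `--check-targets` | Exit non-zero if results fall below targets |
| `--save-measurements` | Save results from a source as measurements for regression detection |
| `--check-measurements` | Exit non-zero if results fall below past measurements |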
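
As a rough sketch of how these options combine, the script below sweeps a few batch sizes for a performance eval. `perf_sweep` and the `RUN` variable are hypothetical names introduced for illustration, and the batch sizes are arbitrary. Only the flags come from the table above. By default the script prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: sweep batch sizes for a perf eval.
# Prints each command by default; set RUN=1 to execute them.
perf_sweep() {
  local model="$1" task="$2" batch
  for batch in 1 8 32; do   # arbitrary example batch sizes
    local cmd=(mise run eval "$model" -t "$task" --type perf -b "$batch" -c 16)
    if [ "${RUN:-0}" = "1" ]; then
      "${cmd[@]}"           # run the real CLI
    else
      echo "${cmd[*]}"      # dry run: show what would be executed
    fi
  done
}

perf_sweep BAAI/bge-m3 mteb/NFCorpus
```

Running it with `RUN=1` requires a working SIE checkout with `mise` installed.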

---

## How Do I Save and Check Targets in CI?

Capture baseline targets from a trusted source:

```bash
# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie
```

Run regression checks in CI:

```bash
# Check against saved targets (99% threshold)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Check against past measurements (98% threshold, tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
```
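
A CI job usually checks more than one task. The sketch below is one possible way to gate a build on several target checks. `run_eval` and `check_tasks` are hypothetical helper names, and the task list is only an example. The script relies solely on the documented behavior that `--check-targets` exits non-zero on drift.

```bash
#!/usr/bin/env bash
# Hypothetical CI gate: run a target check per task and fail if any regress.

# Thin wrapper around the CLI so it can be stubbed out in tests.
run_eval() { mise run eval "$@"; }

# check_tasks MODEL TASK [TASK...] -> returns 1 if any task fails its check
check_tasks() {
  local model="$1"; shift
  local failed=0 task
  for task in "$@"; do
    # --check-targets makes the CLI exit non-zero when results drift
    if ! run_eval "$model" -t "$task" --type quality -s sie,targets --check-targets; then
      echo "regression: $model on $task" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example call (commented out so the sketch has no side effects):
# check_tasks BAAI/bge-m3 mteb/NFCorpus beir/SciFact || exit 1
```

Collecting failures instead of stopping at the first one means a single CI run reports every regressed task.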

---

## What Eval Sources Are Available?

Sources determine where results come from. The eval harness starts and stops servers automatically.

| Source | Description |
| --- | --- |
| `sie` | SIE inference server (default) |
| `tei` | Text Embeddings Inference by HuggingFace |
| `infinity` | Infinity embedding server |
| `benchmark` | Published scores from the MTEB leaderboard |
| `targets` | Saved targets from the model config |
| `measurements` | Past SIE measurements from the model config |

---

## Frequently Asked Questions

**What is NDCG and why does SIE use it?**
NDCG (Normalised Discounted Cumulative Gain) measures retrieval quality by rewarding systems that rank relevant documents higher in the result list. NDCG@10 is the main metric for retrieval tasks on the MTEB benchmark, so SIE results can be compared directly against published model scores.
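
For reference, one common formulation of the metric is sketched below, where `rel_i` is the graded relevance of the result at rank `i` and `IDCG@k` is the `DCG@k` of the ideal ordering. Some implementations use `2^{rel_i} - 1` as the gain instead, and this page does not state which variant SIE uses.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```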

**How do I evaluate models on my own data?**
SIE supports custom eval tasks. See [Custom Evals](https://superlinked.com/docs/evals/custom/) for instructions on defining tasks against your own corpus and queries.

**What happens when an eval fails in CI?**
The `--check-targets` flag makes `sie-bench` exit with a non-zero code when results fall below 99% of saved targets. Your CI pipeline should treat this as a build failure. See [Quality Evaluation](https://superlinked.com/docs/evals/quality/) for details.

**Can I compare SIE against OpenAI or TEI on my benchmarks?**
For TEI, yes. Pass `-s sie,tei` to compare the two servers side by side, or `-s sie,benchmark` to compare against published MTEB scores. The eval harness manages server lifecycle automatically. OpenAI is not among the sources listed on this page. See [Performance Evaluation](https://superlinked.com/docs/evals/performance/) for a walkthrough.
