How to evaluate model quality and performance in SIE

SIE’s eval system measures two things: whether models produce correct outputs, and whether they do so within latency targets. Every supported model has baseline targets saved in its config. CI checks current results against those targets and fails when they drift, catching regressions before they reach production.

Why Does SIE Include Evals?

Models break silently. A dependency update, a driver change, or a code refactor can degrade embedding quality without triggering any errors. SIE solves this with benchmark-driven development:

Capture targets. Run evals on a trusted source and save results as baseline targets in model configs.
Check in CI. Automated pipelines compare current results against saved targets on every change.
Fail on drift. If quality drops below 99% of target, or latency exceeds 250% of target, CI fails.

This approach catches regressions before they affect your search quality in production.
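The drift rule above can be sketched in a few lines of Python. This is an illustration only, not SIE's implementation: the function names are hypothetical, and treating the thresholds as inclusive is an assumption. Only the 99% and 250% figures come from the text.

```python
# Thresholds taken from the text: 99% of target for quality, 250% for latency.
QUALITY_FLOOR = 0.99
LATENCY_CEILING = 2.50


def quality_ok(measured: float, target: float) -> bool:
    """Higher is better: pass while the score stays at or above 99% of target."""
    return measured >= QUALITY_FLOOR * target


def latency_ok(measured_ms: float, target_ms: float) -> bool:
    """Lower is better: pass while latency stays at or below 250% of target."""
    return measured_ms <= LATENCY_CEILING * target_ms


def ci_passes(quality: tuple[float, float], latency_ms: tuple[float, float]) -> bool:
    """Each argument is a (measured, target) pair; CI fails if either check fails."""
    return quality_ok(*quality) and latency_ok(*latency_ms)
```

With a quality target of 0.350 and a latency target of 10 ms, a run that scores 0.349 at 20 ms passes, while a run that scores 0.340, or one that takes 26 ms, fails.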

What Is the Difference Between Quality and Performance Evals?

Type	Metrics	When to run
`quality`	ndcg@10, map@10, f1, precision, recall	After model changes or dependency updates
`perf`	p50/p99 latency (ms), throughput (tok/s)	After infrastructure changes or config updates

Quality evals verify that model outputs match expected retrieval or extraction results. Performance evals verify that latency SLAs and throughput targets are being met.
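To make the performance metrics concrete, here is a minimal sketch of how p50/p99 latency and throughput could be derived from raw timings. The source does not say which percentile method sie-bench uses. The nearest-rank method and the helper names below are assumptions chosen for simplicity.

```python
import math


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Multiply before dividing so integer inputs give an exact rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def throughput_tok_per_s(total_tokens: int, elapsed_s: float) -> float:
    """Tokens processed per second of wall-clock time."""
    return total_tokens / elapsed_s
```

p99 is far more sensitive to outliers than p50, so a handful of slow requests can move it well above the median.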

How Do I Run Evals With the CLI?

SIE includes `sie-bench`, invoked through `mise run eval`:

# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE against TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE against published MTEB benchmark scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark

Common CLI Options

Option	Description
`-t, --task`	Namespaced task (for example, `mteb/NFCorpus` or `beir/SciFact`)
`--type`	Evaluation type: `quality` or `perf`
`-s, --sources`	Comma-separated sources to compare (default: `sie`)
`-b, --batch-size`	Batch size for performance evaluation (default: 1)
`-c, --concurrency`	Concurrency level (default: 16)
`-p, --profile`	Named profile from model config (for example, `sparse` or `muvera`)
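`--save-targets`	Save results from the named source as baseline targets (for example, `--save-targets sie`)
`--save-measurements`	Save results from the named source as measurements for regression detection
`--check-targets`	Exit non-zero if results fall below saved targets
`--check-measurements`	Exit non-zero if results fall below past measurements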

How Do I Save and Check Targets in CI?

Capture baseline targets from a trusted source:

# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie

Run regression checks in CI:

# Check against saved targets (99% threshold)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Check against past measurements (98% threshold, tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements
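If your CI steps are written in Python rather than shell, the target check can be wrapped as below. This is a sketch under assumptions: the helper names are hypothetical, and the command is simply the `--check-targets` invocation shown above, assembled as an argument list. The `runner` parameter exists only so the wrapper can be exercised without `mise` installed.

```python
import subprocess


def build_eval_command(model: str, task: str, eval_type: str = "quality") -> list[str]:
    """Assemble the target-check invocation shown above as an argv list."""
    return [
        "mise", "run", "eval", model,
        "-t", task,
        "--type", eval_type,
        "-s", "sie,targets",
        "--check-targets",
    ]


def run_check(model: str, task: str, eval_type: str = "quality",
              runner=subprocess.run) -> int:
    """Run the check and return its exit code; non-zero means results fell below targets."""
    completed = runner(build_eval_command(model, task, eval_type), check=False)
    return completed.returncode
```

A CI script would typically pass the result straight to `sys.exit`, for example `sys.exit(run_check("BAAI/bge-m3", "mteb/NFCorpus"))`, so that a failed check fails the build.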

What Eval Sources Are Available?

Sources determine where results come from. The eval harness starts and stops servers automatically.

Source	Description
`sie`	SIE inference server (default)
`tei`	Text Embeddings Inference by Hugging Face
`infinity`	Infinity embedding server
`benchmark`	Published scores from the MTEB leaderboard
`targets`	Saved targets from the model config
`measurements`	Past SIE measurements from the model config

Frequently Asked Questions

What is NDCG and why does SIE use it? NDCG (Normalised Discounted Cumulative Gain) measures retrieval quality by rewarding systems that return relevant documents higher in the result list. NDCG@10, which scores only the top 10 results, is the main metric for retrieval tasks on the MTEB benchmark, so SIE results can be compared directly against published model scores.
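For readers who want to see the metric itself, here is a minimal NDCG@k sketch. It uses the linear-gain formulation (relevance divided by a log2 position discount). The function names are illustrative and are not part of SIE.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (position 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], all_relevances: list[float], k: int = 10) -> float:
    """DCG of the system's ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance of each returned document, in ranked order.
    all_relevances: relevance of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal
```

A perfect ranking scores 1.0. Pushing the only relevant document from position 1 to position 2 drops the score to 1/log2(3), roughly 0.63, which shows how strongly the metric rewards placing relevant results first.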

How do I evaluate models on my own data? SIE supports custom eval tasks. See Custom Evals for instructions on defining tasks against your own corpus and queries.

What happens when an eval fails in CI? The `--check-targets` flag makes `sie-bench` exit with a non-zero code when results fall below 99% of saved targets. Your CI pipeline should treat this as a build failure. See Quality Evaluation for details.

Can I compare SIE against OpenAI or TEI on my benchmarks? Yes. Pass `-s sie,tei` to run SIE and TEI side by side, or `-s sie,benchmark` to compare SIE against published MTEB scores. The eval harness manages server lifecycle automatically. See Performance Evaluation for a walkthrough.