How to evaluate model quality and performance in SIE
SIE’s eval system measures two things: whether models produce correct outputs, and whether they do so within latency targets. Every supported model has baseline targets saved in its config. CI checks current results against those targets and fails when they drift, catching regressions before they reach production.
Why Does SIE Include Evals?
Models break silently. A dependency update, a driver change, or a code refactor can degrade embedding quality without triggering any errors. SIE solves this with benchmark-driven development:
- Capture targets. Run evals on a trusted source and save results as baseline targets in model configs.
- Check in CI. Automated pipelines compare current results against saved targets on every change.
- Fail on drift. If quality drops below 99% of target, or latency exceeds 250% of target, CI fails.
This approach catches regressions before they affect your search quality in production.
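The drift rule above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not SIE's implementation: the function name `check_drift`, its parameters, and the sample values are hypothetical. Only the 99% and 250% thresholds come from the text.

```python
def check_drift(quality, quality_target, latency_ms, latency_target_ms,
                quality_floor=0.99, latency_ceiling=2.50):
    """Return a list of failure messages; an empty list means the check passes.

    quality_floor and latency_ceiling mirror the 99% and 250% thresholds
    described above. All names here are illustrative assumptions.
    """
    failures = []
    # Quality regresses when it falls below 99% of the saved target.
    if quality < quality_floor * quality_target:
        failures.append(
            f"quality {quality:.4f} is below {quality_floor:.0%} of target {quality_target:.4f}"
        )
    # Latency regresses when it exceeds 250% of the saved target.
    if latency_ms > latency_ceiling * latency_target_ms:
        failures.append(
            f"latency {latency_ms:.1f} ms exceeds {latency_ceiling:.0%} of target {latency_target_ms:.1f} ms"
        )
    return failures
```

A CI step built this way would fail the build whenever the returned list is non-empty. The asymmetric thresholds reflect that quality scores should be nearly deterministic, while latency varies much more between runs and machines.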
What Is the Difference Between Quality and Performance Evals?
| Type | Metrics | When to run |
|---|---|---|
| quality | ndcg@10, map@10, f1, precision, recall | After model changes or dependency updates |
| perf | p50/p99 latency (ms), throughput (tok/s) | After infrastructure changes or config updates |
Quality evals verify that model outputs match expected retrieval or extraction results. Performance evals verify that latency SLAs and throughput targets are being met.
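To make the performance metrics concrete, here is a minimal sketch of how p50/p99 latency and throughput could be derived from raw timings. It assumes a nearest-rank percentile; the eval harness may compute percentiles differently (for example, with interpolation). The names `percentile` and `summarize` are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so very small sample sets still return a value.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, total_tokens, wall_seconds):
    """Summarize per-request latencies (ms) and overall throughput (tokens per second)."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "tok_per_s": total_tokens / wall_seconds,
    }
```

Reporting p99 alongside p50 matters because a regression that affects only the slowest requests can leave the median unchanged.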
How Do I Run Evals With the CLI?
SIE includes sie-bench, invoked through mise run eval:

# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality

# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf

# Compare SIE against TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei

# Compare SIE against published MTEB benchmark scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark

Common CLI Options
| Option | Description |
|---|---|
| -t, --task | Namespaced task (for example, mteb/NFCorpus or beir/SciFact) |
| --type | Evaluation type: quality or perf |
| -s, --sources | Comma-separated sources to compare (default: sie) |
| -b, --batch-size | Batch size for performance evaluation (default: 1) |
| -c, --concurrency | Concurrency level (default: 16) |
| -p, --profile | Named profile from model config (for example, sparse or muvera) |
| --save-targets | Save results from a source as baseline targets |
| --check-targets | Exit non-zero if results fall below targets |
How Do I Save and Check Targets in CI?
Capture baseline targets from a trusted source:

# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie

# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie

Run regression checks in CI:

# Check against saved targets (99% threshold)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets

# Check against past measurements (98% threshold, tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements

What Eval Sources Are Available?
Sources determine where results come from. The eval harness starts and stops servers automatically.
| Source | Description |
|---|---|
| sie | SIE inference server (default) |
| tei | Text Embeddings Inference by HuggingFace |
| infinity | Infinity embedding server |
| benchmark | Published scores from the MTEB leaderboard |
| targets | Saved targets from the model config |
| measurements | Past SIE measurements from the model config |
Frequently Asked Questions
What is NDCG and why does SIE use it?
NDCG (Normalised Discounted Cumulative Gain) measures retrieval quality by rewarding systems that rank relevant documents higher in the result list. NDCG@10, which scores only the top 10 results, is the main metric for retrieval tasks on the MTEB benchmark, so SIE results can be compared directly against published model scores.
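As an illustration of the metric described in the NDCG answer, the sketch below computes NDCG@k from a list of graded relevance labels in ranked order. It assumes linear gain with a log2 discount; some implementations use an exponential gain (2^rel − 1) instead, and the eval harness's exact variant is not stated here. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the returned documents, in ranked order.
    all_relevances: labels of every judged document for the query; when omitted,
    the ideal ranking is built from the returned documents only.
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

With these definitions, a ranking that places the only relevant document first scores higher than one that places it second, which is the position sensitivity the answer describes.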
How do I evaluate models on my own data?
SIE supports custom eval tasks. See Custom Evals for instructions on defining tasks against your own corpus and queries.
What happens when an eval fails in CI?
The --check-targets flag makes sie-bench exit with a non-zero code when results fall below 99% of saved targets. Your CI pipeline should treat this as a build failure. See Quality Evaluation for details.
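If your pipeline is driven from Python rather than a shell step, the exit-code behavior can be wrapped as in the sketch below. It uses only the flags shown on this page; the helper names `build_check_command` and `run_check` are hypothetical, and the sketch assumes `mise` is on the PATH of the CI runner.

```python
import subprocess


def build_check_command(model, task, eval_type="quality"):
    """Build the documented regression-check invocation as an argument list."""
    return [
        "mise", "run", "eval", model,
        "-t", task,
        "--type", eval_type,
        "-s", "sie,targets",
        "--check-targets",
    ]


def run_check(model, task, eval_type="quality"):
    """Run the check and return its exit code.

    A non-zero code means results fell below targets or the run itself failed.
    """
    completed = subprocess.run(build_check_command(model, task, eval_type))
    return completed.returncode
```

A CI entry point could then call `sys.exit(run_check("BAAI/bge-m3", "mteb/NFCorpus"))` so that a missed target fails the job.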
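Can I compare SIE against OpenAI or TEI on my benchmarks?
Yes, for any source listed in the sources table above. Pass -s sie,tei to compare against TEI, or -s sie,benchmark to compare against published MTEB scores; results are reported side by side. The eval harness manages server lifecycle automatically. See Performance Evaluation for a walkthrough.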