How to evaluate model quality and performance in SIE

SIE’s eval system measures two things: whether models produce correct outputs, and whether they do so within latency targets. Every supported model has baseline targets saved in its config. CI checks current results against those targets and fails when they drift, catching regressions before they reach production.


Models break silently. A dependency update, a driver change, or a code refactor can degrade embedding quality without triggering any errors. SIE solves this with benchmark-driven development:

  1. Capture targets. Run evals on a trusted source and save results as baseline targets in model configs.
  2. Check in CI. Automated pipelines compare current results against saved targets on every change.
  3. Fail on drift. If quality drops below 99% of target, or latency exceeds 250% of target, CI fails.

This approach catches regressions before they affect your search quality in production.
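
The drift rule in step 3 can be sketched in a few lines of Python. This is only an illustration of the comparison, not SIE's implementation: the function and class names are hypothetical, the metric values are made up, and whether the real check treats the boundary as inclusive is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the text above; the names are hypothetical.
QUALITY_FLOOR = 0.99    # quality must stay at or above 99% of target
LATENCY_CEILING = 2.50  # latency must stay at or below 250% of target


@dataclass
class DriftResult:
    metric: str
    passed: bool
    ratio: float  # measured value divided by target value


def check_quality(metric: str, measured: float, target: float) -> DriftResult:
    """Higher is better: fail when measured falls below 99% of target."""
    ratio = measured / target
    return DriftResult(metric, ratio >= QUALITY_FLOOR, ratio)


def check_latency(metric: str, measured_ms: float, target_ms: float) -> DriftResult:
    """Lower is better: fail when measured exceeds 250% of target."""
    ratio = measured_ms / target_ms
    return DriftResult(metric, ratio <= LATENCY_CEILING, ratio)


# Made-up numbers, purely for illustration.
results = [
    check_quality("ndcg@10", measured=0.360, target=0.370),     # about 97.3% of target
    check_latency("p50_ms", measured_ms=30.0, target_ms=20.0),  # 150% of target
]
ci_ok = all(r.passed for r in results)  # one failing metric fails the run
```

In this made-up run the quality metric sits at roughly 97.3% of its target, so the run as a whole would fail even though latency is well within its 250% ceiling.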


What Is the Difference Between Quality and Performance Evals?

| Type | Metrics | When to run |
| --- | --- | --- |
| quality | ndcg@10, map@10, f1, precision, recall | After model changes or dependency updates |
| perf | p50/p99 latency (ms), throughput (tok/s) | After infrastructure changes or config updates |

Quality evals verify that model outputs match expected retrieval or extraction results. Performance evals verify that latency SLAs and throughput targets are being met.
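
For readers unfamiliar with the performance metrics, the sketch below shows one common way to derive p50/p99 latency and token throughput from raw timings. It assumes the nearest-rank percentile definition; the method sie-bench actually uses is not stated here, and the helper names and sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def throughput_tok_per_s(total_tokens, elapsed_s):
    """Tokens processed per second over the whole run."""
    return total_tokens / elapsed_s


# Made-up per-request latencies in milliseconds.
latencies_ms = [12.0, 15.0, 11.0, 40.0, 13.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With only five samples, the single slow request determines p99, which is why tail percentiles need many requests to be meaningful.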


SIE includes sie-bench, invoked through mise run eval:

# Quality evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality
# Performance evaluation
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type perf
# Compare SIE against TEI
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,tei
# Compare SIE against published MTEB benchmark scores
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,benchmark

| Option | Description |
| --- | --- |
| -t, --task | Namespaced task (for example, mteb/NFCorpus or beir/SciFact) |
| --type | Evaluation type: quality or perf |
| -s, --sources | Comma-separated sources to compare (default: sie) |
| -b, --batch-size | Batch size for performance evaluation (default: 1) |
| -c, --concurrency | Concurrency level (default: 16) |
| -p, --profile | Named profile from model config (for example, sparse or muvera) |
| --save-targets | Save results from a source as baseline targets |
| --check-targets | Exit non-zero if results fall below targets |

Capture baseline targets from a trusted source:

# Save SIE results as targets
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-targets sie
# Save measurements for regression detection
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality --save-measurements sie

Run regression checks in CI:

# Check against saved targets (99% threshold)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,targets --check-targets
# Check against past measurements (98% threshold, tighter margins)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality -s sie,measurements --check-measurements

Sources determine where results come from. The eval harness starts and stops servers automatically.

| Source | Description |
| --- | --- |
| sie | SIE inference server (default) |
| tei | Text Embeddings Inference by HuggingFace |
| infinity | Infinity embedding server |
| benchmark | Published scores from the MTEB leaderboard |
| targets | Saved targets from the model config |
| measurements | Past SIE measurements from the model config |

What is NDCG and why does SIE use it? NDCG (Normalised Discounted Cumulative Gain) measures retrieval quality by rewarding systems that return relevant documents higher in the result list. It is the standard metric on the MTEB benchmark, which makes it straightforward to compare SIE results directly against published model scores.
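
To make the metric concrete, here is a minimal NDCG@k sketch. It assumes the linear-gain form (relevance divided by log2 of rank + 1); some implementations use an exponential gain of 2^rel - 1 instead, and this is not taken from sie-bench's code. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results.
    relevances[i] is the graded relevance of the result at rank i + 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """DCG of the actual ranking divided by the DCG of the ideal ranking.
    Pass all_relevances when relevant documents may be missing from the
    ranked list; otherwise the ideal order is built from the ranked list."""
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([3, 2, 1, 0])  # already in ideal order
swapped = ndcg_at_k([0, 1])        # the relevant document is ranked second
```

A ranking already in ideal order scores 1.0, and pushing the only relevant document from rank 1 to rank 2 lowers the score to 1 / log2(3), about 0.63.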

How do I evaluate models on my own data? SIE supports custom eval tasks. See Custom Evals for instructions on defining tasks against your own corpus and queries.

What happens when an eval fails in CI? The --check-targets flag makes sie-bench exit with a non-zero code when results fall below 99% of saved targets. Your CI pipeline should treat this as a build failure. See Quality Evaluation for details.
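
Most CI systems already fail a step on a non-zero exit code, so no wrapper is strictly needed. If you want to add your own logging around the check, a thin Python wrapper might look like the sketch below. The command is copied from the regression-check example on this page; the wrapper itself, including the `run_check` and `main` names, is a hypothetical illustration and not part of SIE.

```python
import subprocess
import sys

# Command copied from the regression-check example in this document.
CHECK_CMD = [
    "mise", "run", "eval", "BAAI/bge-m3",
    "-t", "mteb/NFCorpus",
    "--type", "quality",
    "-s", "sie,targets",
    "--check-targets",
]


def run_check(cmd):
    """Run the command and return its exit code without raising on failure."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


def main(cmd=CHECK_CMD):
    """Return 0 on success; log and pass through the exit code on failure."""
    code = run_check(cmd)
    if code != 0:
        print(f"eval regression check failed (exit code {code})", file=sys.stderr)
    return code


# In a CI step, end the script with: sys.exit(main())
```

Passing the exit code through unchanged keeps the pipeline's pass/fail decision tied to the eval's own result.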

Can I compare SIE against OpenAI or TEI on my benchmarks? Yes. Pass -s sie,tei or -s sie,benchmark to compare sources side by side. The eval harness manages server lifecycle automatically. See Performance Evaluation for a walkthrough.
