---
title: Custom Evals
description: Create evaluation tasks for your own data in MTEB v2 format.
canonical_url: https://superlinked.com/docs/evals/custom
last_updated: 2026-05-20
---

Custom evals let you benchmark models on your domain-specific data. Create tasks in MTEB v2 format and run them alongside standard benchmarks.

## Custom Task Format

Source: [packages/sie_bench/src/sie_bench/tasks/loader.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/tasks/loader.py)

Custom tasks use the MTEB v2 format with three files:

| File | Format | Description |
|------|--------|-------------|
| `corpus.jsonl` | `{"_id": "doc1", "title": "optional", "text": "document text"}` | Documents to search |
| `queries.jsonl` | `{"_id": "q1", "text": "query text"}` | Queries to evaluate |
| `qrels/test.tsv` | `query-id<tab>corpus-id<tab>score` | Relevance judgments (0-3 scale) |

Example `corpus.jsonl`:
```json
{"_id": "doc1", "title": "ML Basics", "text": "Machine learning uses algorithms to learn from data."}
{"_id": "doc2", "text": "The weather forecast predicts rain tomorrow."}
```

Example `queries.jsonl`:
```json
{"_id": "q1", "text": "What is machine learning?"}
{"_id": "q2", "text": "How do neural networks work?"}
```

Example `qrels/test.tsv`:
```
q1	doc1	3
q1	doc2	0
```

Scores follow TREC conventions: 3 = highly relevant, 2 = relevant, 1 = marginally relevant, 0 = not relevant.
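
If your data lives in Python objects rather than files, a small helper can write the three files in the layout described above. This is a minimal sketch, not part of `sie_bench`: the function name `write_task` and its arguments are hypothetical, and it assumes qrels are written without a header row, matching the example above.

```python
import json
from pathlib import Path


def write_task(task_dir, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl, and qrels/test.tsv under task_dir.

    Hypothetical helper for illustration; not part of sie_bench.
    corpus and queries are lists of dicts, qrels is a list of
    (query_id, corpus_id, score) tuples.
    """
    task_dir = Path(task_dir)
    (task_dir / "qrels").mkdir(parents=True, exist_ok=True)

    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    with open(task_dir / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

    with open(task_dir / "queries.jsonl", "w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query, ensure_ascii=False) + "\n")

    # Tab-separated columns: query-id, corpus-id, score (assumed: no header row).
    with open(task_dir / "qrels" / "test.tsv", "w", encoding="utf-8") as f:
        for query_id, corpus_id, score in qrels:
            f.write(f"{query_id}\t{corpus_id}\t{score}\n")
```

Calling `write_task("evals/my-domain-task", corpus, queries, qrels)` would produce a directory with the three files in place.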

## Task Namespaces

Source: [packages/sie_bench/src/sie_bench/tasks/namespace.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/tasks/namespace.py)

Tasks use namespace prefixes to identify their source:

| Namespace | Description | Example |
|-----------|-------------|---------|
| `mteb/` | MTEB built-in tasks | `mteb/NFCorpus` |
| `beir/` | BEIR benchmark tasks (via MTEB) | `beir/SciFact` |
| `custom/` | Custom tasks from `evals/` directory | `custom/my-domain-task` |

The `custom/` namespace maps to the `evals/` directory in your project root.

## Adding Custom Tasks

Source: [packages/sie_bench/src/sie_bench/tasks/loader.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/tasks/loader.py)

Create a directory structure under `evals/`:

```
evals/
  my-domain-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv
```

Run your custom task using either the `custom/` namespace prefix or the direct `evals/` path:

```bash
# Using custom/ namespace prefix
mise run eval BAAI/bge-m3 -t custom/my-domain-task --type quality

# Using direct path
mise run eval BAAI/bge-m3 -t evals/my-domain-task --type quality
```

The loader auto-detects custom tasks by checking for the `custom/` prefix or `evals/` path.
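
The detection rule can be pictured roughly as follows. This is an illustrative sketch only; the function `resolve_custom_task` is hypothetical and is not the actual code in `loader.py`. It assumes `custom/<name>` maps to `evals/<name>` under the project root and that any other prefix belongs to a non-custom namespace.

```python
from pathlib import Path


def resolve_custom_task(task_name, project_root="."):
    """Return the directory for a custom task, or None for other namespaces.

    Hypothetical illustration of the prefix check; not the sie_bench loader.
    """
    root = Path(project_root)
    if task_name.startswith("custom/"):
        # custom/my-domain-task -> <root>/evals/my-domain-task
        return root / "evals" / task_name[len("custom/"):]
    if task_name.startswith("evals/"):
        # A direct path is already relative to the project root.
        return root / task_name
    # mteb/, beir/, and anything else are not custom tasks.
    return None
```

Under these assumptions, `custom/my-domain-task` and `evals/my-domain-task` resolve to the same directory, which is why both commands above run the same task.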

### Multiple Splits

The loader looks for qrels files under `qrels/` in this order: `test.tsv`, `train.tsv`, `dev.tsv`. If none of them exists, it falls back to `qrels.tsv` at the task root.

```
evals/
  my-task/
    corpus.jsonl
    queries.jsonl
    qrels/
      test.tsv   # Used by default
      train.tsv  # For training set evaluation
      dev.tsv    # For development set
```
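
The lookup order can be sketched like this. The function `find_qrels` is hypothetical and only mirrors the order described above; it assumes the split files live in the `qrels/` subdirectory and that the first existing file is the one used.

```python
from pathlib import Path

# Lookup order described in this section (assumed to apply inside qrels/).
QRELS_SPLIT_ORDER = ("test.tsv", "train.tsv", "dev.tsv")


def find_qrels(task_dir):
    """Return the first qrels file found for a task directory, or None.

    Hypothetical illustration; not the sie_bench loader itself.
    """
    task_dir = Path(task_dir)
    for name in QRELS_SPLIT_ORDER:
        candidate = task_dir / "qrels" / name
        if candidate.is_file():
            return candidate
    # Fallback: a single qrels.tsv at the task root.
    fallback = task_dir / "qrels.tsv"
    return fallback if fallback.is_file() else None
```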

### TREC Format Support

Both 3-column and 4-column (TREC) qrels formats are supported:

```
# 3-column: query-id, corpus-id, score
q1	doc1	3

# 4-column (TREC): query-id, 0, corpus-id, score
q1	0	doc1	3
```
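
A parser that accepts both layouts only needs to branch on the column count. The sketch below is an assumption about how this could be done, not the loader's actual code: `parse_qrels_line` is a hypothetical name, scores are assumed to be integers, and header rows or comment lines are not handled.

```python
def parse_qrels_line(line):
    """Parse one qrels line into (query_id, corpus_id, score).

    Returns None for blank lines. Hypothetical illustration; assumes
    tab-separated fields and integer scores.
    """
    line = line.strip()
    if not line:
        return None
    fields = line.split("\t")
    if len(fields) == 3:
        # 3-column: query-id, corpus-id, score
        query_id, corpus_id, score = fields
    elif len(fields) == 4:
        # 4-column (TREC): query-id, iteration (unused), corpus-id, score
        query_id, _iteration, corpus_id, score = fields
    else:
        raise ValueError(f"expected 3 or 4 tab-separated columns, got {len(fields)}")
    return query_id, corpus_id, int(score)
```

Both example lines above would parse to the same tuple, `("q1", "doc1", 3)`.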

## W&B Integration

Source: [packages/sie_bench/src/sie_bench/tracking/wandb.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/tracking/wandb.py)

Log evaluation results to Weights & Biases for experiment tracking. W&B is well suited to comparing model configurations and running A/B tests.

```bash
# Basic logging
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --wandb-project sie-evals

# With team/entity
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --wandb-project sie-evals --wandb-entity my-team
```

Install `wandb` and log in before running evals with W&B logging:

```bash
pip install wandb
wandb login
```

**W&B dashboard tips:**
- Filter by model tag to compare different models
- Use parallel coordinates to visualize metric trade-offs
- Compare runs with different LoRA adapters
- Filter by task tag to see performance on specific benchmarks

## MLflow Integration

Source: [packages/sie_bench/src/sie_bench/tracking/mlflow.py](https://github.com/superlinked/sie/blob/main/packages/sie_bench/src/sie_bench/tracking/mlflow.py)

Log to MLflow for self-hosted experiment tracking. MLflow works with local storage or a remote tracking server.

```bash
# Local tracking (saves to ./mlruns)
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --mlflow-experiment embedding-evals

# Remote MLflow server
mise run eval BAAI/bge-m3 -t mteb/NFCorpus --type quality \
  --mlflow-experiment embedding-evals \
  --mlflow-uri http://mlflow.internal:5000
```

Install `mlflow` before running evals with MLflow logging:

```bash
pip install mlflow
```

**MLflow notes:**
- Parameters are flattened automatically (nested dicts become dot notation)
- Artifacts are stored in the configured artifact store (local, S3, GCS, or Azure)
- Run URLs only work with a tracking server, not local file storage
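
To make the flattening note concrete, here is a minimal sketch of turning nested dicts into dot-notation keys. The function `flatten_params` and the sample parameter names are hypothetical; the actual implementation in `mlflow.py` may differ in details such as how lists or non-string keys are handled.

```python
def flatten_params(params, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dot-notation keys.

    Hypothetical illustration of the flattening described above.
    """
    flat = {}
    for key, value in params.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the prefix along.
            flat.update(flatten_params(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Example with made-up parameter names.
nested = {"model": {"name": "BAAI/bge-m3", "lora": {"rank": 8}}, "task": "mteb/NFCorpus"}
flat = flatten_params(nested)
```

Here `flat` contains the keys `model.name`, `model.lora.rank`, and `task`, which is the shape a flat parameter store expects.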

## What's Next

- [Evals Overview](/docs/evals/) - benchmark-driven development philosophy
- [Performance Evals](/docs/evals/performance/) - latency and throughput benchmarks
