---
title: HTTP API Reference
description: Complete reference for all SIE HTTP endpoints.
canonical_url: https://superlinked.com/docs/reference/api
last_updated: 2026-05-20
---

This reference documents all HTTP endpoints exposed by the SIE server.

## Endpoint Summary

Source: [packages/sie_server/src/sie_server/api/](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/v1/encode/:model` | POST | Generate embeddings |
| `/v1/score/:model` | POST | Rerank items |
| `/v1/extract/:model` | POST | Extract entities and structured data |
| `/v1/models` | GET | List available models |
| `/v1/models/:model` | GET | Get model details |
| `/v1/embeddings` | POST | OpenAI-compatible embeddings |
| `/healthz` | GET | Liveness probe |
| `/readyz` | GET | Readiness probe |
| `/metrics` | GET | Prometheus metrics |
| `/ws/status` | WebSocket | Real-time worker status |

## Wire Format

Source: [packages/sie_server/src/sie_server/api/serialization.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/serialization.py)

SIE defaults to **msgpack** for efficient binary serialization. Msgpack payloads carry NumPy arrays natively and are ~37% smaller than the equivalent JSON.

**Content negotiation:**
- `Content-Type: application/msgpack` for requests
- `Accept: application/msgpack` for responses (default)
- `Accept: application/json` falls back to JSON

With JSON, NumPy arrays are converted to plain lists.
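
As a minimal sketch of the JSON fallback, the following builds a request with the Python standard library. The helper name `build_request` is illustrative, the base URL is taken from the curl examples in this reference, and the request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: local server, as in the curl examples


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that sends JSON and asks for a JSON response."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",  # format of the request body
        "Accept": "application/json",        # opt out of the msgpack default
    }
    return urllib.request.Request(BASE_URL + path, data=body, headers=headers, method="POST")


req = build_request("/v1/encode/BAAI/bge-m3", {"items": [{"text": "Hello, world!"}]})
# urllib.request.urlopen(req) would send it; the JSON body then parses with json.loads.
```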

---

## POST /v1/encode/:model

Source: [packages/sie_server/src/sie_server/api/encode.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/encode.py)

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

### Request Schema

```python
class EncodeRequest(TypedDict, total=False):
    items: list[Item]              # Required: items to encode
    params: EncodeParams           # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]        # 'dense', 'sparse', 'multivector'
    instruction: str               # Task instruction for query encoding
    output_dtype: str              # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]        # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str                        # Client-provided ID (echoed back)
    text: str                      # Text content
    images: list[ImageInput]       # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes                    # Image bytes
    format: str                    # 'jpeg', 'png', 'webp'
```

### Response Schema

```python
class EncodeResponse(TypedDict, total=False):
    model: str                     # Model name used
    items: list[EncodeResult]      # One result per input item
    timing: TimingInfo             # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str                        # Echoed item ID
    dense: DenseVector             # Dense embedding
    sparse: SparseVector           # Sparse embedding
    multivector: MultiVector       # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int                      # Vector dimensionality
    dtype: str                     # 'float32', 'float16', 'int8', 'binary'
    values: list[float]            # Vector values

class SparseVector(TypedDict, total=False):
    dims: int                      # Vocabulary size
    dtype: str                     # Data type
    indices: list[int]             # Non-zero dimension indices
    values: list[float]            # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int                # Per-token embedding dimension
    num_tokens: int                # Number of tokens
    dtype: str                     # Data type
    values: list[list[float]]      # Token embeddings
```

### Request Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `items` | `list[Item]` | Required | Items to encode |
| `params.output_types` | `list[str]` | `["dense"]` | Output types to return |
| `params.instruction` | `str` | None | Instruction prefix for query encoding |
| `params.output_dtype` | `str` | `"float32"` | Output precision |
| `params.options` | `dict` | None | Runtime options (profile, lora, etc.) |
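
The parameters above can be assembled client-side before sending. This is a sketch only: `build_encode_payload` is a hypothetical helper, and the client-side validation it performs is an assumption layered on the values listed in the schema, not a description of what the server checks.

```python
from typing import Any, Optional, Sequence

# Allowed values as listed in the EncodeParams schema.
OUTPUT_TYPES = {"dense", "sparse", "multivector"}
OUTPUT_DTYPES = {"float32", "float16", "int8", "binary"}


def build_encode_payload(
    texts: Sequence[str],
    output_types: Optional[Sequence[str]] = None,
    instruction: Optional[str] = None,
    output_dtype: Optional[str] = None,
) -> dict[str, Any]:
    """Build an encode request body; omitted params are left to server defaults."""
    params: dict[str, Any] = {}
    if output_types is not None:
        unknown = set(output_types) - OUTPUT_TYPES
        if unknown:
            raise ValueError(f"unknown output types: {sorted(unknown)}")
        params["output_types"] = list(output_types)
    if instruction is not None:
        params["instruction"] = instruction
    if output_dtype is not None:
        if output_dtype not in OUTPUT_DTYPES:
            raise ValueError(f"unknown output dtype: {output_dtype!r}")
        params["output_dtype"] = output_dtype
    payload: dict[str, Any] = {"items": [{"text": text} for text in texts]}
    if params:
        payload["params"] = params
    return payload


payload = build_encode_payload(
    ["Search query"],
    output_types=["dense", "sparse"],
    instruction="Represent this query for retrieval:",
)
```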

### Examples

**Basic encoding:**

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'
```

**Multiple output types:**

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'
```

**Response:**

```json
{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {
        "dims": 1024,
        "dtype": "float32",
        "values": [0.0234, -0.0891, 0.1234, ...]
      },
      "sparse": {
        "dims": 250002,
        "dtype": "float32",
        "indices": [101, 2023, 5789, ...],
        "values": [0.45, 0.32, 0.28, ...]
      }
    }
  ]
}
```
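
A sparse result stores only non-zero dimensions as parallel `indices` and `values` lists. One way to work with it client-side is sketched below; the helper names are hypothetical and the numbers are illustrative, not real model output.

```python
def sparse_to_dict(sparse: dict) -> dict[int, float]:
    """Map each non-zero dimension index to its weight."""
    return dict(zip(sparse["indices"], sparse["values"]))


def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product over the indices the two sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[index] for index, weight in a.items() if index in b)


# Illustrative values shaped like the "sparse" object in the response.
query = sparse_to_dict(
    {"dims": 250002, "dtype": "float32", "indices": [101, 2023], "values": [0.5, 0.25]}
)
doc = sparse_to_dict(
    {"dims": 250002, "dtype": "float32", "indices": [101, 5789], "values": [2.0, 1.0]}
)
score = sparse_dot(query, doc)  # only index 101 is shared
```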

---

## POST /v1/score/:model

Source: [packages/sie_server/src/sie_server/api/score.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/score.py)

Rerank items against a query using a cross-encoder model.

### Request Schema

```python
class ScoreRequest(TypedDict, total=False):
    query: Item                    # Required: query to score against
    items: list[Item]              # Required: items to score
    instruction: str               # Optional instruction
    options: dict[str, Any]        # Runtime options
```

### Response Schema

```python
class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None           # Echoed query ID
    scores: list[ScoreEntry]       # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None            # Echoed item ID
    score: float                   # Relevance score
    rank: int                      # Position (0 = most relevant)
```

### Example

```bash
curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'
```

**Response:**

```json
{
  "model": "BAAI/bge-reranker-v2-m3",
  "scores": [
    {"item_id": "doc-1", "score": 0.891, "rank": 0},
    {"item_id": "doc-2", "score": 0.023, "rank": 1}
  ]
}
```
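
Because each score entry carries the echoed `item_id`, results can be joined back to the original documents. The sketch below assumes every item was sent with an `id` (otherwise `item_id` may be `None`); `rank_documents` is a hypothetical helper.

```python
def rank_documents(documents: dict[str, str], response: dict) -> list[tuple[int, float, str]]:
    """Return (rank, score, text) tuples, most relevant first."""
    ordered = sorted(response["scores"], key=lambda entry: entry["rank"])
    return [(e["rank"], e["score"], documents[e["item_id"]]) for e in ordered]


documents = {
    "doc-1": "ML uses algorithms to learn from data.",
    "doc-2": "The weather is sunny today.",
}
response = {
    "model": "BAAI/bge-reranker-v2-m3",
    "scores": [
        {"item_id": "doc-1", "score": 0.891, "rank": 0},
        {"item_id": "doc-2", "score": 0.023, "rank": 1},
    ],
}
ranked = rank_documents(documents, response)
```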

---

## POST /v1/extract/:model

Source: [packages/sie_server/src/sie_server/api/extract.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/extract.py)

Extract structured data from items: entities, relations, classifications, or vision outputs.

### Request Schema

```python
class ExtractRequest(TypedDict, total=False):
    items: list[Item]              # Required: items to extract from
    params: ExtractParams          # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]              # Entity types for NER
    output_schema: dict            # JSON schema for structured extraction
    instruction: str               # Task instruction
    options: dict[str, Any]        # Runtime options (see below)
```

#### Per-request options

`params.options` is an adapter-specific dict. Currently supported keys:

| Key | Type | Default | Scope | Description |
|-----|------|---------|-------|-------------|
| `overflow_policy` | `"default"` \| `"truncate_text"` \| `"error"` | `"default"` | `gliclass-*` family | Controls behavior when `text + label_prompt` exceeds the model's context (512 tokens for `gliclass-{small,base,large}-v1.0`). `default` passes input through as-is (may surface as `INPUT_TOO_LONG` on these models). `truncate_text` truncates the end of `text` to fit while preserving labels. `error` always raises `INPUT_TOO_LONG` on overflow. |

### Response Schema

```python
class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]         # NER results
    relations: list[Relation]      # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]  # Object detection
    data: dict[str, Any]           # Structured extraction results

class Entity(TypedDict, total=False):
    text: str                      # Extracted span
    label: str                     # Entity type
    score: float                   # Confidence (0-1)
    start: int                     # Start character offset
    end: int                       # End character offset
    bbox: list[int]                # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str                      # Source entity
    tail: str                      # Target entity
    relation: str                  # Relation type
    score: float                   # Confidence

class Classification(TypedDict):
    label: str                     # Class label
    score: float                   # Probability
```

### Example

```bash
curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'
```

**Response:**

```json
{
  "model": "urchade/gliner_multi-v2.1",
  "items": [
    {
      "id": "item-0",
      "entities": [
        {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
        {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
        {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32}
      ]
    }
  ]
}
```
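
The `start` and `end` fields can be used to slice spans out of the original text. The sketch below assumes they are end-exclusive character offsets, which matches the example response; `entities_by_label` and its `min_score` threshold are hypothetical.

```python
from collections import defaultdict


def entities_by_label(text: str, result: dict, min_score: float = 0.0) -> dict[str, list[str]]:
    """Group extracted spans by label, dropping entities below min_score."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entity in result.get("entities", []):
        if entity["score"] < min_score:
            continue
        # Slice from the source text instead of trusting the echoed "text" field.
        grouped[entity["label"]].append(text[entity["start"]:entity["end"]])
    return dict(grouped)


text = "Tim Cook is the CEO of Apple Inc."
result = {
    "id": "item-0",
    "entities": [
        {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
        {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
        {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32},
    ],
}
grouped = entities_by_label(text, result)
confident = entities_by_label(text, result, min_score=0.9)
```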

**Example with `overflow_policy` on gliclass:**

```bash
curl -X POST http://localhost:8080/v1/extract/knowledgator/gliclass-small-v1.0 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "<long review text...>"}],
    "params": {
      "labels": ["positive", "negative", "neutral"],
      "options": {"overflow_policy": "truncate_text"}
    }
  }'
```

When the input exceeds the context cap and `overflow_policy` is `"error"` (or `"default"` on `gliclass-{small,base,large}-v1.0`), the server returns HTTP 400:

```json
{
  "detail": {
    "code": "INPUT_TOO_LONG",
    "message": "Item 0: observed 612 tokens (text=540, label_prompt=72) exceeds context cap 512 for knowledgator/gliclass-small-v1.0",
    "model": "knowledgator/gliclass-small-v1.0"
  }
}
```
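
A client may choose to retry an overflowing request with truncation enabled. The sketch below is one possible policy, not server behavior: `overflow_retry_payload` is a hypothetical helper that returns a modified copy of the request only when the original used the default policy, and `None` when no retry applies.

```python
import copy
from typing import Optional


def overflow_retry_payload(payload: dict, status: int, error_body: dict) -> Optional[dict]:
    """Return a copy of payload with truncation enabled, or None if no retry applies."""
    detail = error_body.get("detail")
    code = detail.get("code") if isinstance(detail, dict) else None
    if status != 400 or code != "INPUT_TOO_LONG":
        return None
    policy = payload.get("params", {}).get("options", {}).get("overflow_policy", "default")
    if policy != "default":
        return None  # caller asked for "error", or truncation was already requested
    retry = copy.deepcopy(payload)  # leave the caller's payload untouched
    retry.setdefault("params", {}).setdefault("options", {})["overflow_policy"] = "truncate_text"
    return retry


payload = {"items": [{"text": "<long review text...>"}], "params": {"labels": ["positive", "negative"]}}
error = {"detail": {"code": "INPUT_TOO_LONG", "model": "knowledgator/gliclass-small-v1.0"}}
retry = overflow_retry_payload(payload, 400, error)
```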

---

## GET /v1/models

Source: [packages/sie_server/src/sie_server/api/models.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/models.py)

List all available models with their capabilities.

### Response Schema

```python
class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str                      # Model name
    inputs: list[str]              # Supported inputs: text, image
    outputs: list[str]             # Supported outputs: dense, sparse, multivector, score
    dims: dict[str, int]           # Dimensions per output type
    loaded: bool                   # Whether model is in GPU memory
    max_sequence_length: int       # Maximum tokens
    profiles: dict[str, ProfileInfo]  # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool               # Whether this is the default profile
    output_types: list[str]        # Output types enabled by this profile
    output_similarity: dict[str, str]  # Similarity metrics per output type
```

### Example

```bash
curl -H "Accept: application/json" http://localhost:8080/v1/models
```

**Response:**

```json
{
  "models": [
    {
      "name": "BAAI/bge-m3",
      "inputs": ["text"],
      "outputs": ["dense", "sparse", "multivector"],
      "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024},
      "loaded": true,
      "max_sequence_length": 8192,
      "profiles": {}
    },
    {
      "name": "BAAI/bge-reranker-v2-m3",
      "inputs": ["text"],
      "outputs": ["score"],
      "dims": {},
      "loaded": false,
      "max_sequence_length": 8192,
      "profiles": {}
    }
  ]
}
```
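
The listing can be filtered client-side, for example to find models that produce a given output type. `models_with_output` below is a hypothetical helper operating on the response shown above.

```python
def models_with_output(response: dict, output: str, loaded_only: bool = False) -> list[str]:
    """Names of models that support the given output type."""
    return [
        model["name"]
        for model in response["models"]
        if output in model["outputs"] and (model["loaded"] or not loaded_only)
    ]


response = {
    "models": [
        {"name": "BAAI/bge-m3", "inputs": ["text"],
         "outputs": ["dense", "sparse", "multivector"], "loaded": True},
        {"name": "BAAI/bge-reranker-v2-m3", "inputs": ["text"],
         "outputs": ["score"], "loaded": False},
    ]
}
encoders = models_with_output(response, "dense")
loaded_rerankers = models_with_output(response, "score", loaded_only=True)
```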

---

## POST /v1/embeddings (OpenAI Compatible)

Source: [packages/sie_server/src/sie_server/api/openai_compat.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/openai_compat.py)

Drop-in replacement for OpenAI's embeddings API.

### Example

```bash
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'
```

**Response:**

```json
{
  "object": "list",
  "model": "BAAI/bge-m3",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0891, ...]
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}
```

Works with the OpenAI SDK, LangChain's `OpenAIEmbeddings`, and other OpenAI-compatible clients.
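
When handling the response without an SDK, each entry's `index` ties an embedding back to its position in `input`. The sketch below sorts defensively by `index` rather than assuming the server returns entries in order; `embeddings_in_input_order` is a hypothetical helper and the vectors are illustrative.

```python
def embeddings_in_input_order(response: dict) -> list[list[float]]:
    """Return embeddings sorted by their "index" field."""
    rows = sorted(response["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


texts = ["first", "second"]
response = {
    "object": "list",
    "model": "BAAI/bge-m3",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.5, 0.5]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
}
by_text = dict(zip(texts, embeddings_in_input_order(response)))
```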

---

## Health Endpoints

Source: [packages/sie_server/src/sie_server/api/health.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/health.py)

### GET /healthz

Liveness probe. Returns 200 if the server process is running.

```bash
curl http://localhost:8080/healthz
# "ok"
```

### GET /readyz

Readiness probe. Returns 200 if the server is ready to accept traffic.

```bash
curl http://localhost:8080/readyz
# "ok"
```
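
A startup script can poll `/readyz` before sending traffic. The sketch below is one way to do that with the standard library; `probe` and `wait_until_ready` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx status


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage against a local server (not executed here):
# wait_until_ready(lambda: probe("http://localhost:8080/readyz"))
```

Passing the check as a callable keeps the polling loop testable without a running server.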

---

## GET /metrics

Prometheus metrics endpoint.

### Available Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `sie_requests_total` | Counter | model, endpoint, status | Total request count |
| `sie_request_duration_seconds` | Histogram | model, endpoint, phase | Latency by phase |
| `sie_batch_size` | Histogram | model | Batch size distribution |
| `sie_tokens_processed_total` | Counter | model | Total tokens processed |
| `sie_queue_depth` | Gauge | model | Pending items per model |
| `sie_model_loaded` | Gauge | model, device | Model load status (1/0) |
| `sie_model_memory_bytes` | Gauge | model, device | GPU memory per model |
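
For quick checks without a Prometheus server, the scrape output can be parsed directly. The sketch below assumes the endpoint returns the standard Prometheus text exposition format; the sample text is illustrative, `gauge_by_model` is a hypothetical helper, and the simplified regexes do not handle escaped quotes or braces inside label values.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def gauge_by_model(text: str, metric: str) -> dict[str, float]:
    """Collect the value of one metric, keyed by its "model" label."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        values[labels.get("model", "")] = float(match.group("value"))
    return values


# Illustrative scrape output, not captured from a real server.
SAMPLE = """\
# HELP sie_queue_depth Pending items per model
# TYPE sie_queue_depth gauge
sie_queue_depth{model="BAAI/bge-m3"} 4
sie_queue_depth{model="BAAI/bge-reranker-v2-m3"} 0
sie_model_loaded{model="BAAI/bge-m3",device="cuda:0"} 1
"""
depths = gauge_by_model(SAMPLE, "sie_queue_depth")
```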

---

## WebSocket /ws/status

Source: [packages/sie_server/src/sie_server/api/ws.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/ws.py)

Real-time worker status stream. The server sends an update every 200 ms.

### Message Schema

```python
{
    "timestamp": float,            # Unix timestamp
    "gpu": str,                    # GPU type (e.g., "l4", "a100-80gb")
    "loaded_models": list[str],    # Currently loaded models
    "server": {
        "version": str,
        "uptime_seconds": int,
        "user": str,
        "working_dir": str,
        "pid": int
    },
    "gpus": [                      # Per-GPU metrics
        {
            "index": int,
            "name": str,
            "gpu_type": str,       # Normalized type (e.g., "l4", "a100-80gb")
            "utilization_percent": float,
            "memory_used_bytes": int,
            "memory_total_bytes": int,
            "memory_threshold_pct": float,
            "temperature_c": int
        }
    ],
    "models": [                    # Per-model status
        {
            "name": str,
            "state": str,          # "loaded", "loading", "unloading", "available"
            "device": str | None,
            "memory_bytes": int,
            "queue_depth": int,
            "queue_pending_items": int,
            "config": {...}        # Model configuration
        }
    ],
    "counters": {...},             # Prometheus counter metrics
    "histograms": {...}            # Prometheus histogram metrics
}
```

### Usage

```javascript
const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
    const status = JSON.parse(event.data);
    console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
};
```
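
Once a message has been received, it can be reduced to the few numbers a dashboard needs. The Python sketch below assumes each message arrives as JSON text (as the `JSON.parse` call above implies) and leaves the WebSocket connection itself to whichever client library is in use; `summarize_status` is a hypothetical helper and the sample message is illustrative.

```python
import json


def summarize_status(raw: str) -> dict:
    """Reduce one status message to memory use, queued items, and loading models."""
    status = json.loads(raw)
    memory = {}
    for gpu in status.get("gpus", []):
        total = gpu["memory_total_bytes"]
        memory[gpu["index"]] = gpu["memory_used_bytes"] / total if total else 0.0
    models = status.get("models", [])
    return {
        "memory_used_fraction": memory,  # keyed by GPU index
        "queued_items": sum(m["queue_pending_items"] for m in models),
        "loading": [m["name"] for m in models if m["state"] == "loading"],
    }


# Illustrative message containing only the fields used above.
raw = json.dumps({
    "gpus": [{"index": 0, "memory_used_bytes": 6 * 1024**3, "memory_total_bytes": 24 * 1024**3}],
    "models": [
        {"name": "BAAI/bge-m3", "state": "loaded", "queue_pending_items": 3},
        {"name": "BAAI/bge-reranker-v2-m3", "state": "loading", "queue_pending_items": 2},
    ],
})
summary = summarize_status(raw)
```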

---

## Error Responses

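All endpoints return errors in the same shape: a `detail` object with a machine-readable `code` and a human-readable `message`. Some errors add fields, such as `model` on `INPUT_TOO_LONG`.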

```json
{
  "detail": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'unknown-model' not found"
  }
}
```

### Error Codes

| Code | HTTP Status | Description |
|------|-------------|-------------|
| `MODEL_NOT_FOUND` | 404 | Requested model doesn't exist |
| `INVALID_INPUT` | 400 | Invalid request format |
| `INPUT_TOO_LONG` | 400 | Input exceeds model context (extract endpoint, gliclass family) |
| `MODEL_NOT_LOADED` | 503 | Model is not loaded or still loading |
| `LORA_LOADING` | 503 | LoRA adapter is still loading; retry after the delay given in the `Retry-After` header |
| `QUEUE_FULL` | 503 | Server overloaded, request queue is full |
| `DEPENDENCY_CONFLICT` | 409 | Model requires different bundle/dependencies |
| `INFERENCE_ERROR` | 500 | Error during model inference |
| `INTERNAL_ERROR` | 500 | Unexpected server error |
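
Clients typically branch on `code` to decide whether a retry is worthwhile. The sketch below is one possible client policy, not something the server prescribes: it treats the three 503 codes as retryable, assumes `Retry-After` (when present) holds a number of seconds, and falls back to a hypothetical default delay otherwise.

```python
from typing import Mapping, Optional

RETRYABLE_CODES = {"MODEL_NOT_LOADED", "LORA_LOADING", "QUEUE_FULL"}


def retry_delay(
    status: int,
    body: dict,
    headers: Mapping[str, str],
    default_delay: float = 1.0,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    detail = body.get("detail")
    code = detail.get("code") if isinstance(detail, dict) else None
    if status != 503 or code not in RETRYABLE_CODES:
        return None
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is None:
        return default_delay
    try:
        return float(retry_after)
    except ValueError:
        return default_delay  # e.g. an HTTP-date value, which this sketch does not parse


delay = retry_delay(503, {"detail": {"code": "LORA_LOADING"}}, {"Retry-After": "2"})
```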

---

## Response Headers

Timing and tracing information is included in response headers:

| Header | Description |
|--------|-------------|
| `X-Total-Time` | Total request time (ms) |
| `X-Queue-Time` | Time waiting in queue (ms) |
| `X-Tokenization-Time` | Preprocessing time (ms) |
| `X-Inference-Time` | GPU inference time (ms) |
| `X-Postprocessing-Time` | Postprocessing time (ms), only if > 0 |
| `X-Trace-ID` | OpenTelemetry trace ID for distributed tracing |
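
The timing headers can be collected into a single record for logging. The sketch below assumes each timing value is a plain decimal number of milliseconds; `timing_from_headers` and the output key names are hypothetical.

```python
from typing import Mapping

TIMING_HEADERS = {
    "X-Total-Time": "total_ms",
    "X-Queue-Time": "queue_ms",
    "X-Tokenization-Time": "tokenization_ms",
    "X-Inference-Time": "inference_ms",
    "X-Postprocessing-Time": "postprocessing_ms",
}


def timing_from_headers(headers: Mapping[str, str]) -> dict[str, float]:
    """Extract the timing headers that are present, matching names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    timing: dict[str, float] = {}
    for header, key in TIMING_HEADERS.items():
        value = lowered.get(header.lower())
        if value is not None:  # X-Postprocessing-Time is only sent when > 0
            timing[key] = float(value)
    return timing


timing = timing_from_headers(
    {"x-total-time": "12.5", "X-Queue-Time": "1.5", "X-Inference-Time": "9", "X-Trace-ID": "abc123"}
)
```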
