---
title: Architecture
description: High-level architecture of SIE from client SDK to GPU inference.
canonical_url: https://superlinked.com/docs/engine/architecture
last_updated: 2026-05-20
---

SIE is a layered system: SDK clients talk to a gateway (or directly to a single server), workers execute inference, and a dedicated config service owns runtime model configuration.

## System Overview

![SIE system architecture: Client, Gateway, and Worker layers](/diagrams/system-arch.svg)

In production Kubernetes deployments, the hot path is intentionally separate from the config control plane:

```text
Client SDK
  -> sie-gateway (Rust, stateless inference edge)
  -> NATS JetStream queue
  -> sie-server workers
  -> NATS Core result inbox
  -> sie-gateway response

Admin tooling
  -> sie-config (Python, single-writer config control plane)
  -> config store + NATS config deltas
  -> gateways and workers converge asynchronously
```

---

## Components

### Client SDK

Source: [packages/sie_sdk/src/sie_sdk/client/sync.py](https://github.com/superlinked/sie/blob/main/packages/sie_sdk/src/sie_sdk/client/sync.py)

The SDK provides `encode()`, `score()`, and `extract()` methods. It handles:

- **msgpack serialization**: Binary wire format, faster and smaller than JSON
- **Automatic 202 retry**: With `wait_for_capacity=True`, retries `202 Accepted` responses while workers scale up from zero
- **Pool management**: Background lease renewal for resource pools
- **Numpy integration**: Returns native numpy arrays for embeddings

Framework integrations (LangChain, LlamaIndex, etc.) wrap the SDK with framework-specific interfaces.
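
The 202 retry behavior can be pictured as a small loop. This is a minimal sketch, not the SDK's implementation: `Response`, `send_with_capacity_wait`, and `max_wait_s` are hypothetical names, and only the `wait_for_capacity` flag and the `202` / `Retry-After` contract come from this page.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    """Hypothetical response shape used only for this sketch."""
    status: int
    retry_after: float = 0.0  # seconds, parsed from the Retry-After header
    body: bytes = b""


def send_with_capacity_wait(
    send: Callable[[], Response],
    wait_for_capacity: bool = True,
    max_wait_s: float = 300.0,  # assumed upper bound, not an SDK default
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Resend the request while the server answers 202 Accepted."""
    waited = 0.0
    while True:
        resp = send()
        # Anything other than 202 is final; so is 202 when waiting is disabled.
        if resp.status != 202 or not wait_for_capacity:
            return resp
        delay = max(resp.retry_after, 0.1)  # never spin without a pause
        if waited + delay > max_wait_s:
            raise TimeoutError("no capacity became available in time")
        sleep(delay)
        waited += delay
```

Injecting `send` and `sleep` keeps the loop independent of any HTTP library and easy to test.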

### Gateway

Source: [packages/sie_gateway/src/handlers/proxy.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/proxy.rs)

The gateway is a stateless Rust service that sits between clients and workers. It is optional for single-server setups but required for elastic Kubernetes clusters.

**Responsibilities:**
- Resolves model, bundle, machine profile, and pool from its in-memory registry
- Publishes inference work to NATS JetStream instead of proxying directly over HTTP
- Returns `202 Accepted` with `Retry-After` when the target worker group is scaled to zero
- Serves read-side config endpoints from its local registry mirror
- Manages resource pools for capacity isolation
- Tracks worker health and bundle config hashes from WebSocket status streams

The gateway does not own config writes. `POST /v1/configs/models`, `GET /v1/configs/export`, and `GET /v1/configs/epoch` belong to `sie-config`.
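
The routing decision described above can be sketched as follows. This is an illustrative Python model, not the Rust gateway: the `Gateway` class, the `work.<bundle>` subject name, the `404` for unknown models, and the `Retry-After` value of 5 seconds are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Toy model of the gateway's routing decision (hypothetical)."""
    model_to_bundle: dict[str, str]   # in-memory registry: model -> bundle
    ready_workers: dict[str, int]     # bundle -> healthy worker count
    # Stand-in for NATS JetStream: (subject, payload) pairs.
    published: list[tuple[str, bytes]] = field(default_factory=list)

    def route(self, model: str, payload: bytes) -> tuple[int, dict[str, str]]:
        bundle = self.model_to_bundle.get(model)
        if bundle is None:
            return 404, {}  # assumed behavior for unknown models
        if self.ready_workers.get(bundle, 0) == 0:
            # Worker group is scaled to zero: ask the client to retry later.
            return 202, {"Retry-After": "5"}
        # Publish to the queue instead of proxying over HTTP.
        self.published.append((f"work.{bundle}", payload))
        # A real gateway would respond once the worker's result arrives.
        return 200, {}
```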

### Config Service

Source: [packages/sie_config/src/sie_config/config_api.py](https://github.com/superlinked/sie/blob/main/packages/sie_config/src/sie_config/config_api.py)

`sie-config` is the authoritative config control plane. It runs as a single writer, persists API-added model configs, and publishes runtime config deltas:

- `POST /v1/configs/models` appends new models or profiles.
- `GET /v1/configs/export` gives gateways a full snapshot for bootstrap and drift recovery.
- `GET /v1/configs/epoch` exposes the authoritative model-write epoch and bundle-set hash.
- `GET /v1/configs/bundles{,/{id}}` lets gateways fetch the bundle set baked into the `sie-config` image.

Gateways bootstrap from `sie-config`, subscribe to `sie.config.models._all` for live deltas, and poll `/v1/configs/epoch` to recover any missed NATS messages.
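
A rough sketch of how live deltas and epoch polling could fit together is shown below. It assumes the epoch is an integer that increases by one per write and that the export snapshot carries `epoch` and `models` keys; both are assumptions for illustration, as are the `RegistryMirror` class and its method names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RegistryMirror:
    """Hypothetical local mirror of the config held by a gateway."""
    epoch: int = 0
    models: dict[str, dict] = field(default_factory=dict)

    def apply_delta(self, epoch: int, model_id: str, config: dict) -> bool:
        """Apply a live delta only if it is the next one in sequence."""
        if epoch != self.epoch + 1:
            # Stale or out-of-order delta: leave recovery to reconcile().
            return False
        self.models[model_id] = config
        self.epoch = epoch
        return True

    def reconcile(
        self,
        fetch_epoch: Callable[[], int],    # e.g. GET /v1/configs/epoch
        fetch_export: Callable[[], dict],  # e.g. GET /v1/configs/export
    ) -> bool:
        """Poll the epoch; reload the full snapshot if the mirror is behind."""
        if fetch_epoch() <= self.epoch:
            return False
        snapshot = fetch_export()
        self.models = dict(snapshot["models"])
        self.epoch = snapshot["epoch"]
        return True
```

The delta path keeps the common case cheap, while the poll bounds how long a missed message can leave a gateway out of date.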

### Worker (sie-server)

Source: [packages/sie_server/src/sie_server/main.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/main.py)

Each worker is a single-GPU inference server running the full pipeline:

1. **Preprocess**: Tokenization and image processing (CPU thread pool)
2. **Batch**: Cost-based batching by token count
3. **GPU Inference**: Model forward pass via adapter (PyTorch, Flash Attention, SGLang)
4. **Postprocess**: Quantization, MUVERA transform (CPU thread pool)
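
To make the batching step concrete, here is a minimal sketch of grouping requests under a token budget. The function name and the greedy strategy are assumptions; the worker's actual cost model may differ.

```python
from typing import Iterable, Iterator


def batch_by_token_budget(
    token_counts: Iterable[int], max_tokens: int
) -> Iterator[list[int]]:
    """Greedily group request indices so each batch stays within max_tokens.

    A single request larger than the budget still gets its own batch.
    """
    batch: list[int] = []
    cost = 0
    for idx, n_tokens in enumerate(token_counts):
        # Close the current batch if adding this request would exceed the budget.
        if batch and cost + n_tokens > max_tokens:
            yield batch
            batch, cost = [], 0
        batch.append(idx)
        cost += n_tokens
    if batch:
        yield batch


batches = list(batch_by_token_budget([100, 300, 200, 900, 50], max_tokens=512))
print(batches)  # [[0, 1], [2], [3], [4]]
```

Budgeting by tokens rather than by request count keeps GPU work per batch roughly even when input lengths vary widely.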

Workers manage multiple models on one GPU with LRU eviction when memory pressure exceeds the threshold.
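
LRU eviction under memory pressure can be sketched with an `OrderedDict`. The `ModelLRU` class, the byte-based accounting, and the 0.9 default threshold are hypothetical; they illustrate the policy, not the worker's actual memory manager.

```python
from collections import OrderedDict


class ModelLRU:
    """Hypothetical tracker for models resident on one GPU."""

    def __init__(self, capacity_bytes: int, pressure_threshold: float = 0.9):
        self.limit = capacity_bytes * pressure_threshold
        self._models: "OrderedDict[str, int]" = OrderedDict()  # id -> size

    def used(self) -> int:
        return sum(self._models.values())

    def touch(self, model_id: str, size_bytes: int) -> list[str]:
        """Record a use of model_id, loading it if needed.

        Returns the ids evicted to make room, least recently used first.
        """
        if model_id in self._models:
            self._models.move_to_end(model_id)  # now the most recently used
            return []
        evicted = []
        while self._models and self.used() + size_bytes > self.limit:
            victim, _ = self._models.popitem(last=False)  # oldest entry
            evicted.append(victim)
        self._models[model_id] = size_bytes
        return evicted
```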

In cluster mode, workers consume queue messages from JetStream and publish results back to the originating gateway over NATS Core. They also subscribe to bundle-scoped config subjects like `sie.config.models.default` so runtime model additions reach the workers directly.

---

## Wire Protocol

Source: [packages/sie_sdk/src/sie_sdk/client/sync.py](https://github.com/superlinked/sie/blob/main/packages/sie_sdk/src/sie_sdk/client/sync.py)

SIE uses **msgpack** as the default wire format instead of JSON:

| Format | Encode speed | Decode speed | Size | Numpy support |
|--------|-------------|-------------|------|---------------|
| msgpack | Fast | Fast | ~50% of JSON | Native via msgpack-numpy |
| JSON | Slower | Slower | Baseline | Requires list conversion |

The SDK sends and receives msgpack automatically. The OpenAI-compatible `/v1/embeddings` endpoint uses JSON for compatibility.

Inside a Kubernetes cluster, gateway-to-worker work items and worker-to-gateway results are msgpack as well. JSON is reserved for low-frequency control-plane APIs and client requests that explicitly negotiate JSON.
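
The size advantage of a binary format for embeddings can be illustrated with the standard library alone. This sketch does not use msgpack: raw float32 bytes from `array` stand in for the binary payload, and the 768-dimensional vector is an arbitrary example.

```python
import json
import random
from array import array

random.seed(0)
embedding = [random.uniform(-1.0, 1.0) for _ in range(768)]

# JSON spells every float out as decimal text.
json_bytes = json.dumps(embedding).encode("utf-8")

# A binary format can carry the same vector as packed float32 values
# (4 bytes each), at the cost of float32 precision.
binary_bytes = array("f", embedding).tobytes()

print(f"JSON: {len(json_bytes)} bytes, binary: {len(binary_bytes)} bytes")
```

Real msgpack payloads add framing and metadata, so the exact ratio depends on the request and response shape.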

---

## Model Cache Hierarchy

Source: [packages/sie_server/src/sie_server/core/model_loader.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/model_loader.py)

Model weights are resolved through a three-tier cache: local disk, then the cluster cache, then HuggingFace Hub.

![Model cache hierarchy: Local Cache, Cluster Cache, HuggingFace Hub](/diagrams/cache-hierarchy.svg)

**Local disk cache** uses LRU eviction when disk usage exceeds `SIE_DISK_PRESSURE_THRESHOLD_PERCENT` (default: 85%).

**Cluster cache** is useful for Kubernetes deployments where multiple workers share the same S3/GCS bucket, avoiding redundant downloads from HuggingFace.
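
The lookup order and the disk-pressure check can be sketched as below. Only the `SIE_DISK_PRESSURE_THRESHOLD_PERCENT` variable and its 85% default come from this page; the function names, the flat file layout, and the `fetch_cluster` / `fetch_hub` callables are assumptions standing in for the S3/GCS and HuggingFace clients.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Optional


def disk_under_pressure(path: Path) -> bool:
    """True when disk usage at `path` exceeds the configured threshold."""
    threshold = float(os.environ.get("SIE_DISK_PRESSURE_THRESHOLD_PERCENT", "85"))
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 > threshold


def resolve_weights(
    model_id: str,
    local_dir: Path,
    fetch_cluster: Callable[[str], Optional[bytes]],  # None on a cache miss
    fetch_hub: Callable[[str], bytes],
) -> tuple[Path, str]:
    """Return the local path of the weights and the tier that served them."""
    target = local_dir / model_id.replace("/", "--")
    if target.exists():
        return target, "local"
    data = fetch_cluster(model_id)
    tier = "cluster"
    if data is None:
        data = fetch_hub(model_id)  # last resort: download from the hub
        tier = "hub"
    target.write_bytes(data)  # populate the local tier for next time
    return target, tier
```

A caller would check `disk_under_pressure` before or after a download and evict least recently used entries until usage drops below the threshold.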

---

## Deployment Modes

### Standalone (Direct)

```text
Client → sie-server (single GPU)
```

The simplest setup: the client connects directly to one server. Suitable for development and small production deployments.

### Multi-Bundle (Docker Compose)

```text
Client → sie-server:8080 (default bundle)
Client → sie-server:8081 (sglang bundle)
```

Multiple containers run side by side, each serving a different bundle. The client selects a bundle by connecting to the corresponding port.
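
Client-side routing in this mode can be as simple as a lookup table. The ports match the example above; the `base_url` helper and the `localhost` default are hypothetical.

```python
# Bundle -> port mapping taken from the Docker Compose example.
BUNDLE_PORTS = {"default": 8080, "sglang": 8081}


def base_url(bundle: str, host: str = "localhost") -> str:
    """Return the base URL of the container serving `bundle`."""
    try:
        port = BUNDLE_PORTS[bundle]
    except KeyError:
        raise ValueError(f"unknown bundle: {bundle!r}") from None
    return f"http://{host}:{port}"


print(base_url("sglang"))  # http://localhost:8081
```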

### Cluster (Kubernetes)

```text
Client → sie-gateway → NATS JetStream → worker pool(s)
Admin → sie-config → config store + NATS config deltas
```

A full production setup with GPU routing, autoscaling, and observability. See the Kubernetes deployment guides for [GCP](/docs/deployment/cloud-gcp/) or [AWS](/docs/deployment/cloud-aws/).

---

## What's Next

- [Request Pipeline](/docs/engine/) - detailed preprocessing, batching, and GPU inference flow
- [Gateway](/docs/engine/router/) - routing, queueing, load balancing, and resource pools
- [Config API](/docs/engine/config-api/) - runtime model config writes and readiness polling
- [Adapters](/docs/engine/adapters/) - compute engine abstraction layer