Gateway
The SIE gateway is a stateless Rust service that sits between clients and GPU workers. It handles routing, queue submission, resource pools, worker health, read-side config, and scale-from-zero orchestration.
The page keeps the /docs/engine/router/ URL for compatibility, but the deployed component is sie-gateway.
When to Use the Gateway
Not every deployment needs a gateway. The deciding factor is whether you are running an elastic worker fleet:
- Single server (local dev, single Docker container): connect the SDK directly to `sie-server`.
- Kubernetes clusters: use the gateway. It provides a stable client endpoint, worker discovery, queue-based inference, scale-from-zero, resource pools, and config read endpoints.
- Horizontal gateway replicas: supported. Each replica keeps its own in-memory registry and converges through bootstrap, NATS config deltas, and epoch polling.
| Setup | Use Gateway? | Why |
|---|---|---|
| Single Docker container | No | Connect the SDK directly to the worker |
| Docker Compose (multi-worker) | Optional | Useful for a single client endpoint in local tests |
| Kubernetes | Yes | Required for worker discovery, queue routing, scale-from-zero, and pool isolation |
Architecture
The gateway is stateless with respect to durable data. It owns in-memory routing state, but it does not persist config and it does not execute inference.
```
Client request
  -> sie-gateway resolves model, bundle, machine profile, and pool
  -> gateway publishes msgpack work items to NATS JetStream
  -> matching workers consume and execute inference
  -> workers publish msgpack results to the gateway's NATS Core inbox
  -> gateway assembles and returns the HTTP response
```
Config writes are outside this hot path. Admin tooling writes to `sie-config`, and gateways mirror that state through `/v1/configs/export`, NATS deltas, and `/v1/configs/epoch` polling.
Request Routing
The gateway resolves every inference request to:
- Model and profile: the model path and optional `:profile` suffix.
- Bundle: selected by adapter compatibility, with the lowest numeric bundle priority winning by default.
- Machine profile: `X-SIE-MACHINE-PROFILE` header or SDK `gpu` parameter.
- Pool: default pool or explicit `X-SIE-Pool` / SDK `pool/profile` target.
- Queue subject: `sie.work.{model}.{pool}` on the pool's JetStream stream.
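The resolution steps above can be sketched in a few lines of Python. This is an illustration only, not the gateway's Rust implementation: the `Bundle` fields, the adapter names, and the helper functions are hypothetical, and the sketch assumes the model and pool names are substituted verbatim into the subject, which this page does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    name: str
    priority: int        # lower number wins by default
    adapters: frozenset  # hypothetical field: adapters this bundle can serve


@dataclass(frozen=True)
class Route:
    model: str
    profile: Optional[str]
    bundle: str
    machine_profile: Optional[str]
    pool: str
    subject: str


def split_model(path: str):
    """Split 'org/model:profile' into (model, profile); profile may be None."""
    model, sep, profile = path.rpartition(":")
    return (model, profile) if sep else (path, None)


def split_target(target: Optional[str]):
    """Split an SDK-style 'pool/profile' target into (pool, machine_profile)."""
    if target is None:
        return "default", None
    pool, sep, machine = target.rpartition("/")
    return (pool, machine) if sep else ("default", target)


def pick_bundle(bundles, adapter: str) -> Bundle:
    """Pick the compatible bundle with the lowest numeric priority."""
    compatible = [b for b in bundles if adapter in b.adapters]
    if not compatible:
        raise LookupError(f"no bundle supports adapter {adapter!r}")
    return min(compatible, key=lambda b: b.priority)


def resolve(path: str, adapter: str, bundles, target: Optional[str] = None) -> Route:
    model, profile = split_model(path)
    pool, machine = split_target(target)
    bundle = pick_bundle(bundles, adapter)
    # Assumption: no escaping is applied to the model or pool name.
    subject = f"sie.work.{model}.{pool}"
    return Route(model, profile, bundle.name, machine, pool, subject)
```

With two compatible bundles at priorities 10 and 20, `resolve("BAAI/bge-m3:fast", "adapter-a", bundles, target="tenant-abc/l4")` picks the priority-10 bundle and targets the `tenant-abc` pool on `l4`.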
Unlike the previous Python router, the Rust gateway is queue-only for inference. There is no direct-HTTP fallback to workers. If the queue transport is unavailable, the gateway returns 503 instead of bypassing the queue.
GPU Routing
Requests can specify a target machine profile:
```bash
# HTTP
curl -X POST http://gateway:8080/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
```
```python
# SDK
result = client.encode("BAAI/bge-m3", Item(text="hello"), gpu="l4")
```
```typescript
// SDK
const result = await client.encode("BAAI/bge-m3", { text: "hello" }, { gpu: "l4" });
```
If the caller omits a machine profile, the gateway can use the default configured route. Scale-from-zero returns 202 when the selected (bundle, machine_profile) has no healthy worker and the caller did not pin an explicit pool.
202 Scale-from-Zero
When no healthy worker is registered for the selected (bundle, machine_profile) tuple and the caller did not pin a specific pool, the gateway returns:
```http
HTTP/1.1 202 Accepted
Retry-After: 120
Content-Type: application/json

{
  "status": "provisioning",
  "gpu": "l4",
  "bundle": "default",
  "estimated_wait_s": 180,
  "message": "No worker available for GPU type 'l4'. Provisioning in progress."
}
```
The SDK handles this automatically with `wait_for_capacity=True`. See Scale-from-Zero for details.
202 is only for capacity provisioning. Unknown models fail fast with 404 once the gateway registry has bootstrapped. Incompatible explicit bundle choices fail with 409.
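For callers that talk to the gateway over plain HTTP instead of the SDK, the 202 flow could be handled with a loop like the one below. This is a minimal sketch, not the SDK's actual retry logic: `call_with_capacity_wait`, the injected `send` callable, the 600-second budget, and the 5-second fallback delay are all hypothetical choices made for illustration.

```python
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, decoded JSON body)
Response = Tuple[int, Mapping[str, str], dict]


def call_with_capacity_wait(
    send: Callable[[], Response],
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry while the gateway answers 202, honoring Retry-After."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status == 200:
            return body
        if status != 202:
            # 404 (unknown model), 409 (incompatible bundle), and 503
            # (queue transport unavailable) are not provisioning states,
            # so this sketch surfaces them instead of retrying.
            raise RuntimeError(f"gateway returned {status}: {body}")
        # Assumption: fall back to 5 seconds if the header is missing.
        delay = float(headers.get("Retry-After", 5))
        if waited + delay > max_wait_s:
            raise TimeoutError(f"no capacity after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

Passing `send` and `sleep` as parameters keeps the loop independent of any particular HTTP client and makes it testable without real waits.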
Worker Discovery
Static Mode
List worker URLs explicitly:
```bash
sie-gateway serve \
  -w http://worker-1:8080 \
  -w http://worker-2:8080 \
  -w http://worker-3:8080
```
Kubernetes Mode
Auto-discover workers via Kubernetes service endpoints:
```bash
sie-gateway serve \
  --kubernetes \
  --k8s-namespace sie \
  --k8s-service sie-worker \
  --k8s-port 8080
```
In Kubernetes mode, the gateway watches endpoint changes and automatically registers or deregisters workers. Worker status is then tracked over WebSocket (`/ws/status`) so the gateway sees bundle, machine profile, queue depth, loaded models, health, and config hash.
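To make the registry behavior concrete, the following Python sketch models an in-memory worker registry that is updated by endpoint changes and status reports. It is an assumption-laden illustration: the `WorkerStatus` field names mirror the attributes listed above but are not the gateway's actual wire format, and `WorkerRegistry` with its methods is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerStatus:
    # Hypothetical field names for the attributes the gateway tracks.
    bundle: str = ""
    machine_profile: str = ""
    queue_depth: int = 0
    loaded_models: list = field(default_factory=list)
    healthy: bool = False
    config_hash: str = ""


class WorkerRegistry:
    def __init__(self):
        self._workers = {}  # worker URL -> latest WorkerStatus

    def register(self, url: str) -> None:
        # A new endpoint stays unhealthy until its first status report.
        self._workers.setdefault(url, WorkerStatus())

    def deregister(self, url: str) -> None:
        self._workers.pop(url, None)

    def update_status(self, url: str, status: WorkerStatus) -> None:
        # Ignore reports from endpoints that are not registered.
        if url in self._workers:
            self._workers[url] = status

    def healthy_workers(self, bundle: str, machine_profile: str) -> list:
        return sorted(
            url
            for url, s in self._workers.items()
            if s.healthy and s.bundle == bundle and s.machine_profile == machine_profile
        )
```

An empty result from `healthy_workers("default", "l4")` corresponds to the condition under which the gateway answers 202 for an unpinned request.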
Resource Pools
Resource pools reserve dedicated workers for tenant isolation. Pool workers only serve requests for that pool.
Create a Pool
```python
client = SIEClient("http://gateway:8080")

# Reserve 2 L4 workers for this tenant
client.create_pool("tenant-abc", {"l4": 2})

# Route requests to the pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="tenant-abc/l4"  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")

# Cleanup
client.delete_pool("tenant-abc")
```
Pool Lifecycle
- Pools are represented in Kubernetes `ConfigMaps` and `Leases`.
- The SDK renews pool leases automatically in a background thread.
- Pools expire after their TTL unless renewed.
- The `default` pool is protected and cannot be deleted.
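The TTL-and-renewal rule can be illustrated with a small model. This sketch does not use the Kubernetes Lease API or the SDK; `PoolLease`, `can_delete`, and the 60-second TTL in the usage note are hypothetical and only demonstrate how renewal extends the expiry deadline.

```python
import time


class PoolLease:
    """A lease that expires ttl_s seconds after its last renewal."""

    def __init__(self, pool: str, ttl_s: float, clock=time.monotonic):
        self.pool = pool
        self.ttl_s = ttl_s
        self._clock = clock
        self._expires_at = clock() + ttl_s

    def renew(self) -> None:
        # Renewal pushes the deadline out by one full TTL from now.
        self._expires_at = self._clock() + self.ttl_s

    def expired(self) -> bool:
        return self._clock() >= self._expires_at


def can_delete(pool: str) -> bool:
    # The default pool is protected and cannot be deleted.
    return pool != "default"
```

With a 60-second TTL, a lease renewed at t=59 stays valid until t=119, while one that is never renewed expires at t=60. A background renewer therefore has to run at an interval shorter than the TTL.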
Config Read Surface
The gateway serves read-side config endpoints from its in-memory registry:
| Endpoint | Purpose |
|---|---|
| `GET /v1/configs/models` | List models known to this gateway |
| `GET /v1/configs/models/{id}` | Return model YAML from the gateway registry |
| `GET /v1/configs/models/{id}/status` | Report per-replica worker ACK readiness |
| `GET /v1/configs/bundles` | List known bundles and connected worker counts |
| `GET /v1/configs/bundles/{id}` | Return bundle YAML |
| `POST /v1/configs/resolve` | Dry-run model-to-bundle routing, optionally with an explicit bundle override |
The gateway is not a config write authority. POST /v1/configs/models is not registered on the gateway and returns 405 Method Not Allowed; send writes to sie-config.
Bootstrap and Recovery
On startup, the gateway:
- Optionally loads filesystem seeds from `SIE_BUNDLES_DIR` and `SIE_MODELS_DIR` if an escape-hatch config map is mounted.
- Reads `GET /v1/configs/epoch` to capture the authoritative epoch and bundle-set hash.
- Fetches bundles from `sie-config` with `GET /v1/configs/bundles{,/{id}}`.
- Fetches model state with `GET /v1/configs/export`.
- Subscribes to `sie.config.models._all` for live deltas.
- Polls `GET /v1/configs/epoch` every 30 seconds to catch missed deltas or bundle-set drift.
/readyz does not wait for sie-config. A fresh gateway can be ready before the first config bootstrap succeeds; during that window, typed requests may return 404 until the registry is populated.
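The periodic epoch poll amounts to comparing the gateway's local view against the authoritative one. The sketch below shows one plausible decision rule; it is an assumption, since this page does not specify the response fields of `/v1/configs/epoch` or the exact resync policy. `ConfigMark`, `plan_resync`, and the action names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigMark:
    epoch: int         # config epoch
    bundles_hash: str  # hash of the bundle set


def plan_resync(local: ConfigMark, remote: ConfigMark) -> list:
    """Return the fetches a poll tick would trigger, in order."""
    actions = []
    if remote.bundles_hash != local.bundles_hash:
        # Bundle-set drift: refetch bundles (GET /v1/configs/bundles).
        actions.append("fetch_bundles")
    if remote.epoch > local.epoch:
        # Missed deltas: refetch model state (GET /v1/configs/export).
        actions.append("fetch_export")
    return actions
```

When both marks match, the tick is a no-op, which keeps the 30-second poll cheap in the steady state where NATS deltas arrive normally.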
Health & Status
The gateway aggregates health from all workers:
| Endpoint | Description |
|---|---|
| `GET /healthz` | Gateway liveness |
| `GET /readyz` | Gateway readiness; intentionally independent of `sie-config` reachability |
| `GET /health` | Cluster summary: worker count, GPU count, models loaded |
| `GET /v1/models` | Model list from the gateway registry |
| `WS /ws/cluster-status` | Real-time cluster metrics stream |
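A monitoring script might consume the `GET /health` summary as sketched below. The payload shape is taken from the Cluster Health Example on this page; reading the difference between `configured_gpu_types` and `live_gpu_types` as "GPU types currently without a live worker" is an assumption, and `idle_gpu_types` is a hypothetical helper.

```python
import json


def idle_gpu_types(health: dict) -> list:
    """GPU types that are configured but have no live worker in the summary."""
    configured = health.get("configured_gpu_types", [])
    live = set(health.get("live_gpu_types", []))
    return [gpu for gpu in configured if gpu not in live]


# Sample payload copied from the Cluster Health Example.
payload = json.loads("""
{"status": "healthy", "worker_count": 3, "gpu_count": 3, "models_loaded": 12,
 "configured_gpu_types": ["l4", "a100-80gb"], "live_gpu_types": ["l4"]}
""")
print(idle_gpu_types(payload))  # ['a100-80gb']
```

Under that reading, a request that targets `a100-80gb` without pinning a pool would be a candidate for the 202 scale-from-zero response.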
Cluster Health Example
```bash
curl http://gateway:8080/health
```
```json
{
  "status": "healthy",
  "worker_count": 3,
  "gpu_count": 3,
  "models_loaded": 12,
  "configured_gpu_types": ["l4", "a100-80gb"],
  "live_gpu_types": ["l4"]
}
```
Metrics
Important gateway metrics include:
| Metric | Purpose |
|---|---|
| `sie_gateway_requests_total` | HTTP requests by endpoint, status, and machine profile |
| `sie_gateway_request_latency_seconds` | Gateway request latency |
| `sie_gateway_pending_demand` | KEDA scale-from-zero trigger by machine profile and bundle |
| `sie_gateway_worker_queue_depth` | Per-worker queue depth |
| `sie_gateway_config_epoch` | Highest config epoch applied on this gateway |
| `sie_gateway_config_bootstrap_degraded` | Whether bootstrap has been failing long enough to alert |
| `sie_gateway_config_deltas_total` | NATS config-delta processing outcomes |
| `sie_gateway_nats_connected` | Gateway NATS connection state |
What’s Next
- Scale-from-Zero - the 202 flow and cold start handling
- Config API - runtime config writes and gateway readiness polling
- Kubernetes in GCP - full deployment with the gateway
- Monitoring - metrics and dashboards