---
title: Troubleshooting
description: Common issues and solutions for SIE server, SDK, and deployments.
canonical_url: https://superlinked.com/docs/reference/troubleshooting
last_updated: 2026-05-19
---

## Connection Issues

### Connection refused / timeouts

**Symptoms:** `ConnectionError`, `ECONNREFUSED`, or request timeouts.

**Causes and fixes:**
- **Server not running** - Start with `docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default` or `sie-server serve`
- **Wrong port** - Default is 8080. Check with `curl http://localhost:8080/healthz`
- **Firewall/security group** - Ensure port 8080 is open for your network
- **Docker networking** - Use `--network host` or ensure port mapping is correct (`-p 8080:8080`)
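
To tell these causes apart from a script, a minimal sketch like the following probes the `/healthz` endpoint mentioned above and labels the outcome. It uses only the Python standard library; the function name `probe_health` and the label strings are illustrative assumptions, not part of the SIE SDK.

```python
import socket
import urllib.error
import urllib.request


def probe_health(base_url: str = "http://localhost:8080", timeout_s: float = 3.0) -> str:
    """Classify the result of GET <base_url>/healthz as a short label.

    Hypothetical helper for illustration; not part of the SIE SDK.
    """
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return f"reachable (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The server answered, so the network path works; inspect the status.
        return f"reachable (HTTP {exc.code})"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused: nothing is listening on that host/port"
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout: host did not answer (firewall or wrong address?)"
        return f"unreachable: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # Read timeouts can surface without the URLError wrapper.
        return "timeout: host did not answer (firewall or wrong address?)"
```

A `refused` result usually points at the first two causes (server not running, wrong port), while a `timeout` is more typical of a firewall or security group silently dropping traffic.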

### 503 for unconfigured machine profile

**Context:** Kubernetes deployment with gateway.

**Cause:** The request pins a machine profile that is not configured in the cluster, or the queue/config path needed to route the request is unavailable. Normal scale-from-zero returns `202 Accepted`, not `503`.

**Fix:** Use a configured machine profile:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'
```

Or omit the machine profile (the `gpu` argument in the SDK) and let the gateway resolve the model's default route:

#### Python

```python
result = client.encode("BAAI/bge-m3", Item(text="hello"))
```

#### TypeScript

```typescript
const result = await client.encode("BAAI/bge-m3", { text: "hello" });
```

See [Scale-from-Zero](/docs/deployment/autoscaling/) for the full autoscaling flow.

### 202 responses that never resolve

**Context:** Kubernetes with KEDA scale-to-zero.

**Causes:**
- **Timeout too short** - Cold starts take 5-7 minutes. Set `provision_timeout_s=420` (7 minutes) so the client keeps waiting through the full cold start
- **Spot GPUs unavailable** - Try on-demand (`l4` instead of `l4-spot`)
- **KEDA not running** - Check: `kubectl get pods -n keda`
- **Prometheus unreachable** - KEDA needs metrics: `kubectl get pods -n monitoring`
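
For clients that call the HTTP API directly instead of using the SDK, the timeout cause can be illustrated with a small polling loop. This is a sketch under stated assumptions: `send` is a hypothetical callable that issues one request and returns its HTTP status code, and the 5-second interval is an arbitrary choice. The SDK's `wait_for_capacity=True` remains the recommended route.

```python
import time
from typing import Callable


def poll_until_ready(
    send: Callable[[], int],
    timeout_s: float = 420.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call send() until it returns a status other than 202.

    `send` is a hypothetical callable returning an HTTP status code.
    Raises TimeoutError if the next retry would land past the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = send()
        if status != 202:
            return status  # capacity is up (or a real error occurred)
        if clock() + interval_s > deadline:
            raise TimeoutError(f"still receiving 202 after {timeout_s:.0f}s")
        sleep(interval_s)
```

Continuing to send requests for the whole window also keeps demand visible while the worker starts. A deadline shorter than the cold start makes the loop give up before a worker can become ready.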

#### Python

```python
# Recommended: use the SDK with a generous timeout
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)
```

#### TypeScript

```typescript
// Recommended: use the SDK with a generous timeout
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  { gpu: "l4", waitForCapacity: true, provisionTimeout: 420_000 },
);
```

---

## Model Issues

### Model not found

**Symptoms:** `404 Not Found` or "model not available" error.

**Causes and fixes:**
- **Wrong model name** - Use the SIE model ID (e.g., `BAAI/bge-m3`), not a custom alias. Check available models: `curl http://localhost:8080/v1/models`
- **Wrong bundle** - Most models (including GLiNER and Florence-2) run on the `default` bundle; large LLM embeddings require the `sglang` bundle. See [Bundles](/docs/engine/bundles/)
- **Model filter active** - If `SIE_MODEL_FILTER` is set, only listed models are available
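
When the cause is a mistyped model name, a fuzzy match against the available IDs can suggest the intended one. The sketch below is illustrative only: the response schema of `/v1/models` is not described here, so the helper takes a plain list of IDs (however you obtain it), and both `suggest_model_ids` and the sample list are hypothetical.

```python
import difflib
from typing import List


def suggest_model_ids(requested: str, available: List[str], limit: int = 3) -> List[str]:
    """Return available model IDs that resemble `requested`, best match first."""
    if requested in available:
        return [requested]
    # Compare case-insensitively, then map back to the original spelling.
    by_lower = {model_id.lower(): model_id for model_id in available}
    matches = difflib.get_close_matches(
        requested.lower(), list(by_lower), n=limit, cutoff=0.6
    )
    return [by_lower[match] for match in matches]


# Illustrative list; real IDs come from your server's model listing.
available_ids = ["BAAI/bge-m3", "BAAI/bge-large-en-v1.5"]
print(suggest_model_ids("bge-m3", available_ids))  # ['BAAI/bge-m3']
```

An empty result means nothing similar is available, which points toward the bundle or model-filter causes rather than a typo.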

### Model loading is slow

**Context:** First request to a model takes a long time.

**Expected behavior:** Models load on demand. The first request downloads the weights (if not cached) and loads them onto the GPU. Subsequent requests are fast.

| Scenario | Expected Time |
|----------|--------------|
| Weights cached, loading to GPU | 10-30s (small model), 30-120s (large model) |
| Downloading from HuggingFace | 1-10 minutes depending on model size and network |
| Downloading from cluster cache (S3/GCS) | 30s-3 minutes |

**Speed up loading:**
- Mount a persistent HuggingFace cache: `-v ~/.cache/huggingface:/app/.cache/huggingface`
- Use cluster cache: `SIE_CLUSTER_CACHE=s3://bucket/weights`
- Pre-warm models by sending a dummy request at startup
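
The pre-warming step can be sketched as a small startup routine. To avoid assuming SDK import paths, the helper takes an `encode` callable; `prewarm` and the `"warmup"` text are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def prewarm(
    models: Iterable[str],
    encode: Callable[[str, str], object],
) -> Dict[str, Optional[float]]:
    """Send one dummy request per model.

    Returns seconds taken per model, or None where the request failed,
    so one broken model does not block warming the others.
    """
    timings: Dict[str, Optional[float]] = {}
    for model in models:
        start = time.perf_counter()
        try:
            encode(model, "warmup")  # triggers download + load on first use
            timings[model] = time.perf_counter() - start
        except Exception:
            timings[model] = None
    return timings
```

With the SDK calls shown elsewhere on this page, `encode` could be `lambda model, text: client.encode(model, Item(text=text))`.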

---

## GPU Issues

### Docker GPU not detected

**Symptoms:** Server falls back to CPU, or `--gpus all` fails.

**Fixes:**
1. Install NVIDIA Container Toolkit:
   ```bash
   # Ubuntu/Debian
   sudo apt-get install -y nvidia-container-toolkit
   sudo systemctl restart docker
   ```
2. Verify GPU access:
   ```bash
   docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
   ```
3. Start the server with the `--gpus all` flag and a CUDA image (not the `cpu` image):
   ```bash
   docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
   ```

### Out of memory (OOM)

**Symptoms:** `CUDA out of memory`, process killed, or pod evicted.

**Causes and fixes:**
- **Model too large for GPU** - Check model size vs GPU VRAM in [Resources](/docs/deployment/resources/)
- **Too many models loaded** - Lower `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT` (default: 85) to trigger eviction earlier
- **Batch size too large** - Reduce `SIE_MAX_BATCH_REQUESTS` (default: 64)
- **Memory leak** - Restart the server; report the issue if reproducible
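
As a rough first check for the "model too large" case, weight memory can be estimated from the parameter count. This is a back-of-the-envelope sketch, not how SIE accounts for memory: it assumes 2 bytes per parameter (fp16/bf16) and an arbitrary 20% overhead factor, and it ignores activation and batch memory. The function names are hypothetical; the figures in [Resources](/docs/deployment/resources/) take precedence.

```python
def estimate_weight_gib(
    num_params: float,
    bytes_per_param: int = 2,   # assumption: fp16/bf16 weights
    overhead: float = 1.2,      # assumption: arbitrary 20% headroom
) -> float:
    """Rough VRAM needed for model weights, in GiB."""
    return num_params * bytes_per_param * overhead / 2**30


def fits_in_vram(num_params: float, vram_gib: float) -> bool:
    """True if the rough weight estimate is below the given VRAM size."""
    return estimate_weight_gib(num_params) < vram_gib


# A hypothetical 7-billion-parameter model in fp16:
print(f"{estimate_weight_gib(7e9):.1f} GiB")  # 15.6 GiB
```

If the estimate is already close to the GPU's VRAM, lowering batch size or the eviction threshold is unlikely to help, and a larger GPU is the more realistic fix.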

### Slow inference

**Possible causes:**
- **CPU fallback** - Server may be running on CPU. Check with `sie-top` or WebSocket status
- **Wrong attention backend** - Flash Attention 2 is fastest on Ampere or newer GPUs. Set `SIE_ATTENTION_BACKEND=flash_attention_2`
- **Small batches** - Low concurrency means small batches. Increase `SIE_MAX_BATCH_WAIT_MS` to give batches more time to fill, at the cost of added latency per request
- **Preprocessing bottleneck** - For vision models, increase `SIE_IMAGE_WORKERS` (default: 4)

---

## LoRA Issues

### LoRA loading timeout

**Symptoms:** Request hangs or times out when using a LoRA adapter.

**Causes:**
- **LoRA too large** - Large adapters take longer to download and load
- **Incompatible base model** - LoRA must match the base model architecture
- **Cache full** - More adapters in use than `SIE_MAX_LORAS_PER_MODEL` (default: 10) allows, so adapters are repeatedly evicted and reloaded
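
The cost of a full cache can be illustrated with a toy model. The sketch below assumes least-recently-used (LRU) eviction, which this page does not confirm for SIE, and counts loads instead of performing them; `AdapterCache` and `count_loads` are hypothetical names.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache that counts how often an adapter must be (re)loaded."""

    def __init__(self, capacity: int = 10) -> None:
        self.capacity = capacity
        self.loaded: "OrderedDict[str, bool]" = OrderedDict()
        self.loads = 0

    def get(self, adapter_id: str) -> None:
        if adapter_id in self.loaded:
            self.loaded.move_to_end(adapter_id)  # mark as most recently used
            return
        self.loads += 1  # cache miss: stands in for a slow download + load
        self.loaded[adapter_id] = True
        if len(self.loaded) > self.capacity:
            self.loaded.popitem(last=False)  # evict the least recently used


def count_loads(num_adapters: int, rounds: int, capacity: int = 10) -> int:
    """Cycle through the adapters `rounds` times and return total loads."""
    cache = AdapterCache(capacity)
    for _ in range(rounds):
        for i in range(num_adapters):
            cache.get(f"adapter-{i}")
    return cache.loads


print(count_loads(10, rounds=5))  # 10: every adapter is loaded once
print(count_loads(11, rounds=5))  # 55: every single request reloads
```

Under this assumption, cycling through just one adapter more than the limit turns every request into a reload, which would show up as the hangs and timeouts described above.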

### LoRA adapter not found

**Fix:** Ensure `lora_id` is a valid HuggingFace repo ID in `owner/name` form:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    options={"lora_id": "username/my-lora-adapter"}
)
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  // `options` is supported at the wire level; cast until the TS type adds it.
  { options: { lora_id: "username/my-lora-adapter" } } as never,
);
```

---

## Gated Model Access

### "Access denied" or 401 for gated models

**Cause:** Some HuggingFace models require manual approval and a token.

**Fixes:**
1. Accept the model's license on HuggingFace (visit the model page)
2. Set your HuggingFace token:
   ```bash
   # Docker
   docker run --gpus all -p 8080:8080 \
     -e HF_TOKEN=hf_your_token_here \
     ghcr.io/superlinked/sie-server:latest-cuda12-default

   # Local
   export HF_TOKEN=hf_your_token_here
   sie-server serve
   ```
3. For Kubernetes, create a secret:
   ```bash
   kubectl create secret generic hf-token \
     --from-literal=token=hf_your_token_here \
     -n sie
   ```

---

## Kubernetes Issues

### Workers scale up then immediately down

**Cause:** Requests stopped before the worker finished cold start. KEDA sees demand drop to 0.

**Fix:** Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with `wait_for_capacity=True`.

### Different bundles not scaling

**Context:** Default-bundle requests work fine, but requests routed to a different bundle worker (e.g. `sglang`) return only 202s.

**Cause:** Each bundle scales independently. A warm `default` worker does not make an `sglang` worker ready; each bundle's worker pool cold-starts on its own.

**Fix:** Send the request with `wait_for_capacity=True` and `provision_timeout_s=420`. The target bundle's worker pool will scale up independently.

### Pods stuck in Pending

**Causes:**
- **No GPU quota** - Check the Events section of `kubectl describe pod <pod-name> -n sie` for scheduling failures
- **Node pool at max** - Increase `maxReplicas` in Helm values
- **Spot unavailable** - Switch to on-demand instances

---

## Getting Help

If your issue isn't covered here:

1. Check server logs: `docker logs <container>` or `kubectl logs -n sie -l app.kubernetes.io/component=worker`
2. Use `sie-top` for real-time monitoring
3. Open an issue on [GitHub](https://github.com/superlinked/sie/issues)