Troubleshooting

Connection Issues

Connection refused / timeouts

Symptoms: ConnectionError, ECONNREFUSED, or request timeouts.

Causes and fixes:

Server not running - Start it with docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default (Docker) or sie-server serve (local)
Wrong port - Default is 8080. Check with curl http://localhost:8080/healthz
Firewall/security group - Ensure port 8080 is open for your network
Docker networking - Use --network host or ensure port mapping is correct (-p 8080:8080)
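
The checks above can be scripted. The following is a minimal sketch using only the Python standard library, assuming the server exposes /healthz on port 8080 as described above. The helper name check_health and the wording of the hints are illustrative assumptions, not part of the SIE SDK.

```python
import socket
import urllib.error
import urllib.request


def check_health(base_url: str = "http://localhost:8080", timeout_s: float = 3.0) -> str:
    """Probe /healthz and return a short, human-readable diagnosis."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return f"ok: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, so the port is reachable; only the status is bad.
        return f"reachable but unhealthy: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused: nothing is listening (server down or wrong port mapping)"
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout: a firewall or security group may be dropping traffic"
        return f"unreachable: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # Timeouts while reading the response are raised without a URLError wrapper.
        return "timeout: a firewall or security group may be dropping traffic"
```

Calling check_health() with no arguments probes the default local address. Pass a different base URL to test a remote host or a non-default port.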

503 for unconfigured machine profile

Context: Kubernetes deployment with gateway.

Cause: The request pins a machine profile that is not configured in the cluster, or the queue/config path needed to route the request is unavailable. Normal scale-from-zero returns 202 Accepted, not 503.

Fix: Use a configured machine profile:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "Hello world"}]}'

Or omit the machine profile (the gpu argument in the SDKs) and let the gateway resolve the model’s default route:

Python:

result = client.encode("BAAI/bge-m3", Item(text="hello"))

TypeScript:

const result = await client.encode("BAAI/bge-m3", { text: "hello" });

See Scale-from-Zero for the full autoscaling flow.
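
For callers that talk to the gateway over raw HTTP, the difference between pinning and not pinning a profile is a single header. The sketch below mirrors the curl example above (same path, header name, and body shape) using the Python standard library. The function name build_encode_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional


def build_encode_request(
    base_url: str,
    model: str,
    text: str,
    machine_profile: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST to /v1/encode/<model>, pinning a profile only when asked."""
    headers = {"Content-Type": "application/json"}
    if machine_profile is not None:
        # Pinning: the profile must be one that is configured in the cluster.
        headers["X-SIE-MACHINE-PROFILE"] = machine_profile
    body = json.dumps({"items": [{"text": text}]}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/v1/encode/{model}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing machine_profile=None leaves the header out entirely, which is the raw-HTTP equivalent of omitting gpu in the SDK calls.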

202 responses that never resolve

Context: Kubernetes with KEDA scale-to-zero.

Causes:

Timeout too short - Cold starts take 5-7 minutes. Set provision_timeout_s=420 (7 minutes, the upper end of that range)
Spot GPUs unavailable - Try on-demand (l4 instead of l4-spot)
KEDA not running - Check: kubectl get pods -n keda
Prometheus unreachable - KEDA needs metrics: kubectl get pods -n monitoring

Python:

# Recommended: use SDK with generous timeout
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    gpu="l4",
    wait_for_capacity=True,
    provision_timeout_s=420,
)

TypeScript:

// Recommended: use SDK with generous timeout
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  { gpu: "l4", waitForCapacity: true, provisionTimeout: 420_000 },
);
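
If you call the HTTP API directly instead of using the SDK, you need your own retry loop around 202 responses. The sketch below shows one way to structure it, assuming a caller-supplied send function that issues the request and returns the HTTP status code. It is not the SDK's implementation, and wait_for_ready and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_for_ready(
    send: Callable[[], int],
    provision_timeout_s: float = 420.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Re-send a request while the gateway answers 202, up to a deadline."""
    deadline = clock() + provision_timeout_s
    while True:
        status = send()
        if status != 202:
            # Anything other than 202 is final for this loop: success, or an
            # error such as 503 that waiting will not fix.
            return status
        if clock() + poll_interval_s > deadline:
            raise TimeoutError(
                f"still receiving 202 after {provision_timeout_s:.0f}s; "
                "check KEDA, Prometheus, and GPU availability"
            )
        sleep(poll_interval_s)
```

The sleep and clock parameters exist only so the loop can be tested without real waiting. In normal use, leave them at their defaults.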

Model Issues

Model not found

Symptoms: 404 Not Found or “model not available” error.

Causes and fixes:

Wrong model name - Use the SIE model ID (e.g., BAAI/bge-m3), not a custom alias. Check available models: curl http://localhost:8080/v1/models
Wrong bundle - Most models (including GLiNER and Florence-2) run on the default bundle; large LLM-based embedding models require the sglang bundle. See Bundles
Model filter active - If SIE_MODEL_FILTER is set, only listed models are available
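
A common cause is a near-miss in the model ID, such as an underscore where the ID has a hyphen. The sketch below compares a requested ID with the parsed JSON from /v1/models and suggests close matches. The response schema is not described here, so the code makes no assumption about it and scans every string in the payload. check_model and iter_strings are hypothetical helper names.

```python
import difflib
from typing import Any, Iterator, List, Tuple


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string found anywhere in a parsed JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield str(key)
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def check_model(payload: Any, model_id: str) -> Tuple[bool, List[str]]:
    """Return (found, suggestions) for model_id against a /v1/models payload."""
    seen = set(iter_strings(payload))
    if model_id in seen:
        return True, []
    # Not found: offer up to three similar-looking strings as likely typos.
    return False, difflib.get_close_matches(model_id, sorted(seen), n=3, cutoff=0.6)
```

To use it, load the output of curl http://localhost:8080/v1/models with json.loads and pass the result as payload.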

Model loading is slow

Context: First request to a model takes a long time.

Expected behavior: Models load on demand. The first request downloads the weights (if not cached) and loads them onto the GPU. Subsequent requests are fast.

Scenario	Expected Time
Weights cached, loading to GPU	10-30s (small model), 30-120s (large model)
Downloading from HuggingFace	1-10 minutes depending on model size and network
Downloading from cluster cache (S3/GCS)	30s-3 minutes

Speed up loading:

Mount a persistent HuggingFace cache: -v ~/.cache/huggingface:/app/.cache/huggingface
Use cluster cache: SIE_CLUSTER_CACHE=s3://bucket/weights
Pre-warm models by sending a dummy request at startup
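
Pre-warming can be a few lines run once at application startup. The sketch below takes the encode call as a parameter, so it makes no assumption about how the client is constructed. The function name prewarm is hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def prewarm(
    models: Iterable[str],
    encode: Callable[[str], object],
) -> Dict[str, Optional[float]]:
    """Send one throwaway request per model; map each model to its load time."""
    timings: Dict[str, Optional[float]] = {}
    for model in models:
        start = time.perf_counter()
        try:
            encode(model)  # the result is discarded; the call only triggers loading
            timings[model] = time.perf_counter() - start
        except Exception as exc:
            # Keep warming the remaining models even if one of them fails.
            print(f"warm-up failed for {model}: {exc}")
            timings[model] = None
    return timings
```

With the SDK client from the examples on this page, a call could look like prewarm(["BAAI/bge-m3"], lambda m: client.encode(m, Item(text="warmup"))). The returned timings can be compared against the table above to tell a cached load from a download.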

GPU Issues

Docker GPU not detected

Symptoms: Server falls back to CPU, or --gpus all fails.

Fixes:

Install NVIDIA Container Toolkit:

# Ubuntu/Debian
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Verify GPU access:

docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Use the --gpus all flag:

docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default
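
Before debugging Docker, it can help to confirm that the host driver sees a GPU at all. The sketch below runs nvidia-smi -L on the host and keeps the lines that describe a device. It assumes nvidia-smi is on the PATH when the driver is installed. host_gpus and parse_gpu_list are hypothetical helper names.

```python
import shutil
import subprocess
from typing import List


def parse_gpu_list(output: str) -> List[str]:
    """Keep the lines of `nvidia-smi -L` output that describe a GPU."""
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith("GPU ")
    ]


def host_gpus() -> List[str]:
    """Return the GPUs visible to the host driver, or [] if none can be listed."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed on the host
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return parse_gpu_list(result.stdout) if result.returncode == 0 else []
```

If host_gpus() returns an empty list, the problem is likely on the host (driver or hardware) rather than in the container toolkit. If it lists devices but the docker run check above fails, the toolkit or the Docker configuration is the more likely culprit.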

Out of memory (OOM)

Symptoms: CUDA out of memory, process killed, or pod evicted.

Causes and fixes:

Model too large for GPU - Check model size vs GPU VRAM in Resources
Too many models loaded - Lower SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT (default: 85) to trigger eviction earlier
Batch size too large - Reduce SIE_MAX_BATCH_REQUESTS (default: 64)
Memory leak - Restart the server; report the issue if reproducible
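
The first two causes come down to arithmetic. The sketch below gives a rough lower bound, assuming 2 bytes per parameter (fp16/bf16 weights) and ignoring activations, batch buffers, and framework overhead. Treating the 85% threshold as a budget for weights alone is a simplification made for illustration, not a description of how the server accounts for memory.

```python
from typing import Sequence

GIB = 1024 ** 3


def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def fits_under_threshold(
    param_counts: Sequence[int],
    vram_gib: float,
    threshold_percent: float = 85.0,
    bytes_per_param: int = 2,
) -> bool:
    """Check whether the combined weights stay below the pressure threshold."""
    budget_gib = vram_gib * threshold_percent / 100.0
    needed_gib = sum(weights_gib(n, bytes_per_param) for n in param_counts)
    return needed_gib <= budget_gib
```

For example, a hypothetical 1-billion-parameter model needs about 1.86 GiB for fp16 weights, so a 24 GiB card at the default threshold has a budget of 20.4 GiB. Real usage is higher than this estimate, so treat a narrow pass as a likely OOM.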

Slow inference

Possible causes:

CPU fallback - Server may be running on CPU. Check with sie-top or WebSocket status
Wrong attention backend - Flash Attention 2 is usually the fastest option on Ampere or newer GPUs. Set SIE_ATTENTION_BACKEND=flash_attention_2
Small batches - Low concurrency means small batches. Increase SIE_MAX_BATCH_WAIT_MS to wait longer for batch fill
Preprocessing bottleneck - For vision models, increase SIE_IMAGE_WORKERS (default: 4)
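
A back-of-envelope model makes the small-batches point concrete. It assumes requests arrive at a steady average rate, and that a batch opens with one request and then collects whatever arrives during the wait window, up to the request cap. This is a simplified model of dynamic batching in general, not a description of the SIE batcher's internals.

```python
def expected_batch_size(
    requests_per_s: float,
    max_wait_ms: float,
    max_batch: int = 64,
) -> float:
    """Estimate the average batch size for a given arrival rate and wait window."""
    # Requests expected to arrive while the window is open.
    joined = requests_per_s * max_wait_ms / 1000.0
    # The request that opened the window counts too; the cap bounds the total.
    return min(float(max_batch), 1.0 + joined)
```

Under this model, at 100 requests per second a 10 ms window averages about 2 requests per batch and a 50 ms window about 6. The trade-off is that every request may wait up to the full window before inference starts.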

LoRA Issues

LoRA loading timeout

Symptoms: Request hangs or times out when using a LoRA adapter.

Causes:

LoRA too large - Large adapters take longer to download and load
Incompatible base model - LoRA must match the base model architecture
Cache full - More than SIE_MAX_LORAS_PER_MODEL (default: 10) adapters are in use for one model, so adapters are evicted and reloaded on later requests
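
The cache-full case is easier to see with a toy model. The sketch below assumes least-recently-used eviction, which is an assumption made for illustration; the server's actual eviction policy is not stated here. AdapterCache is a hypothetical class, and load stands in for the slow download-and-load step.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """Toy bounded cache that counts how often adapters must be (re)loaded."""

    def __init__(self, capacity: int, load: Callable[[str], object]) -> None:
        self.capacity = capacity
        self.load = load
        self.entries: "OrderedDict[str, object]" = OrderedDict()
        self.loads = 0

    def get(self, adapter_id: str) -> object:
        if adapter_id in self.entries:
            # Cache hit: mark as most recently used, no load needed.
            self.entries.move_to_end(adapter_id)
            return self.entries[adapter_id]
        if len(self.entries) >= self.capacity:
            # Cache full: drop the least recently used adapter.
            self.entries.popitem(last=False)
        self.entries[adapter_id] = self.load(adapter_id)
        self.loads += 1
        return self.entries[adapter_id]
```

Cycling through one more adapter than the capacity allows makes every request a miss under this policy, so each call pays the full load cost. If your workload rotates through more adapters than the limit, that pattern shows up as repeated slow requests.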

LoRA adapter not found

Fix: Ensure the LoRA ID is a valid HuggingFace repo ID:

Python:

result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    options={"lora_id": "username/my-lora-adapter"}
)

TypeScript:

const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  // `options` is supported at the wire level; cast until the TS type adds it.
  { options: { lora_id: "username/my-lora-adapter" } } as never,
);

Gated Model Access

“Access denied” or 401 for gated models

Cause: Some HuggingFace models require manual approval and a token.

Fixes:

Accept the model’s license on HuggingFace (visit the model page)

Set your HuggingFace token:

# Docker
docker run --gpus all -p 8080:8080 \
  -e HF_TOKEN=hf_your_token_here \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

# Local
export HF_TOKEN=hf_your_token_here
sie-server serve

For Kubernetes, create a secret:

kubectl create secret generic hf-token \
  --from-literal=token=hf_your_token_here \
  -n sie

Kubernetes Issues

Workers scale up then immediately down

Cause: Requests stopped before the worker finished its cold start. KEDA sees demand drop to 0 and scales the worker back down.

Fix: Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with wait_for_capacity=True.

Different bundles not scaling

Context: Default-bundle requests work fine, but requests routed to a different bundle worker (e.g. sglang) return only 202s.

Cause: Each bundle scales independently. A warm default worker does not make an sglang worker ready; each bundle’s worker pool cold-starts on its own.

Fix: Send the request with wait_for_capacity=True and provision_timeout_s=420. The target bundle’s worker pool will scale up independently.

Pods stuck in Pending

Causes:

No GPU quota - Check: kubectl describe pod <pod-name> -n sie
Node pool at max - Increase maxReplicas in Helm values
Spot unavailable - Switch to on-demand instances
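
When many pods are pending, reading kubectl describe output one pod at a time gets tedious. The sketch below summarizes the scheduler's reason for each pending pod from the JSON produced by kubectl get pods -n sie -o json. It relies only on standard Kubernetes pod fields (status.phase and the PodScheduled condition). pending_reasons is a hypothetical helper name.

```python
from typing import Dict, List


def pending_reasons(pod_list: dict) -> Dict[str, List[str]]:
    """Map each Pending pod name to the scheduler's stated reasons."""
    reasons: Dict[str, List[str]] = {}
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        reasons[pod["metadata"]["name"]] = [
            f'{cond.get("reason", "?")}: {cond.get("message", "")}'
            for cond in status.get("conditions", [])
            # PodScheduled=False means the scheduler could not place the pod.
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False"
        ]
    return reasons
```

Pipe the kubectl output into a script that calls pending_reasons(json.load(sys.stdin)). Messages that mention insufficient nvidia.com/gpu point to quota or node pool limits. An empty list for a pod means it was scheduled and is waiting on something else, such as an image pull.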

Getting Help

If your issue isn’t covered here:

Check server logs: docker logs <container> or kubectl logs -n sie -l app.kubernetes.io/component=worker
Use sie-top for real-time monitoring
Open an issue on GitHub