Troubleshooting
Connection Issues
Connection refused / timeouts
Symptoms: `ConnectionError`, `ECONNREFUSED`, or request timeouts.
Causes and fixes:
- Server not running - Start with `docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default` or `sie-server serve`
- Wrong port - Default is 8080. Check with `curl http://localhost:8080/healthz`
- Firewall/security group - Ensure port 8080 is open for your network
- Docker networking - Use `--network host` or ensure port mapping is correct (`-p 8080:8080`)
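
The checks above can be scripted. The following is a minimal sketch using only the Python standard library; `check_health` and `classify_failure` are hypothetical helper names, and treating HTTP 200 from `/healthz` as "healthy" is an assumption. A refused connection usually points at the first, second, or fourth cause, while a timeout more often points at a firewall silently dropping traffic.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc: BaseException) -> str:
    """Map a low-level exception to one of the causes listed above."""
    # urllib wraps socket-level errors in URLError; unwrap to the real reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return "refused: server not running, wrong port, or port not mapped"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout: firewall or security group may be dropping traffic"
    return f"other: {reason!r}"


def check_health(base_url: str = "http://localhost:8080", timeout_s: float = 3.0) -> str:
    """Probe /healthz and return 'ok' or a classified failure string."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout_s) as resp:
            # Assumption: a healthy server answers 200 on /healthz.
            return "ok" if resp.status == 200 else f"unhealthy: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return f"unhealthy: HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return classify_failure(exc)
```

Calling `check_health()` with no arguments probes the default port 8080 on localhost.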
503 for unconfigured machine profile
Context: Kubernetes deployment with gateway.
Cause: The request pins a machine profile that is not configured in the cluster, or the queue/config path needed to route the request is unavailable. Normal scale-from-zero returns 202 Accepted, not 503.
Fix: Use a configured machine profile:

    curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
      -H "X-SIE-MACHINE-PROFILE: l4" \
      -H "Content-Type: application/json" \
      -d '{"items": [{"text": "Hello world"}]}'

Or omit `gpu` and let the gateway resolve the model’s default route.

Python:

    result = client.encode("BAAI/bge-m3", Item(text="hello"))

TypeScript:

    const result = await client.encode("BAAI/bge-m3", { text: "hello" });

See Scale-from-Zero for the full autoscaling flow.
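
For callers that use the HTTP API without the SDK, the curl request in this section can be built from Python's standard library. This is a sketch under assumptions: `build_encode_request` is a hypothetical helper, not part of the SIE SDK, and the path, header name, and body shape are copied from the curl example. Leaving `machine_profile` unset omits the header, which is presumably the HTTP counterpart of omitting `gpu` in the SDK.

```python
import json
import urllib.request
from typing import Optional


def build_encode_request(
    base_url: str,
    model: str,
    texts: list[str],
    machine_profile: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/encode/<model>."""
    headers = {"Content-Type": "application/json"}
    if machine_profile is not None:
        # Pin a profile only when it is known to be configured in the cluster.
        headers["X-SIE-MACHINE-PROFILE"] = machine_profile
    body = json.dumps({"items": [{"text": t} for t in texts]}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/encode/{model}", data=body, headers=headers, method="POST"
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it. Building the request separately makes it easy to log the exact headers when diagnosing a 503.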
202 responses that never resolve
Context: Kubernetes with KEDA scale-to-zero.
Causes:
- Timeout too short - Cold starts take 5-7 minutes. Set `provision_timeout_s=420`
- Spot GPUs unavailable - Try on-demand (`l4` instead of `l4-spot`)
- KEDA not running - Check: `kubectl get pods -n keda`
- Prometheus unreachable - KEDA needs metrics: `kubectl get pods -n monitoring`
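
To see why a short timeout produces 202s that never resolve, it helps to picture the wait as a resend loop with a deadline. The sketch below is not the SDK's implementation: `wait_for_capacity` here is a hypothetical standalone function, `send` is any callable you supply that performs one request and returns `(status, body)`, and the 5-second poll interval and the "200 means ready" rule are assumptions.

```python
import time
from typing import Callable, Tuple


def wait_for_capacity(
    send: Callable[[], Tuple[int, bytes]],
    provision_timeout_s: float = 420.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Resend a request while the gateway answers 202, up to a deadline."""
    deadline = clock() + provision_timeout_s
    while True:
        status, body = send()
        if status == 200:
            return body  # a worker is up and served the request
        if status != 202:
            raise RuntimeError(f"unexpected HTTP status {status}")
        # Still provisioning: give up if another wait would pass the deadline.
        if clock() + poll_interval_s > deadline:
            raise TimeoutError(f"no capacity after {provision_timeout_s:.0f}s of 202 responses")
        sleep(poll_interval_s)
```

With a deadline shorter than the 5-7 minute cold start, the loop raises `TimeoutError` before any worker is ready, which matches the first cause in the list.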
Python:

    # Recommended: use SDK with generous timeout
    result = client.encode(
        "BAAI/bge-m3",
        Item(text="hello"),
        gpu="l4",
        wait_for_capacity=True,
        provision_timeout_s=420,
    )

TypeScript:

    // Recommended: use SDK with generous timeout
    const result = await client.encode(
      "BAAI/bge-m3",
      { text: "hello" },
      { gpu: "l4", waitForCapacity: true, provisionTimeout: 420_000 },
    );

Model Issues
Model not found
Symptoms: 404 Not Found or “model not available” error.
Causes and fixes:
- Wrong model name - Use the SIE model ID (e.g., `BAAI/bge-m3`), not a custom alias. Check available models: `curl http://localhost:8080/v1/models`
- Wrong bundle - Most models (including GLiNER and Florence-2) run on the `default` bundle; large LLM embeddings require the `sglang` bundle. See Bundles
- Model filter active - If `SIE_MODEL_FILTER` is set, only listed models are available
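
When the cause is a mistyped model name, comparing the requested ID against the list of available IDs often finds it quickly. The sketch below assumes you have already extracted the IDs from the `/v1/models` response into a plain list (the response schema is not covered here); `suggest_model_ids` is a hypothetical helper built on the standard library's `difflib`.

```python
import difflib


def suggest_model_ids(requested: str, available: list[str], limit: int = 3) -> list[str]:
    """Return available model IDs that closely match a possibly mistyped one."""
    # Compare case-insensitively, but return the IDs in their original casing.
    by_lower = {model_id.lower(): model_id for model_id in available}
    if requested.lower() in by_lower:
        return [by_lower[requested.lower()]]
    close = difflib.get_close_matches(requested.lower(), list(by_lower), n=limit, cutoff=0.6)
    return [by_lower[key] for key in close]
```

An empty result suggests the model is genuinely unavailable on this server, so the bundle and model filter are the next things to check.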
Model loading is slow
Context: First request to a model takes a long time.
Expected behavior: Models load on-demand. First request downloads weights (if not cached) and loads to GPU. Subsequent requests are fast.
| Scenario | Expected Time |
|---|---|
| Weights cached, loading to GPU | 10-30s (small model), 30-120s (large model) |
| Downloading from HuggingFace | 1-10 minutes depending on model size and network |
| Downloading from cluster cache (S3/GCS) | 30s-3 minutes |
Speed up loading:
- Mount a persistent HuggingFace cache: `-v ~/.cache/huggingface:/app/.cache/huggingface`
- Use cluster cache: `SIE_CLUSTER_CACHE=s3://bucket/weights`
- Pre-warm models by sending a dummy request at startup
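
Pre-warming can be as simple as one throwaway request per model during application startup. In this sketch, `prewarm` is a hypothetical helper and `encode_once` is a callable you supply, so the code does not depend on any particular client; it records how long each first request took, which is useful to compare against the table above.

```python
import time
from typing import Callable, Dict, Iterable


def prewarm(models: Iterable[str], encode_once: Callable[[str], object]) -> Dict[str, float]:
    """Send one dummy request per model and return the elapsed seconds for each."""
    timings: Dict[str, float] = {}
    for model in models:
        start = time.monotonic()
        try:
            encode_once(model)  # the first call triggers download and GPU load
            timings[model] = time.monotonic() - start
        except Exception as exc:  # keep warming the remaining models
            print(f"pre-warm failed for {model}: {exc}")
            timings[model] = float("nan")
    return timings
```

With the SDK objects used elsewhere on this page, a call might look like `prewarm(["BAAI/bge-m3"], lambda m: client.encode(m, Item(text="warmup")))`.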
GPU Issues
Docker GPU not detected
Symptoms: Server falls back to CPU, or `--gpus all` fails.
Fixes:
- Install NVIDIA Container Toolkit:

      # Ubuntu/Debian
      sudo apt-get install -y nvidia-container-toolkit
      sudo systemctl restart docker

- Verify GPU access: `docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`
- Use the `--gpus all` flag: `docker run --gpus all -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cuda12-default`
Out of memory (OOM)
Symptoms: `CUDA out of memory`, process killed, or pod evicted.
Causes and fixes:
- Model too large for GPU - Check model size vs GPU VRAM in Resources
- Too many models loaded - Lower `SIE_MEMORY_PRESSURE_THRESHOLD_PERCENT` (default: 85) to trigger eviction earlier
- Batch size too large - Reduce `SIE_MAX_BATCH_REQUESTS` (default: 64)
- Memory leak - Restart the server; report the issue if reproducible
Slow inference
Possible causes:
- CPU fallback - Server may be running on CPU. Check with `sie-top` or WebSocket status
- Wrong attention backend - Flash Attention 2 is fastest on Ampere+ GPUs. Set `SIE_ATTENTION_BACKEND=flash_attention_2`
- Small batches - Low concurrency means small batches. Increase `SIE_MAX_BATCH_WAIT_MS` to wait longer for batch fill
- Preprocessing bottleneck - For vision models, increase `SIE_IMAGE_WORKERS` (default: 4)
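
The "small batches" cause can be estimated with a back-of-the-envelope model. Assuming requests arrive at a steady rate and a batch closes after the wait window, a batch holds the first request plus whatever arrives during the wait, capped at the maximum batch size. This is a deliberately simplified model for reasoning about the setting, not a description of the server's actual scheduler; `expected_batch_size` is a hypothetical name.

```python
def expected_batch_size(requests_per_s: float, max_wait_ms: float, max_batch: int = 64) -> float:
    """Estimate batch size under steady arrivals: first request + arrivals in the wait window."""
    arrivals_during_wait = requests_per_s * max_wait_ms / 1000.0
    # The cap mirrors SIE_MAX_BATCH_REQUESTS (default: 64).
    return min(float(max_batch), 1.0 + arrivals_during_wait)
```

Under this model, at 20 requests per second a 10 ms wait yields batches of about 1.2 requests, while a 100 ms wait yields about 3, at the cost of up to 100 ms of added latency per request.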
LoRA Issues
LoRA loading timeout
Symptoms: Request hangs or times out when using a LoRA adapter.
Causes:
- LoRA too large - Large adapters take longer to download and load
- Incompatible base model - LoRA must match the base model architecture
- Cache full - `SIE_MAX_LORAS_PER_MODEL` (default: 10) exceeded, triggering eviction + reload
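
The "cache full" cause is easiest to see with a toy model. The sketch below assumes least-recently-used eviction, which is a common policy but is an assumption here, not a statement about how the server evicts adapters; `LoraCache` is a hypothetical class. It counts how many slow load operations a sequence of requests triggers.

```python
from collections import OrderedDict


class LoraCache:
    """Toy adapter cache: evicts the least recently used adapter when full."""

    def __init__(self, max_loras: int = 10) -> None:
        self.max_loras = max_loras
        self.loaded: "OrderedDict[str, bool]" = OrderedDict()
        self.load_count = 0  # number of (slow) download + load operations

    def get(self, lora_id: str) -> None:
        if lora_id in self.loaded:
            self.loaded.move_to_end(lora_id)  # cache hit: mark as most recent
            return
        if len(self.loaded) >= self.max_loras:
            self.loaded.popitem(last=False)  # evict the least recently used
        self.loaded[lora_id] = True
        self.load_count += 1
```

In this model, cycling through one more adapter than the cache can hold turns every request into a reload, which would show up as the hangs described above.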
LoRA adapter not found
Fix: Ensure the LoRA ID is a valid HuggingFace repo:

Python:

    result = client.encode(
        "BAAI/bge-m3",
        Item(text="hello"),
        options={"lora_id": "username/my-lora-adapter"}
    )

TypeScript:

    const result = await client.encode(
      "BAAI/bge-m3",
      { text: "hello" },
      // `options` is supported at the wire level; cast until the TS type adds it.
      { options: { lora_id: "username/my-lora-adapter" } } as never,
    );

Gated Model Access
“Access denied” or 401 for gated models
Cause: Some HuggingFace models require manual approval and a token.
Fixes:
- Accept the model’s license on HuggingFace (visit the model page)
- Set your HuggingFace token:

      # Docker
      docker run --gpus all -p 8080:8080 \
        -e HF_TOKEN=hf_your_token_here \
        ghcr.io/superlinked/sie-server:latest-cuda12-default

      # Local
      export HF_TOKEN=hf_your_token_here
      sie-server serve

- For Kubernetes, create a secret:

      kubectl create secret generic hf-token \
        --from-literal=token=hf_your_token_here \
        -n sie

Kubernetes Issues
Workers scale up then immediately down
Cause: Requests stopped before the worker finished its cold start, so KEDA saw demand drop to 0.
Fix: Keep sending requests for the full cold start duration (5-7 minutes), or use the SDK with wait_for_capacity=True.
Different bundles not scaling
Context: Default-bundle requests work fine, but requests routed to a different bundle worker (e.g. `sglang`) return only 202s.
Cause: Each bundle scales independently. A warm default worker does not make an sglang worker ready; each bundle’s worker pool cold-starts on its own.
Fix: Send the request with wait_for_capacity=True and provision_timeout_s=420. The target bundle’s worker pool will scale up independently.
Pods stuck in Pending
Causes:
- No GPU quota - Check: `kubectl describe pod <pod-name> -n sie`
- Node pool at max - Increase `maxReplicas` in Helm values
- Spot unavailable - Switch to on-demand instances
Getting Help
If your issue isn’t covered here:
- Check server logs: `docker logs <container>` or `kubectl logs -n sie -l app.kubernetes.io/component=worker`
- Use `sie-top` for real-time monitoring
- Open an issue on GitHub