Upgrade Runbook

Procedure for upgrading an SIE cluster to a new release version. Covers Helm-managed deployments on GKE and EKS.

Components upgraded:

  • Gateway (Deployment) - stateless inference edge, fast restart
  • Config service (Deployment) - single-replica config control plane
  • Worker pools (StatefulSets) - GPU pods, model cache in emptyDir

Version management: SIE uses release-please for unified versioning. A single version (e.g., 0.1.6) is applied to the Helm chart (Chart.yaml appVersion), Python packages, the Rust gateway crate, and TypeScript packages. The CHANGELOG.md at the repo root documents all changes per release.
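Because one version is stamped across all artifacts, a quick pre-flight check can confirm that the chart you are about to deploy carries the target version. The following is a minimal sketch, not part of the SIE tooling: the helper name chart_app_version is hypothetical, and it assumes appVersion sits on a single top-level line of Chart.yaml (optionally quoted, optionally followed by a comment).

```shell
# Print the appVersion field of a Chart.yaml (hypothetical helper).
# Assumes one top-level "appVersion:" line; quotes and trailing comments are stripped.
chart_app_version() {
  sed -n 's/^appVersion:[[:space:]]*//p' "$1" \
    | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
    | tr -d "\"'" \
    | head -n 1
}
```

For example, comparing the output of chart_app_version deploy/helm/sie-cluster/Chart.yaml against the target version before running helm upgrade would catch a stale chart checkout early.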


Complete all items before starting the upgrade.

Read CHANGELOG.md for the target version. Pay attention to:

  • Breaking changes in the gateway, config API, or server API
  • Helm values changes (new required values, renamed keys, removed options)
  • Model config changes (new or removed models, adapter changes)
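To read only the entries for the target version, the relevant section can be cut out of CHANGELOG.md. This is a rough sketch under an assumption: it expects release-please style headings such as "## [0.1.7](...) (date)" or "## 0.1.7 (date)". The helper name changelog_section is hypothetical; adjust the pattern if the file is laid out differently.

```shell
# Print the CHANGELOG.md section for one version (hypothetical helper).
# Usage: changelog_section VERSION FILE
changelog_section() {
  version="$1"
  file="$2"
  awk -v ver="$version" '
    /^## / {
      # A release heading: keep printing only if it names the requested version.
      printing = (index($0, "[" ver "]") > 0 || index($0, " " ver " ") > 0)
    }
    printing { print }
  ' "$file"
}
```

Running changelog_section 0.1.7 CHANGELOG.md from the repo root would then print that release's heading and entries, stopping at the next release heading.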

# List the commits between the current and target versions (complements CHANGELOG.md)
git log v<CURRENT>..v<TARGET> --oneline

# Note current Helm release version
helm list -n sie
# Note current chart values (save for rollback reference)
helm get values sie -n sie -o yaml > /tmp/sie-values-backup.yaml
# Back up pool state (ConfigMaps + Leases in the sie namespace)
kubectl get configmap,lease -n sie -o yaml > /tmp/sie-pool-state-backup.yaml

# Record current image tags
kubectl get deployment -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.template.spec.containers[0].image}{"\n"}{end}'
kubectl get statefulset -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.template.spec.containers[0].image}{"\n"}{end}'
# Record Helm revision number
helm history sie -n sie --max 5

# All gateway pods should be Running and Ready
kubectl get pods -n sie -l app.kubernetes.io/component=gateway
# The config service should be Running and Ready
kubectl get pods -n sie -l app.kubernetes.io/component=config
# All worker pods should be Running and Ready (if not scaled to zero)
kubectl get pods -n sie -l app.kubernetes.io/component=worker

# Gateway readiness (returns {"status": "ready"})
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/readyz
# Gateway detailed health (returns worker count, GPU count, loaded models)
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/health
# Config service health
kubectl exec -n sie deploy/sie-sie-cluster-config -- wget -qO- http://localhost:8080/healthz

# KEDA ScaledObjects should not be in Fallback mode
kubectl get scaledobject -n sie
kubectl describe scaledobject -n sie | grep -A2 "Type.*Fallback"

# Check for recent errors in gateway and config logs
kubectl logs -n sie -l app.kubernetes.io/component=gateway --tail=50 | grep -i error
kubectl logs -n sie -l app.kubernetes.io/component=config --tail=50 | grep -i error
# Check for recent errors in worker logs
kubectl logs -n sie -l app.kubernetes.io/component=worker --tail=50 | grep -i error

# Prometheus is serving queries
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
  'http://localhost:9090/api/v1/query?query=up' 2>/dev/null | head -c 200
# Grafana is accessible
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Open http://localhost:3000 and verify SIE dashboards show data
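
If the checklist is scripted rather than run by hand, the readiness output can gate the rest of the procedure. A minimal sketch, assuming the /readyz body looks like the {"status": "ready"} shown in the comment above; the helper name is_ready is hypothetical.

```shell
# Succeed only if the JSON on stdin reports status "ready" (hypothetical helper).
is_ready() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ready"'
}
```

Piping the gateway /readyz command into is_ready and aborting on a non-zero exit status would stop an upgrade from starting against an unhealthy cluster.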

If running during active traffic, consider:

# Pause KEDA autoscaling to prevent scale events during upgrade.
# Each ScaledObject targets a specific StatefulSet, so freeze each one
# at its own replica count (pools may differ).
for so in $(kubectl get scaledobject -n sie -o jsonpath='{.items[*].metadata.name}'); do
  # Read the actual scale target from the ScaledObject spec
  sts=$(kubectl get scaledobject "$so" -n sie -o jsonpath='{.spec.scaleTargetRef.name}')
  replicas=$(kubectl get statefulset "$sts" -n sie -o jsonpath='{.spec.replicas}' 2>/dev/null)
  if [ -n "$replicas" ]; then
    kubectl annotate scaledobject "$so" -n sie \
      autoscaling.keda.sh/paused-replicas="$replicas" --overwrite
  fi
done

For clusters using custom image registries (not the default ghcr.io/superlinked), push the new images first:

# Build and push new images (adjust registry as needed)
REGISTRY="your-registry.example.com"
TAG="0.1.7" # Target version
# Build and push config + server bundle images in one bake invocation.
# Add --bake-bundles or --bake-platform if you need non-default coverage.
mise run docker -- \
  --bake \
  --bake-include-gateway \
  --bake-tag "$TAG" \
  --registry "$REGISTRY/" \
  --push

# Option: upgrade from a local chart checkout
# Dry-run first to preview changes (requires the helm-diff plugin)
helm diff upgrade sie deploy/helm/sie-cluster/ \
  -n sie \
  -f /tmp/sie-values-backup.yaml \
  --set workers.common.image.tag="<TARGET_VERSION>" \
  --set gateway.image.tag="<TARGET_VERSION>" \
  --set config.image.tag="<TARGET_VERSION>"
# Apply the upgrade (--wait blocks until pods are ready; --timeout guards against hangs)
helm upgrade sie deploy/helm/sie-cluster/ \
  -n sie \
  -f /tmp/sie-values-backup.yaml \
  --set workers.common.image.tag="<TARGET_VERSION>" \
  --set gateway.image.tag="<TARGET_VERSION>" \
  --set config.image.tag="<TARGET_VERSION>" \
  --wait --timeout 10m

# Option: upgrade from the published OCI chart
# Dry-run (requires the helm-diff plugin)
helm diff upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  -n sie \
  --version <TARGET_CHART_VERSION> \
  -f /tmp/sie-values-backup.yaml
# Apply
helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  -n sie \
  --version <TARGET_CHART_VERSION> \
  -f /tmp/sie-values-backup.yaml \
  --wait --timeout 10m

# Option: Terraform-managed clusters
# Update image tag in Terraform variables
# Edit your .tfvars or set TF_VAR:
export TF_VAR_sie_image_tag="<TARGET_VERSION>"
cd deploy/terraform/gcp/examples/<your-env>
terraform plan # Review changes
terraform apply # Apply

2.3 Expected Behavior During Rolling Update

Gateway (Deployment):

  • Kubernetes rolls out new gateway pods one at a time (default RollingUpdate strategy).
  • Startup probe gates the other probes: GET /healthz, periodSeconds: 5, failureThreshold: 12 (up to 60 s for boot).
  • Once startup passes, liveness polls GET /healthz every 10 s and readiness polls GET /readyz every 5 s. /readyz returns 200 even with zero connected workers: the gateway still accepts traffic and responds with 202 for cold-start cases.
  • The gateway is stateless on the request path; new pods come up in seconds.
  • Brief 503s are possible during the switchover window if all old pods are terminated before new ones pass readiness.

Config service (Deployment):

  • sie-config is intentionally single-replica because it owns serialized config writes and epoch bumps.
  • If the config-store PVC is enabled, the chart uses a Recreate strategy to avoid ReadWriteOnce mount conflicts.
  • The gateway keeps serving from its in-memory registry during a short config-service restart, but config writes and bootstrap/drift recovery depend on sie-config.

Workers (StatefulSets):

  • The default RollingUpdate strategy updates pods one at a time in reverse ordinal order. (podManagementPolicy: Parallel only affects pod ordering during scaling, not rolling updates.)
  • Worker terminationGracePeriodSeconds: 65.
  • preStop hook: sleep 10 - gives the K8s endpoints controller 10 seconds to remove the pod from the service before SIGTERM.
  • On SIGTERM, the server enters graceful shutdown: rejects new requests with 503 (with Retry-After: 5 header), drains in-flight requests (25-second timeout), then exits.
  • Readiness probe stops passing (/readyz returns 503) once shutdown begins, so the gateway stops treating the pod as available.
  • The gateway detects worker disconnection via WebSocket and removes it from the routing table.
  • New worker pods must download model weights if the emptyDir cache is empty (cache does not persist across pod restarts). Cold model loading can take 10-120 seconds depending on model size and cache state.
  • PodDisruptionBudget: maxUnavailable: 1 per worker pool - protects against external disruptions (e.g., kubectl drain, node autoscaler) but is not enforced by the StatefulSet controller during rolling updates.
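
The worker timings above can be checked with a little arithmetic. The sketch below uses only the numbers from the list. The rollout estimate is a rough upper bound under stated assumptions: pods restart strictly one at a time, each replacement needs the full 120 s before it is Ready, and scheduling and image pull time are ignored. The function name estimate_pool_rollout_seconds is hypothetical.

```shell
# Worker shutdown budget, using the values from the list above.
grace_period=65   # terminationGracePeriodSeconds
pre_stop=10       # preStop hook: sleep 10 (counts against the grace period)
drain_timeout=25  # in-flight drain timeout after SIGTERM

shutdown_worst_case=$(( pre_stop + drain_timeout ))
headroom=$(( grace_period - shutdown_worst_case ))
echo "worst-case graceful shutdown: ${shutdown_worst_case}s, headroom before SIGKILL: ${headroom}s"

# Rough upper bound for rolling one pool (an estimate only, see assumptions above).
estimate_pool_rollout_seconds() {
  replicas="$1"
  cold_load_max=120
  echo $(( replicas * (shutdown_worst_case + cold_load_max) ))
}
estimate_pool_rollout_seconds 4
```

The shutdown path needs at most 35 s, which leaves 30 s of headroom inside the 65 s grace period. Under the stated assumptions a four-replica pool comes out at 620 s, slightly more than the --timeout 10m passed to helm upgrade, so larger pools may warrant a longer timeout.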

Client Impact:

  • SDK clients with automatic retry handle 503s transparently.
  • Requests in flight during graceful shutdown complete normally (up to 25-second drain timeout).
  • If all workers in a pool are restarting simultaneously, the gateway returns 202 Accepted (provisioning), and the SDK retries with backoff.
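
Clients that call the HTTP API directly, without the SDK, need to handle these responses themselves. The following is a generic sketch of retrying on 202 and 503 with exponential backoff; it is not the SDK's implementation, the helper names should_retry and retry_request are hypothetical, and it doubles a fixed base delay rather than reading the Retry-After header.

```shell
# Decide whether a status code seen during a rollout is worth retrying.
should_retry() {
  case "$1" in
    202|503) return 0 ;;  # provisioning or draining: try again later
    *)       return 1 ;;
  esac
}

# Run a command that prints an HTTP status code, retrying with exponential
# backoff while should_retry accepts the code.
# Usage: retry_request MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_request() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    code=$("$@")
    if ! should_retry "$code"; then
      echo "$code"          # final, non-retryable status
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "$code"          # gave up: last retryable status
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

The wrapped command could be a curl call that prints only the status code (curl -s -o /dev/null -w '%{http_code}' ...). A base delay of 5 lines up with the Retry-After: 5 header that draining workers send.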

# Watch gateway and config rollouts
kubectl rollout status deployment/sie-sie-cluster-gateway -n sie --timeout=120s
kubectl rollout status deployment/sie-sie-cluster-config -n sie --timeout=120s
# Watch worker rollouts (one per pool)
kubectl get statefulsets -n sie -w
# Watch all pods
kubectl get pods -n sie -w

# Check KEDA ScaledObjects are still healthy (not Fallback)
# (the column spec is quoted so the shell does not try to expand the [0])
kubectl get scaledobject -n sie -o custom-columns='NAME:.metadata.name,READY:.status.conditions[0].status,MIN:.spec.minReplicaCount,MAX:.spec.maxReplicaCount,REPLICAS:.status.currentReplicas'
# Watch gateway logs for errors during transition
kubectl logs -n sie -l app.kubernetes.io/component=gateway -f --tail=20

# All pods Running and Ready
kubectl get pods -n sie
# Expected: gateway/config pods 1/1 Ready, all active worker pods 1/1 Ready
# Verify new image tags are deployed
kubectl get pods -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'

# Readiness check
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/readyz
# Expected: {"status": "ready"}
# Detailed health (worker count, models, GPU types)
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/health
# Expected: "status": "healthy", worker_count > 0 (if pools not scaled to zero)
# Config service is healthy
kubectl exec -n sie deploy/sie-sie-cluster-config -- wget -qO- http://localhost:8080/healthz
# Model catalog is available
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/v1/models | head -c 500

# Port-forward to gateway
kubectl port-forward -n sie svc/sie-sie-cluster-gateway 8080:8080 &
# Test encode request (requires a running worker with GPU)
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'upgrade verification test'})
print(f'Dense embedding dim: {len(result[\"dense\"])}')
print('SUCCESS: Encode request returned 200')
"
# Or with curl (JSON fallback):
curl -s -X POST http://localhost:8080/v1/encode/BAAI%2Fbge-m3 \
  -H "Content-Type: application/json" \
  -d '{"items": [{"text": "upgrade verification test"}]}' | python3 -m json.tool | head -5

# Unpause KEDA if paused in step 1.5
kubectl annotate scaledobject -n sie --all autoscaling.keda.sh/paused-replicas- --overwrite
# Verify ScaledObjects are Ready (not Fallback)
kubectl get scaledobject -n sie
kubectl describe scaledobject -n sie | grep -A3 "Conditions:"
# Expected: Ready=True, Active depends on load, Fallback=False

# Verify Prometheus is scraping the new pods
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
  'http://localhost:9090/api/v1/query?query=sie_requests_total' 2>/dev/null | python3 -m json.tool | head -20
# Verify gateway metrics
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
  'http://localhost:9090/api/v1/query?query=sie_gateway_requests_total' 2>/dev/null | python3 -m json.tool | head -20
# Check Grafana dashboards show data for new pods
# Port-forward: kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Navigate to SIE > Cluster Overview dashboard

# Check Helm release version
helm list -n sie
# Expected: Chart version and App version match target
# Check the server version header on a response
curl -s -I http://localhost:8080/healthz | grep -i x-sie
# Expected: X-SIE-Server-Version: <TARGET_VERSION>

# List Helm release history
helm history sie -n sie --max 10
# Note the REVISION number of the last known-good release
# Roll back to a specific revision
helm rollback sie <REVISION> -n sie
# Or roll back to the immediately previous revision (no revision argument)
helm rollback sie -n sie

For Terraform-managed clusters:

# Revert image tag to previous version
export TF_VAR_sie_image_tag="<PREVIOUS_VERSION>"
cd deploy/terraform/gcp/examples/<your-env>
terraform apply

# Watch the rollback proceed
kubectl rollout status deployment/sie-sie-cluster-gateway -n sie --timeout=120s
kubectl get pods -n sie -w
# Verify old image is restored
kubectl get pods -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'
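
Scanning the image listing by eye gets error-prone with many pods. As a sketch, the "pod-name: image" lines printed by the jsonpath command above can be filtered for pods that are not on the expected tag. The helper name pods_not_on_tag is hypothetical, and it assumes tag-based image references (name:tag), not digests.

```shell
# Read "pod-name: image" lines on stdin and print the pods whose image tag
# differs from the expected one. Exit status is 0 only if every pod matches.
# Usage: ... | pods_not_on_tag EXPECTED_TAG
pods_not_on_tag() {
  expected="$1"
  mismatches=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    image="${line#*: }"   # drop the "pod-name: " prefix
    tag="${image##*:}"    # keep what follows the last colon
    if [ "$tag" != "$expected" ]; then
      echo "$line"
      mismatches=$(( mismatches + 1 ))
    fi
  done
  [ "$mismatches" -eq 0 ]
}
```

Piping the kubectl listing into pods_not_on_tag "<PREVIOUS_VERSION>" after a rollback, or the target version after an upgrade, prints only the pods that still need attention.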

Run the same post-upgrade verification steps:

# Gateway health
kubectl exec -n sie deploy/sie-sie-cluster-gateway -- wget -qO- http://localhost:8080/readyz

# Encode smoke test
kubectl port-forward -n sie svc/sie-sie-cluster-gateway 8080:8080 &
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'rollback verification'})
print(f'Dense dim: {len(result[\"dense\"])} - SUCCESS')
"

# KEDA health
kubectl get scaledobject -n sie

  • No database migrations: Workers use emptyDir for model cache, and the gateway stores pool state in ConfigMaps with Leases for TTL. sie-config persists model config YAML and an epoch counter only when its config store is enabled.
  • Model cache invalidation: Worker pods use emptyDir volumes for the HuggingFace model cache. Rolling back means new pods start with an empty cache and must re-download model weights on first request. If cluster cache (S3/GCS) is configured, downloads come from there instead of HuggingFace Hub.
  • Pool state: Resource pools are stored as ConfigMaps in the sie namespace. Pool leases survive upgrades and rollbacks. Active pools will continue to work, but if the pool API changed between versions, clients may need to recreate pools.
  • KEDA ScaledObjects: Helm rollback re-applies the previous ScaledObject definitions. If KEDA version requirements changed between SIE versions, verify ScaledObjects are not in Fallback mode after rollback.
  • Config drift: If the upgrade included changes to embedded model or bundle configs (baked into the Helm chart files/ directory), rollback restores the previous configs. Ensure the previous configs are compatible with the previous server version.
  • SDK version compatibility: The gateway returns X-SIE-Server-Version headers. If clients upgraded their SDK alongside the server, a server rollback may trigger version mismatch warnings in the SDK logs. The SDK remains functional but logs warnings for major.minor mismatches.
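
To anticipate those warnings before rolling back, the server and SDK versions can be compared on major.minor. This is an illustrative sketch, not the SDK's actual comparison logic. The helper name same_major_minor is hypothetical, and it assumes plain MAJOR.MINOR.PATCH strings with an optional leading "v".

```shell
# Succeed if two versions share the same major.minor (hypothetical helper).
same_major_minor() {
  a="${1#v}"   # strip an optional leading "v"
  b="${2#v}"
  [ "${a%.*}" = "${b%.*}" ]   # compare everything before the last dot
}
```

For instance, same_major_minor 0.1.6 0.1.7 succeeds, while same_major_minor 0.1.6 0.2.0 fails, which is the case where SDK log warnings would be expected.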

| Resource | Namespace | Type | Purpose |
|---|---|---|---|
| sie-sie-cluster-gateway | sie | Deployment | Stateless request gateway (2+ replicas) |
| sie-sie-cluster-config | sie | Deployment | Single-writer config control plane |
| sie-sie-cluster-worker-<pool> | sie | StatefulSet | GPU worker pool (one per pool) |
| sie-sie-cluster-worker | sie | Service (headless) | Worker DNS discovery |
| sie-sie-cluster-gateway | sie | Service (ClusterIP) | Gateway endpoint |
| sie-sie-cluster-config | sie | Service (ClusterIP) | Internal config API |
| sie-sie-cluster-worker-<pool>-scaler | sie | ScaledObject | KEDA autoscaler per pool |
| sie-sie-cluster-worker-<pool> | sie | PodDisruptionBudget | maxUnavailable: 1 per pool |
| sie-sie-cluster-gpu-config | sie | ConfigMap | Available GPU types / machine profiles |
| sie-sie-cluster-config | sie | ConfigMap | Shared cluster configuration |

| Endpoint | Component | Returns |
|---|---|---|
| GET /healthz | Gateway | {"status": "ok"} - liveness probe |
| GET /readyz | Gateway | {"status": "ready"} - readiness probe |
| GET /health | Gateway | Detailed cluster status (worker count, GPUs, models) |
| GET /healthz | Config service | {"status": "ok"} - liveness probe |
| GET /healthz | Worker | "ok" - liveness probe |
| GET /readyz | Worker | "ok" or 503 - readiness probe |
| GET /metrics | Both | Prometheus metrics |

| Dashboard | Purpose |
|---|---|
| Cluster Overview | QPS, latency (p50/p95/p99), GPU utilization |
| Model Performance | Per-model latency, throughput, batch sizes |
| Worker Health | Per-worker CPU/memory, GPU temp, queue depth |
