Kubernetes in GCP

Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.

SIE runs as a gateway/config/worker architecture on Kubernetes:

[Figure: GKE cluster architecture with Gateway, Config service, L4 and A100 worker pools, KEDA, and Prometheus]

Components:

  • Gateway - Stateless Rust inference edge that routes requests to GPU-specific worker pools through NATS JetStream
  • Config service - Single-writer control plane for runtime model configuration
  • Worker Pools - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
  • KEDA - Scales worker pools from zero based on queue depth metrics
  • Prometheus - Provides metrics for autoscaling decisions

The gateway is a stateless Rust service that handles GPU-aware routing:

| Feature | Description |
| --- | --- |
| GPU Routing | Routes requests to appropriate GPU pool via X-SIE-MACHINE-PROFILE header |
| Pool Routing | Supports tenant isolation via X-SIE-Pool header |
| Queue Routing | Publishes work to the selected pool’s NATS JetStream queue |
| Config Reads | Mirrors model and bundle state from sie-config |
| 202 Responses | Returns Retry-After when GPU capacity is provisioning |

The gateway runs as a Deployment with 2+ replicas for high availability.

gateway:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"

Each GPU type runs as a separate StatefulSet with persistent storage for model caching.

| Pool | GPU | VRAM | Use Case |
| --- | --- | --- | --- |
| l4 | NVIDIA L4 | 24GB | Standard inference, best price/performance |
| a100-40gb | NVIDIA A100 | 40GB | Large models, high throughput |
| a100-80gb | NVIDIA A100 | 80GB | Very large models (7B+ parameters) |
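
As a rough illustration of how the table can guide pool selection, the sketch below picks the smallest pool whose VRAM meets a caller-supplied requirement. The helper `smallest_fitting_pool` is hypothetical and not part of SIE, and the required-VRAM figure is assumed to be your own estimate for the model.

```python
# VRAM per pool in GB, copied from the table above.
POOL_VRAM_GB = {"l4": 24, "a100-40gb": 40, "a100-80gb": 80}


def smallest_fitting_pool(required_vram_gb: float) -> str:
    """Return the pool with the least VRAM that still meets the requirement.

    Hypothetical helper for illustration; SIE itself does not expose it.
    """
    candidates = [
        (vram, name)
        for name, vram in POOL_VRAM_GB.items()
        if vram >= required_vram_gb
    ]
    if not candidates:
        raise ValueError(f"no pool offers {required_vram_gb} GB of VRAM")
    # Tuples sort by VRAM first, so min() yields the smallest fitting pool.
    return min(candidates)[1]


print(smallest_fitting_pool(16))  # l4
print(smallest_fitting_pool(30))  # a100-40gb
```

VRAM is only one input; throughput and price may still favor a different pool.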

Worker configuration:

workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0  # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"

Workers use a 300Gi emptyDir volume for model cache. Models load on first request.


Specify the target GPU type using the X-SIE-MACHINE-PROFILE header or SDK parameter.

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4"
)

# Route to A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb"
)

| GPU Type | Header Value | Machine Type |
| --- | --- | --- |
| NVIDIA L4 | l4 | g2-standard-8 |
| NVIDIA A100 40GB | a100-40gb | a2-highgpu-1g |
| NVIDIA A100 80GB | a100-80gb | a2-ultragpu-1g |

Resource pools provide tenant isolation by reserving dedicated workers.

Create a pool explicitly (created lazily on first request):

from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4"  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")

Use the X-SIE-Pool header:

curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
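
The SDK's `pool_name/gpu_type` string and the two HTTP headers appear to carry the same information. The sketch below shows one way to translate between them; the helper `headers_for_target` is hypothetical, and whether the SDK performs exactly this mapping internally is an assumption.

```python
def headers_for_target(target: str) -> dict:
    """Map 'gpu_type' or 'pool_name/gpu_type' to the routing headers."""
    # rpartition splits on the last "/", so a bare "l4" yields an empty pool.
    pool, sep, gpu_type = target.rpartition("/")
    if not gpu_type or (sep and not pool):
        raise ValueError(
            f"expected 'gpu_type' or 'pool_name/gpu_type', got {target!r}"
        )
    headers = {"X-SIE-MACHINE-PROFILE": gpu_type}
    if pool:
        headers["X-SIE-Pool"] = pool
    return headers


print(headers_for_target("tenant-abc/l4"))
# {'X-SIE-MACHINE-PROFILE': 'l4', 'X-SIE-Pool': 'tenant-abc'}
```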


KEDA scales worker pools based on queue depth metrics from Prometheus.

When no workers are running and a request arrives:

  1. Gateway returns 202 Accepted with Retry-After: 120 header
  2. Gateway records pending demand metric
  3. KEDA detects queue depth > activation threshold
  4. GKE provisions GPU node (60-120 seconds)
  5. Worker pod starts and registers with the gateway
  6. Client retries and request succeeds
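
The client side of the steps above can be sketched as a retry loop that honors Retry-After. This is a minimal sketch for clients that do not use the SDK: `post_until_ready` and its parameters are hypothetical names, and `send` is assumed to be a callable that performs one HTTP request and returns `(status, headers, body)`, with header names spelled as shown.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def post_until_ready(
    send: Callable[[], Response],
    timeout_s: float = 600.0,
    default_wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Response:
    """Call send() until it returns something other than 202, or time out."""
    deadline = clock() + timeout_s
    while True:
        status, headers, body = send()
        if status != 202:
            return status, headers, body
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            # Retry-After may also be an HTTP date; fall back to the default.
            wait_s = default_wait_s
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("capacity was not provisioned before the timeout")
        # Never sleep past the deadline, and never pass a negative duration.
        sleep(min(max(wait_s, 0.0), remaining))
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; in production the defaults are enough.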

autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 900          # Wait 15 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable
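
To make the two thresholds concrete, the sketch below approximates the steady-state replica count for a given queue depth. It is a simplification under stated assumptions: it ignores polling, cooldown, and stabilization windows, it treats activation as "strictly greater than the activation value" (KEDA's documented behavior for activation thresholds), and how the chart maps these settings onto KEDA is not confirmed here. The function `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(
    queue_depth: int,
    threshold: int = 10,   # queueDepthThreshold: pending requests per pod
    activation: int = 2,   # queueDepthActivation
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Approximate the replica count the autoscaler converges to."""
    if queue_depth <= activation:
        # Below the activation value the pool may return to its minimum.
        return min_replicas
    # One pod per `threshold` pending requests, and at least one when active.
    wanted = max(1, math.ceil(queue_depth / threshold))
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 2, 3, 25, 500):
    print(depth, desired_replicas(depth))
# 0 0
# 2 0
# 3 1
# 25 3
# 500 10
```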

When scaling from zero, expect these timelines:

| Phase | Duration | What Happens |
| --- | --- | --- |
| Node provisioning | 2-5 min | GKE finds a GPU node (spot may take longer) |
| Container startup | 20-40s | Pull image, start process |
| Model loading | 10-120s | Load weights to GPU (from cache or HuggingFace) |

Total: 3-7 minutes from first request to first response. See Scale-from-Zero for the full flow and troubleshooting.
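
The total can be checked by summing the per-phase ranges from the table. The short calculation below does that; it assumes the phases run strictly one after another.

```python
# (best case, worst case) in seconds, taken from the table above.
PHASES_S = {
    "node provisioning": (120, 300),
    "container startup": (20, 40),
    "model loading": (10, 120),
}

best_s = sum(low for low, _ in PHASES_S.values())
worst_s = sum(high for _, high in PHASES_S.values())

print(f"{best_s / 60:.1f} to {worst_s / 60:.1f} minutes")  # 2.5 to 7.7 minutes
```

The result lines up with the rounded 3-7 minute figure, and any client-side timeout should sit above the worst case.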

GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:

  • Consistent traffic: Lower cooldown (300s) for responsive scaling
  • Bursty traffic: Higher cooldown (900s) to avoid thrashing
  • Dev/test: Use spot instances for 60-70% cost savings

The examples/dev-l4-spot example in superlinked/terraform-google-sie provisions a complete GKE cluster with an L4 spot GPU pool via the published superlinked/sie/google Terraform registry module.

  1. GCP project with billing enabled.

  2. GPU quota for nvidia-l4 in your region:

    gcloud compute regions describe REGION \
    --format='table(quotas.filter(metric:NVIDIA))'

    The dev-l4-spot example uses spot, so look for PREEMPTIBLE_NVIDIA_L4_GPUS. The limit must cover the example’s max of 5 nodes × 1 GPU, so it needs to be at least 5.

  3. Required APIs enabled:

    gcloud services enable \
      container.googleapis.com \
      compute.googleapis.com \
      artifactregistry.googleapis.com \
      iam.googleapis.com

  4. Authenticated:

    gcloud auth application-default login
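
The quota prerequisite can also be checked programmatically. The sketch below assumes you have the parsed JSON from `gcloud compute regions describe REGION --format=json`, whose `quotas` list holds entries with `metric`, `limit`, and `usage`. The function `gpu_quota_ok` is hypothetical, and the sample data is made up for illustration.

```python
def gpu_quota_ok(
    region_info: dict,
    max_nodes: int,
    gpus_per_node: int = 1,
    metric: str = "PREEMPTIBLE_NVIDIA_L4_GPUS",
) -> bool:
    """Return True if the unused quota covers max_nodes * gpus_per_node."""
    needed = max_nodes * gpus_per_node
    for quota in region_info.get("quotas", []):
        if quota.get("metric") == metric:
            available = quota["limit"] - quota.get("usage", 0)
            return available >= needed
    # The metric is missing entirely, so treat it as no quota.
    return False


# Illustrative data only, not real project output.
sample = {"quotas": [{"metric": "PREEMPTIBLE_NVIDIA_L4_GPUS", "limit": 8.0, "usage": 2.0}]}
print(gpu_quota_ok(sample, max_nodes=5))  # True
```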

git clone https://github.com/superlinked/terraform-google-sie.git
cd terraform-google-sie/examples/dev-l4-spot

# Set project ID
export TF_VAR_project_id="your-project-id"

# Initialize Terraform
terraform init

# Review changes
terraform plan

# Deploy cluster (15-20 minutes)
terraform apply

# Get credentials
$(terraform output -raw kubectl_command)

# Verify cluster
kubectl get nodes

Key configuration options for the superlinked/sie/google module:

| Variable | Default | Description |
| --- | --- | --- |
| project_id | (required) | GCP project ID |
| region | us-central1 | GKE cluster region |
| cluster_name | sie-dev | Name of the GKE cluster |
| gpu_node_pools | L4 pool | List of GPU node pool configurations |
| create_artifact_registry | true | Provision an Artifact Registry for custom images |
| deployer_service_account | "" | Email of the SA running Terraform (optional, for CI/CD) |

module "sie_gke" {
  source  = "superlinked/sie/google"
  version = "0.3.4"

  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1 # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]
}

Deploy SIE to an existing GKE cluster using Helm. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to install: false. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and the observability stack described elsewhere on this page, add the following to the install command:

  --set keda.install=true \
  --set autoscaling.enabled=true \
  --set kube-prometheus-stack.install=true \
  --set dcgm-exporter.install=true

Prerequisites:

  • GKE cluster with GPU node pools (the Terraform setup above creates this)
  • HF_TOKEN exported if you need gated models. Optional for the BAAI/bge-m3 smoke test; in that case, omit both --set hfToken.create=true and --set hfToken.value=... entirely (leaving HF_TOKEN unset with the flags present creates an empty-token secret that will fail later on any gated-model request).

Extract the Workload Identity service-account email from the terraform output and wire it into the chart via --set. The example also enables the L4 worker pool explicitly — the chart’s worker pools default to enabled: false.

# The `workload_identity_annotation` output is the full `key=email` pair;
# strip the prefix to get just the SA email for the --set value.
WI_SA=$(terraform output -raw workload_identity_annotation | cut -d= -f2)

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.iam\.gke\.io/gcp-service-account=$WI_SA" \
  --set workers.pools.l4.enabled=true \
  --set workers.pools.l4.minReplicas=1 \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w

minReplicas: 1 keeps one L4 worker always running, which is the simplest path to a working smoke test without KEDA installed. For true scale-from-zero, additionally pass --set keda.install=true --set autoscaling.enabled=true and set minReplicas: 0.
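
If you script the install, the empty-token pitfall can be avoided by adding the hfToken flags only when a token is actually set. The sketch below builds the argument list for the command above; `helm_install_args` is a hypothetical helper, and it only assembles arguments without running Helm.

```python
import os
from typing import List, Optional


def helm_install_args(wi_sa: str, hf_token: Optional[str] = None) -> List[str]:
    """Assemble the helm command, omitting hfToken flags without a token."""
    args = [
        "helm", "upgrade", "--install", "sie",
        "oci://ghcr.io/superlinked/charts/sie-cluster",
        "--version", "0.3.4",
        "-n", "sie", "--create-namespace",
        # Dots inside the annotation key are escaped so Helm keeps them literal.
        "--set", f"serviceAccount.annotations.iam\\.gke\\.io/gcp-service-account={wi_sa}",
        "--set", "workers.pools.l4.enabled=true",
        "--set", "workers.pools.l4.minReplicas=1",
    ]
    if hf_token:  # None and "" both skip the flags
        args += [
            "--set", "hfToken.create=true",
            "--set", f"hfToken.value={hf_token}",
        ]
    return args


# The service-account email here is a placeholder.
args = helm_install_args("sie@my-project.iam.gserviceaccount.com", os.environ.get("HF_TOKEN"))
print(" ".join(args))
```

The list can be handed to `subprocess.run(args, check=True)`, which avoids shell quoting issues with the escaped annotation key.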

# custom-values.yaml
gateway:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models
  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true

helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie \
  -f custom-values.yaml

# Check pods
kubectl get pods -n sie

# Check gateway logs
kubectl logs -n sie -l app.kubernetes.io/component=gateway

# Port-forward the gateway and run a smoke test
kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &

# Install the Python SDK (requires Python 3.12; see the SDK README for newer/older Python notes)
pip install sie-sdk

python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.

helm uninstall sie -n sie
terraform destroy

  • Ingress controller: use ingress-nginx for public or private access.
  • Public vs private: set ingress-nginx service annotations for internal LBs on GKE.
  • Auth options:
    • OIDC (oauth2-proxy) with external IdP or Dex.
    • Static token (gateway-level) for OSS/self-hosted without IdP.
    • No auth + private ingress (internal LB).

# Static token mode for self-hosted clusters
kubectl create secret generic sie-auth-tokens -n sie \
  --from-literal=SIE_AUTH_TOKEN="key1,key2,key3"

helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie \
  --set gateway.auth.mode=static \
  --set gateway.auth.tokenSecretName=sie-auth-tokens

Debug-only access via port-forward is still possible, but production paths should use ingress.
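
For static token mode, the placeholder keys should be replaced with random values. The sketch below generates a comma-separated value suitable for the SIE_AUTH_TOKEN literal. It assumes the gateway accepts each comma-separated entry as a separate token, which is inferred from the "key1,key2,key3" example rather than confirmed; `make_token_literal` is a hypothetical helper.

```python
import secrets


def make_token_literal(count: int = 3, nbytes: int = 32) -> str:
    """Return `count` random URL-safe tokens joined by commas."""
    # token_urlsafe uses only letters, digits, "-" and "_", so the
    # comma separator cannot appear inside a token.
    tokens = [secrets.token_urlsafe(nbytes) for _ in range(count)]
    return ",".join(tokens)


literal = make_token_literal()
print(len(literal.split(",")))  # 3
```

Avoid echoing the generated value into shell history or logs; pass it to `kubectl create secret` through a variable or a file.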

