---
title: Kubernetes in GCP
description: Deploy SIE on Google Kubernetes Engine with GPU autoscaling.
canonical_url: https://superlinked.com/docs/deployment/cloud-gcp
last_updated: 2026-05-20
---

Deploy SIE to GKE with GPU node pools, KEDA autoscaling, and Terraform automation.

## Architecture

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

SIE runs as a gateway/config/worker architecture on Kubernetes:

![GKE cluster architecture with Gateway, Config service, L4 and A100 worker pools, KEDA, and Prometheus](/diagrams/gke-arch.svg)

**Components:**
- **Gateway** - Stateless Rust inference edge that routes requests to GPU-specific worker pools through NATS JetStream
- **Config service** - Single-writer control plane for runtime model configuration
- **Worker Pools** - StatefulSets grouped by GPU type (L4, A100-40GB, A100-80GB)
- **KEDA** - Scales worker pools from zero based on queue depth metrics
- **Prometheus** - Provides metrics for autoscaling decisions

---

## Gateway

Source: [packages/sie_gateway/src/handlers/proxy.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/proxy.rs)

The gateway is a stateless Rust service that handles GPU-aware routing:

| Feature | Description |
|---------|-------------|
| GPU Routing | Routes requests to the appropriate GPU pool via the `X-SIE-MACHINE-PROFILE` header |
| Pool Routing | Supports tenant isolation via the `X-SIE-Pool` header |
| Queue Routing | Publishes work to the selected pool's NATS JetStream queue |
| Config Reads | Mirrors model and bundle state from `sie-config` |
| 202 Responses | Returns `Retry-After` when GPU capacity is provisioning |

The gateway runs as a Deployment with 2+ replicas for high availability.

```yaml
gateway:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
```

---

## Worker Pools

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

Each GPU type runs as a separate StatefulSet with persistent storage for model caching.

| Pool | GPU | VRAM | Use Case |
|------|-----|------|----------|
| `l4` | NVIDIA L4 | 24GB | Standard inference, best price/performance |
| `a100-40gb` | NVIDIA A100 | 40GB | Large models, high throughput |
| `a100-80gb` | NVIDIA A100 | 80GB | Very large models (7B+ parameters) |

Worker configuration:

```yaml
workers:
  pools:
    l4:
      enabled: true
      minReplicas: 0        # Scale to zero when idle
      maxReplicas: 10
      gpuType: l4
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      gpu:
        count: 1
        product: NVIDIA-L4
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
```

Workers use a 300Gi emptyDir volume for model cache. Models load on first request.

---

## GPU Selection

Source: [packages/sie_gateway/src/handlers/proxy.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/proxy.rs)

Specify the target GPU type using the `X-SIE-MACHINE-PROFILE` header or SDK parameter.

### HTTP Header

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -d '{"items": [{"text": "Hello world"}]}'
```

### SDK Parameter

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.example.com")

# Route to L4 pool
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="l4"
)

# Route to A100 pool for large models
result = client.encode(
    "intfloat/e5-mistral-7b-instruct",
    Item(text="Hello world"),
    gpu="a100-40gb"
)
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://sie.example.com");

// Route to L4 pool
let result = await client.encode(
  "BAAI/bge-m3",
  { text: "Hello world" },
  { gpu: "l4" },
);

// Route to A100 pool for large models
result = await client.encode(
  "intfloat/e5-mistral-7b-instruct",
  { text: "Hello world" },
  { gpu: "a100-40gb" },
);
```

### Available GPU Types

| GPU Type | Header Value | Machine Type |
|----------|--------------|--------------|
| NVIDIA L4 | `l4` | g2-standard-8 |
| NVIDIA A100 40GB | `a100-40gb` | a2-highgpu-1g |
| NVIDIA A100 80GB | `a100-80gb` | a2-ultragpu-1g |

---

## Resource Pools

Source: [packages/sie_gateway/src/handlers/pools.rs](https://github.com/superlinked/sie/blob/main/packages/sie_gateway/src/handlers/pools.rs)

Resource pools provide tenant isolation by reserving dedicated workers.

### Create a Pool via SDK

Declare a pool explicitly with `create_pool`. The pool itself is created lazily, on the first request that targets it:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

# Client with dedicated pool (2 L4 workers reserved)
client = SIEClient("http://sie.example.com")
client.create_pool("tenant-abc", {"l4": 2})

# First request creates the pool, subsequent requests reuse it
result = client.encode(
    "BAAI/bge-m3",
    Item(text="Hello world"),
    gpu="tenant-abc/l4"  # pool_name/gpu_type
)

# Check pool status
info = client.get_pool("tenant-abc")
print(f"Pool {info['name']}: {info['status']['state']}")

# Explicit cleanup (optional - pools are GC'd after inactivity)
client.delete_pool("tenant-abc")
```

### Route to Pool via HTTP

Use the `X-SIE-Pool` header:

```bash
curl -X POST http://sie.example.com/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "X-SIE-MACHINE-PROFILE: l4" \
  -H "X-SIE-Pool: tenant-abc" \
  -d '{"items": [{"text": "Hello world"}]}'
```

The SDK handles lease renewal automatically. Pools are garbage collected after inactivity.
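
The SDK and curl examples above suggest that a `pool_name/gpu_type` value corresponds to the two routing headers. The sketch below shows that mapping as inferred from those examples. `routing_headers` is a hypothetical helper, not part of `sie_sdk`, and the real SDK may build its requests differently.

```python
from typing import Dict


def routing_headers(gpu: str) -> Dict[str, str]:
    """Translate a gpu value such as "l4" or "tenant-abc/l4" into headers.

    Hypothetical helper for illustration; not part of sie_sdk.
    """
    # Split on the last "/" so the GPU type is always the final segment.
    pool, sep, profile = gpu.rpartition("/")
    if not profile:
        raise ValueError(f"missing GPU type in {gpu!r}")
    headers = {"X-SIE-MACHINE-PROFILE": profile}
    if sep:
        if not pool:
            raise ValueError(f"missing pool name in {gpu!r}")
        headers["X-SIE-Pool"] = pool
    return headers


print(routing_headers("tenant-abc/l4"))
```

A bare GPU type such as `"l4"` yields only the machine-profile header, which matches the plain GPU selection request shown earlier on this page.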

---

## KEDA Autoscaling

Source: [deploy/helm/sie-cluster/values.yaml](https://github.com/superlinked/sie/blob/main/deploy/helm/sie-cluster/values.yaml)

KEDA scales worker pools based on queue depth metrics from Prometheus.

### Scale-from-Zero

When no workers are running and a request arrives:

1. Gateway returns `202 Accepted` with `Retry-After: 120` header
2. Gateway records pending demand metric
3. KEDA detects queue depth > activation threshold
4. GKE provisions a GPU node (typically 2-5 minutes; see Cold Start Expectations below)
5. Worker pod starts and registers with the gateway
6. Client retries after the `Retry-After` interval until the request succeeds
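
For clients that do not use the SDK, the retry side of this flow can be sketched as below. This is a minimal illustration, not the SDK's implementation. `send_with_retry`, the `(status, headers, body)` tuple and the injected `send` callable are all assumptions made so the loop stays independent of any HTTP library. It also assumes `Retry-After` carries a number of seconds, as in step 1.

```python
import time
from typing import Callable, Dict, Tuple

# (status_code, headers, body): a simplified response shape for this sketch.
Response = Tuple[int, Dict[str, str], bytes]


def send_with_retry(
    send: Callable[[], Response],
    timeout_s: float = 600.0,
    default_wait_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it stops returning 202 or the wait budget is spent."""
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            # Capacity is available (or a real error occurred): hand it back.
            return status, headers, body
        # Assumes Retry-After is expressed in seconds, not as an HTTP date.
        wait = float(headers.get("Retry-After", default_wait_s))
        if waited + wait > timeout_s:
            raise TimeoutError(f"no capacity after waiting {waited:.0f}s")
        sleep(wait)
        waited += wait
```

The Python SDK call in the Verify section passes `wait_for_capacity=True` and `provision_timeout_s=600`, which covers the same waiting behavior without a hand-written loop.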

### Configuration

```yaml
autoscaling:
  enabled: true
  prometheusAddress: http://prometheus-operated.monitoring.svc:9090
  pollingInterval: 15          # Check metrics every 15s
  cooldownPeriod: 900          # Wait 15 min before scaling to zero
  scaleDownStabilization: 300  # 5 min stabilization window
  queueDepthThreshold: 10      # Scale up at 10 pending requests/pod
  queueDepthActivation: 2      # Activate from zero at 2 requests
  fallbackReplicas: 2          # Fallback if Prometheus unavailable
```
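
To reason about these values, it can help to model the replica count the autoscaler settles on. The sketch below is a simplified approximation. It assumes the chart passes `queueDepthThreshold` to KEDA as a per-pod average target and `queueDepthActivation` as the activation value. It ignores polling delay, the stabilization window, HPA tolerance and `fallbackReplicas`. `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    queue_depth: int,
    threshold: int = 10,    # queueDepthThreshold: pending requests per pod
    activation: int = 2,    # queueDepthActivation: must be exceeded to leave zero
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Approximate steady-state replica count for a given total queue depth."""
    if queue_depth <= activation:
        # Inactive: the pool settles at minReplicas once cooldownPeriod elapses.
        return min_replicas
    # Active: one pod per `threshold` pending requests, rounded up.
    wanted = math.ceil(queue_depth / threshold)
    return max(1, min_replicas, min(max_replicas, wanted))


for depth in (0, 2, 3, 25, 500):
    print(depth, desired_replicas(depth))
```

Under these assumptions, a queue of 25 pending requests maps to 3 workers, and any depth of 2 or less lets the pool return to zero after the cooldown.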

### Cold Start Expectations

When scaling from zero, expect these timelines:

| Phase | Duration | What Happens |
|-------|----------|--------------|
| Node provisioning | 2-5 min | GKE provisions a GPU node (spot capacity may take longer) |
| Container startup | 20-40s | Pull image, start process |
| Model loading | 10-120s | Load weights to GPU (from cache or HuggingFace) |

**Total: 3-7 minutes** from first request to first response. See [Scale-from-Zero](/docs/deployment/autoscaling/) for the full flow and troubleshooting.
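
The total is simply the sum of the phase ranges. As a quick check when choosing a client timeout, the arithmetic can be written out as follows. The durations are copied from the table above; `cold_start_bounds` is a hypothetical helper name.

```python
# (phase, min_seconds, max_seconds), taken from the table above.
PHASES = [
    ("Node provisioning", 2 * 60, 5 * 60),
    ("Container startup", 20, 40),
    ("Model loading", 10, 120),
]


def cold_start_bounds(phases):
    """Return (best_case, worst_case) cold-start time in seconds."""
    best = sum(low for _, low, _ in phases)
    worst = sum(high for _, _, high in phases)
    return best, worst


best, worst = cold_start_bounds(PHASES)
print(f"{best / 60:.1f}-{worst / 60:.1f} min")  # prints "2.5-7.7 min"
```

The worst case is 460 seconds, so the `provision_timeout_s=600` used in the smoke test on this page leaves some headroom.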

### Cost Optimization

GPU nodes scale to zero during idle periods. Configure cooldown based on your traffic patterns:

- **Consistent traffic**: Lower cooldown (300s) for responsive scaling
- **Bursty traffic**: Higher cooldown (900s) to avoid thrashing
- **Dev/test**: Use spot instances for 60-70% cost savings

---

## Terraform Setup

The `examples/dev-l4-spot` example in [`superlinked/terraform-google-sie`](https://github.com/superlinked/terraform-google-sie) provisions a complete GKE cluster with an L4 spot GPU pool via the published `superlinked/sie/google` Terraform registry module.

### Prerequisites

1. GCP project with billing enabled.
2. GPU quota for `nvidia-l4` in your region:

   ```bash
   gcloud compute regions describe REGION \
     --format='table(quotas.filter(metric:NVIDIA))'
   ```

   The `dev-l4-spot` example uses spot, so look for `PREEMPTIBLE_NVIDIA_L4_GPUS`. A quota of at least 5 covers the example's maximum of 5 nodes × 1 GPU each.

3. Required APIs enabled:

   ```bash
   gcloud services enable \
     container.googleapis.com \
     compute.googleapis.com \
     artifactregistry.googleapis.com \
     iam.googleapis.com
   ```

4. Authenticated:

   ```bash
   gcloud auth application-default login
   ```

### Initialize

```bash
git clone https://github.com/superlinked/terraform-google-sie.git
cd terraform-google-sie/examples/dev-l4-spot

# Set project ID
export TF_VAR_project_id="your-project-id"

# Initialize Terraform
terraform init
```

### Plan and Apply

```bash
# Review changes
terraform plan

# Deploy cluster (15-20 minutes)
terraform apply
```

### Configure kubectl

```bash
# Get credentials
$(terraform output -raw kubectl_command)

# Verify cluster
kubectl get nodes
```

### Variables

Key configuration options for the `superlinked/sie/google` module:

| Variable | Default | Description |
|----------|---------|-------------|
| `project_id` | (required) | GCP project ID |
| `region` | `us-central1` | GKE cluster region |
| `cluster_name` | `sie-dev` | Name of the GKE cluster |
| `gpu_node_pools` | L4 pool | List of GPU node pool configurations |
| `create_artifact_registry` | `true` | Provision an Artifact Registry for custom images |
| `deployer_service_account` | `""` | Email of the SA running Terraform (optional, for CI/CD) |

### Example: Production Multi-GPU

```hcl
module "sie_gke" {
  source  = "superlinked/sie/google"
  version = "0.3.4"

  project_id   = "my-project"
  region       = "us-central1"
  cluster_name = "sie-prod"

  gpu_node_pools = [
    {
      name           = "l4-pool"
      machine_type   = "g2-standard-8"
      gpu_type       = "nvidia-l4"
      gpu_count      = 1
      min_node_count = 1    # Keep 1 warm
      max_node_count = 20
      spot           = false
    },
    {
      name           = "a100-pool"
      machine_type   = "a2-highgpu-1g"
      gpu_type       = "nvidia-tesla-a100"
      gpu_count      = 1
      min_node_count = 0
      max_node_count = 10
      spot           = true
    }
  ]
}
```
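
Before applying a multi-pool configuration, it is worth checking that regional GPU quota covers the peak node counts. Spot and on-demand GPUs draw from separate quotas, as the `PREEMPTIBLE_` prefix in the prerequisites indicates. The sketch below sums the peak demand per GPU type and capacity class for the pools in this example. `required_gpu_quota` is a hypothetical helper, and the dictionaries mirror only the module fields needed for the calculation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def required_gpu_quota(pools: List[dict]) -> Dict[Tuple[str, bool], int]:
    """Sum peak GPU demand per (gpu_type, spot) pair."""
    totals: Dict[Tuple[str, bool], int] = defaultdict(int)
    for pool in pools:
        key = (pool["gpu_type"], pool["spot"])
        # Peak demand is every node at max scale, times GPUs per node.
        totals[key] += pool["max_node_count"] * pool["gpu_count"]
    return dict(totals)


# Values copied from the gpu_node_pools example above.
pools = [
    {"gpu_type": "nvidia-l4", "gpu_count": 1, "max_node_count": 20, "spot": False},
    {"gpu_type": "nvidia-tesla-a100", "gpu_count": 1, "max_node_count": 10, "spot": True},
]

for (gpu_type, spot), gpus in required_gpu_quota(pools).items():
    print(f"{gpu_type} ({'spot' if spot else 'on-demand'}): {gpus} GPUs")
```

For this example the result is 20 on-demand L4 GPUs and 10 spot A100 GPUs, which can be compared against the output of the quota command in the prerequisites.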

---

## Helm Installation

Deploy SIE to an existing GKE cluster using Helm. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to `install: false`. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and the observability stack described elsewhere on this page, add the following to the install command:

```bash
--set keda.install=true \
--set autoscaling.enabled=true \
--set kube-prometheus-stack.install=true \
--set dcgm-exporter.install=true
```

### Prerequisites

- GKE cluster with GPU node pools (the Terraform setup above creates this)
- `HF_TOKEN` exported if you need gated models. Optional for the `BAAI/bge-m3` smoke test; in that case, omit **both** `--set hfToken.create=true` and `--set hfToken.value=...` entirely (leaving `HF_TOKEN` unset with the flags present creates an empty-token secret that will fail later on any gated-model request).

### Install

Extract the Workload Identity service-account email from the Terraform output and wire it into the chart via `--set`. The example also enables the L4 worker pool explicitly, because the chart's worker pools default to `enabled: false`.

```bash
# The `workload_identity_annotation` output is the full `key=email` pair;
# strip the prefix to get just the SA email for the --set value.
WI_SA=$(terraform output -raw workload_identity_annotation | cut -d= -f2)

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.iam\.gke\.io/gcp-service-account=$WI_SA" \
  --set workers.pools.l4.enabled=true \
  --set workers.pools.l4.minReplicas=1 \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w
```

`minReplicas: 1` keeps one L4 worker always running, which is the simplest path to a working smoke test without KEDA installed. For true scale-from-zero, additionally pass `--set keda.install=true --set autoscaling.enabled=true` and set `minReplicas: 0`.
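
When the install runs from automation, the `HF_TOKEN` rule from the prerequisites (pass both `hfToken` flags or neither) is easy to get wrong. The sketch below builds the same argument list as the command above and adds the token flags only when a token is present. `helm_install_args` is a hypothetical helper; the chart reference, version and `--set` keys are copied from the install command.

```python
from typing import List, Optional


def helm_install_args(wi_sa: str, hf_token: Optional[str]) -> List[str]:
    """Build the helm argv, adding the hfToken flags only when a token exists."""
    args = [
        "helm", "upgrade", "--install", "sie",
        "oci://ghcr.io/superlinked/charts/sie-cluster",
        "--version", "0.3.4",
        "-n", "sie", "--create-namespace",
        # Dots inside the annotation key are escaped for helm's --set parser.
        "--set", rf"serviceAccount.annotations.iam\.gke\.io/gcp-service-account={wi_sa}",
        "--set", "workers.pools.l4.enabled=true",
        "--set", "workers.pools.l4.minReplicas=1",
    ]
    if hf_token:
        # Both flags travel together; an unset or empty token adds neither.
        args += ["--set", "hfToken.create=true", "--set", f"hfToken.value={hf_token}"]
    return args


print(" ".join(helm_install_args("sie@my-project.iam.gserviceaccount.com", None)))
```

The list can be handed to `subprocess.run(args, check=True)` with the token read from `os.environ.get("HF_TOKEN")`. The service-account email in the example call is a placeholder.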

### Custom Values

```yaml
# custom-values.yaml
gateway:
  replicas: 3

workers:
  common:
    bundle: default
    cacheVolumeSize: 100Gi
    clusterCache:
      enabled: true
      url: gs://my-bucket/models

  pools:
    l4:
      enabled: true
      minReplicas: 1
      maxReplicas: 20

autoscaling:
  enabled: true
  cooldownPeriod: 300

ingress:
  enabled: true
  host: sie.example.com
  tls:
    enabled: true
    secretName: sie-tls

auth:
  enabled: true
  oauth2Proxy:
    oidcIssuerUrl: https://auth.example.com/realms/sie

serviceMonitor:
  enabled: true
```

### Upgrade

```bash
helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie
```

### Verify

```bash
# Check pods
kubectl get pods -n sie

# Check gateway logs
kubectl logs -n sie -l app.kubernetes.io/component=gateway

# Port-forward the gateway and run a smoke test
kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &

# Install the Python SDK (requires Python 3.12; see the SDK README for notes on other Python versions)
pip install sie-sdk

python3 -c "
from sie_sdk import SIEClient

client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"
```

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See [Scale-from-Zero](/docs/deployment/autoscaling/) for the full flow.

### Cleanup

```bash
helm uninstall sie -n sie
terraform destroy
```

### Access + Auth

- **Ingress controller**: use ingress-nginx for public or private access.
- **Public vs private**: set ingress-nginx service annotations for internal LBs on GKE.
- **Auth options**:
  - OIDC (oauth2-proxy) with external IdP or Dex.
  - Static token (gateway-level) for OSS/self-hosted without IdP.
  - No auth + private ingress (internal LB).

```bash
# Static token mode for self-hosted clusters
kubectl create secret generic sie-auth-tokens -n sie \
  --from-literal=SIE_AUTH_TOKEN="key1,key2,key3"

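# --reuse-values keeps the settings from the original install
# (enabled worker pools, service-account annotation); without it,
# passing --set resets every other value to the chart defaults.
helm upgrade sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie \
  --reuse-values \
  --set gateway.auth.mode=static \
  --set gateway.auth.tokenSecretName=sie-auth-tokens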
```

Debug-only access via port-forward is still possible, but production paths should use ingress.

---

## What's Next

- [Upgrade Runbook](/docs/deployment/upgrades/) - pre-upgrade checklist, rolling updates, and rollback
- [Scale-from-Zero](/docs/deployment/autoscaling/) - understanding the 202 flow and cold starts
- [Kubernetes in AWS](/docs/deployment/cloud-aws/) - equivalent EKS deployment
- [Monitoring & Observability](/docs/deployment/monitoring/) - metrics, logging, and dashboards