Kubernetes in AWS
Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.
Architecture
The architecture mirrors the GCP deployment, with a gateway/config/worker setup and KEDA autoscaling:
Components:
- EKS Cluster with managed node groups for GPU instances
- NVIDIA Device Plugin for GPU scheduling
- IRSA (IAM Roles for Service Accounts) for S3 access
- KEDA for autoscaling based on queue depth metrics
- Prometheus + Grafana + DCGM Exporter for observability
Terraform Setup
The examples/dev-g6-spot example in superlinked/terraform-aws-sie consumes the published superlinked/sie/aws Terraform registry module, the same module used in production deployments, pinned to a known-good version.
Prerequisites
- AWS account with appropriate permissions.
- EC2 quota for g6.2xlarge (NVIDIA L4) in your target region (default: eu-central-1). AWS sets quotas for the G and VT instance families by total vCPU, separately for on-demand and spot. The dev-g6-spot example uses spot, so check All G and VT Spot Instance Requests (quota code L-3819A6DF):

      aws service-quotas list-service-quotas --service-code ec2 --region eu-central-1 \
        --query 'Quotas[?QuotaCode==`L-3819A6DF`].{Name:QuotaName,Value:Value}' \
        --output table

  g6.2xlarge has 8 vCPUs per node; the example scales 0–5 nodes, so any quota ≥ 40 is sufficient.
- Terraform >= 1.14 and AWS CLI v2 configured.
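The quota arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `required_spot_vcpus` and `quota_is_sufficient` are hypothetical, and the figures (8 vCPUs per g6.2xlarge node, 5 nodes maximum) are the ones stated for the dev-g6-spot example.

```python
def required_spot_vcpus(vcpus_per_node: int, max_nodes: int) -> int:
    """Return the vCPU quota needed to run max_nodes nodes at the same time."""
    if vcpus_per_node <= 0 or max_nodes < 0:
        raise ValueError("vcpus_per_node must be positive and max_nodes non-negative")
    return vcpus_per_node * max_nodes


def quota_is_sufficient(quota_value: float, vcpus_per_node: int, max_nodes: int) -> bool:
    """Compare a quota value (as reported by Service Quotas) with the requirement."""
    return quota_value >= required_spot_vcpus(vcpus_per_node, max_nodes)


# Example values from this guide: g6.2xlarge (8 vCPUs), gpu_max_size = 5.
print(required_spot_vcpus(8, 5))  # 40
```

If you raise gpu_max_size or switch to a larger instance type, the required quota grows proportionally.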
Deploy

    git clone https://github.com/superlinked/terraform-aws-sie.git
    cd terraform-aws-sie/examples/dev-g6-spot

    # Initialize and apply (creates an EKS cluster, ~15-20 min)
    terraform init
    terraform apply

The example main.tf pins the module version:

    module "sie_eks" {
      source  = "superlinked/sie/aws"
      version = "0.3.4"

      aws_region        = var.aws_region
      project_name      = var.project_name
      gpu_instance_type = "g6.2xlarge"
      gpu_capacity_type = "SPOT"
      gpu_min_size      = 0
      gpu_max_size      = 5
    }

For multi-GPU production setups, use the gpu_node_groups list variable instead of the single-GPU gpu_* variables. See the module variables reference.
If your AWS account already manages SIE ECR repos from another stack (e.g. a shared CI account or a previous deployment), set create_ecr_repositories = false on the module call to skip ECR resource creation. The module still emits the ecr_*_repository_url outputs from caller identity + repo names, so IRSA / Helm wiring is unchanged either way.
What Gets Created
The Terraform module provisions:
| Resource | Purpose |
|---|---|
| EKS Cluster | Kubernetes control plane |
| GPU Node Group | Auto-scaling g6.2xlarge L4 spot instances (0–5 nodes) |
| NVIDIA Device Plugin | GPU scheduling in Kubernetes |
| IRSA Role | Workload identity for SIE pods (no static AWS credentials) |
| ECR Repositories | Created for optional custom images. The chart pulls public images from GHCR by default. |
Helm Installation
Once the cluster is up, configure kubectl and install the sie-cluster chart. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to install: false. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and observability stack, add --set keda.install=true --set autoscaling.enabled=true --set kube-prometheus-stack.install=true --set dcgm-exporter.install=true to the install command.

    # Configure kubectl from the terraform output
    $(terraform output -raw kubectl_config_command)

    # Install SIE (pulls the chart from GHCR, wires up IRSA from the terraform output)
    # `workers.pools.l4.enabled=true` is required; the chart's pools default to enabled: false.
    IRSA_ARN=$(terraform output -raw sie_irsa_role_arn)

    helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
      --version 0.3.4 \
      -n sie --create-namespace \
      --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$IRSA_ARN" \
      --set workers.pools.l4.enabled=true \
      --set workers.pools.l4.minReplicas=1 \
      --set hfToken.create=true \
      --set hfToken.value="$HF_TOKEN"

    # Wait for rollout
    kubectl -n sie get pods -w

Set HF_TOKEN beforehand if you need gated models. For the smoke test below (BAAI/bge-m3) it is optional; in that case, omit both --set hfToken.create=true and --set hfToken.value=... entirely (leaving HF_TOKEN unset with the flags present creates an empty-token secret that will fail later on any gated-model request).
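One way to avoid the empty-token secret is to build the install command programmatically and add the hfToken flags only when a token is actually set. The following is a minimal sketch under that assumption; `helm_install_args` is a hypothetical helper, not part of the chart or the SDK, and it reuses only the flags shown in the command above.

```python
import shlex
from typing import Optional


def helm_install_args(irsa_arn: str, hf_token: Optional[str]) -> list:
    """Build the helm argument list, omitting hfToken flags when no token is set."""
    args = [
        "helm", "upgrade", "--install", "sie",
        "oci://ghcr.io/superlinked/charts/sie-cluster",
        "--version", "0.3.4",
        "-n", "sie", "--create-namespace",
        # Dots inside the annotation key are escaped so helm treats it as one key.
        "--set", rf"serviceAccount.annotations.eks\.amazonaws\.com/role-arn={irsa_arn}",
        "--set", "workers.pools.l4.enabled=true",
        "--set", "workers.pools.l4.minReplicas=1",
    ]
    if hf_token:  # None and "" both skip the secret entirely
        args += ["--set", "hfToken.create=true", "--set", f"hfToken.value={hf_token}"]
    return args


# Placeholder ARN for illustration only; use the terraform output in practice.
print(shlex.join(helm_install_args("arn:aws:iam::123456789012:role/example", None)))
```

The resulting list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting issues with the escaped annotation key.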
minReplicas: 1 keeps one L4 worker always running — the simplest path to a working smoke test without KEDA. For scale-from-zero, additionally pass --set keda.install=true --set autoscaling.enabled=true and set minReplicas: 0.
Smoke Test

    kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &

    # Install the Python SDK (requires Python 3.12; see the SDK README for newer/older Python notes)
    pip install sie-sdk

    python3 -c "from sie_sdk import SIEClient

    client = SIEClient('http://localhost:8080')
    result = client.encode('BAAI/bge-m3', {'text': 'hello world'}, gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
    print(result['dense'].shape)  # (1024,)"

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.
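Because the port-forward is started in the background, it may not be accepting connections yet when the Python client runs. A minimal standard-library sketch of waiting for the local port first; `wait_for_port` is a hypothetical helper and not part of sie-sdk, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling `wait_for_port("localhost", 8080)` before creating the client distinguishes "port-forward not up" from a genuinely slow cold start. It only checks that the TCP port is open, not that a worker is ready to serve the model.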
Cleanup

    helm uninstall sie -n sie
    terraform destroy

Differences from GCP
| Feature | GCP (GKE) | AWS (EKS) |
|---|---|---|
| GPU scheduling | Native GKE support | NVIDIA Device Plugin required |
| IAM for pods | Workload Identity | IRSA |
| Model cache storage | GCS (gs://) | S3 (s3://) |
| Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler |
| Spot instances | Spot VMs | Spot Instances |
S3 for Model Cache
Configure the cluster cache to use S3:

    workers:
      common:
        clusterCache:
          enabled: true
          url: s3://my-bucket/models

IRSA handles authentication automatically - no access keys needed in the pod.
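Before putting a cache URL into the values file, it can help to check that it has the expected s3://bucket/prefix shape, for example to compare the bucket name against the policy attached to the IRSA role. A small sketch using only the standard library; `split_s3_url` is a hypothetical helper and not part of the chart.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple:
    """Split s3://bucket/prefix into (bucket, prefix); raise ValueError otherwise."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://<bucket>/<prefix>, got {url!r}")
    # The path keeps its leading slash; strip it to get the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_url("s3://my-bucket/models"))  # ('my-bucket', 'models')
```

A gs:// URL copied over from the GCP deployment raises ValueError here, which catches that mistake before the chart is installed.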
Security Considerations
The default Terraform configuration exposes the API endpoint publicly. For production:
- Restrict ingress to your VPC CIDR or specific IP ranges
- Enable authentication via oauth2-proxy or static tokens
- Use a private load balancer for internal-only access:

      ingress:
        enabled: true
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-internal: "true"

Docker on AWS (Alternative)
For simpler deployments, run SIE directly on a GPU EC2 instance:

    # On a g6.xlarge (NVIDIA L4) instance
    sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker

    docker run --gpus all -p 8080:8080 \
      -v ~/.cache/huggingface:/app/.cache/huggingface \
      ghcr.io/superlinked/sie-server:latest-cuda12-default

This is simpler than EKS and suitable for single-instance production workloads.
What’s Next
- Upgrade Runbook - pre-upgrade checklist, rolling updates, and rollback
- Scale-from-Zero - understanding the 202 flow and cold starts
- Monitoring - metrics, alerts, and dashboards
- Kubernetes in GCP - equivalent GKE deployment