---
title: Kubernetes in AWS
description: Deploy SIE on Amazon EKS with GPU autoscaling.
canonical_url: https://superlinked.com/docs/deployment/cloud-aws
last_updated: 2026-05-18
---

Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.

## Architecture

The architecture mirrors the [GCP deployment](/docs/deployment/cloud-gcp/), with a gateway/config/worker setup and KEDA autoscaling:

![EKS cluster architecture with Gateway, Config service, L4 and A100 worker pools, KEDA, and Prometheus](/diagrams/eks-arch.svg)

**Components:**
- **EKS Cluster** with managed node groups for GPU instances
- **NVIDIA Device Plugin** for GPU scheduling
- **IRSA** (IAM Roles for Service Accounts) for S3 access
- **KEDA** for autoscaling based on queue depth metrics
- **Prometheus + Grafana + DCGM Exporter** for observability

---

## Terraform Setup

The `examples/dev-g6-spot` example in [`superlinked/terraform-aws-sie`](https://github.com/superlinked/terraform-aws-sie) consumes the published `superlinked/sie/aws` Terraform registry module, the same module used in production deployments, pinned to a known-good version.

### Prerequisites

1. AWS account with appropriate permissions.
2. EC2 quota for `g6.2xlarge` (NVIDIA L4) in your target region (default: `eu-central-1`). AWS applies quotas to the G and VT instance families by total vCPU count, with separate limits for On-Demand and Spot. The `dev-g6-spot` example uses Spot, so check `All G and VT Spot Instance Requests` (quota code `L-3819A6DF`):

   ```bash
   aws service-quotas list-service-quotas --service-code ec2 --region eu-central-1 \
     --query 'Quotas[?QuotaCode==`L-3819A6DF`].{Name:QuotaName,Value:Value}' \
     --output table
   ```

   `g6.2xlarge` is 8 vCPU per node; the example scales 0–5 nodes, so anything ≥ 40 is sufficient.

3. Terraform >= 1.14 and AWS CLI v2 configured.
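
The quota arithmetic from step 2 can be sketched in a few lines of Python. This is a minimal illustration rather than part of the module: the function names are hypothetical, and the inputs (8 vCPUs per `g6.2xlarge`, up to 5 nodes) are the values used by the example.

```python
def required_vcpu_quota(vcpus_per_node: int, max_nodes: int) -> int:
    """Total vCPUs consumed when the GPU node group is fully scaled out."""
    if vcpus_per_node <= 0 or max_nodes < 0:
        raise ValueError("vcpus_per_node must be positive and max_nodes non-negative")
    return vcpus_per_node * max_nodes


def quota_is_sufficient(current_quota: float, vcpus_per_node: int, max_nodes: int) -> bool:
    """Compare the quota value reported by Service Quotas against the requirement."""
    return current_quota >= required_vcpu_quota(vcpus_per_node, max_nodes)


# g6.2xlarge has 8 vCPUs and the example scales to at most 5 nodes.
print(required_vcpu_quota(8, 5))        # 40
print(quota_is_sufficient(32, 8, 5))    # False
```

If you raise `gpu_max_size` or switch to a larger instance type, rerun the same calculation before applying.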

### Deploy

```bash
git clone https://github.com/superlinked/terraform-aws-sie.git
cd terraform-aws-sie/examples/dev-g6-spot

# Initialize and apply (creates an EKS cluster, ~15-20 min)
terraform init
terraform apply
```

The example `main.tf` pins the module version:

```hcl
module "sie_eks" {
  source  = "superlinked/sie/aws"
  version = "0.3.4"

  aws_region        = var.aws_region
  project_name      = var.project_name
  gpu_instance_type = "g6.2xlarge"
  gpu_capacity_type = "SPOT"
  gpu_min_size      = 0
  gpu_max_size      = 5
}
```

For multi-GPU production setups, use the `gpu_node_groups` list variable instead of the single-GPU `gpu_*` variables. See the [module variables reference](https://github.com/superlinked/terraform-aws-sie/blob/main/variables.tf).

If your AWS account already manages SIE ECR repos from another stack (e.g. a shared CI account or a previous deployment), set `create_ecr_repositories = false` on the module call to skip ECR resource creation. The module still emits the `ecr_*_repository_url` outputs, derived from the caller identity and the repository names, so the IRSA and Helm wiring is unchanged either way.
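
To see why those outputs can be derived without creating the repositories, recall that a private ECR repository URL is fully determined by the account ID, the region, and the repository name. The sketch below shows that standard format for commercial AWS regions. The function name and the `sie-server` repository name are hypothetical, and the module's own naming may differ.

```python
def ecr_repository_url(account_id: str, region: str, repo_name: str) -> str:
    """Build the URL of a private ECR repository (commercial AWS regions)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}"


# Hypothetical account ID and repository name, for illustration only.
print(ecr_repository_url("123456789012", "eu-central-1", "sie-server"))
```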

### What Gets Created

The Terraform module provisions:

| Resource | Purpose |
|----------|---------|
| EKS Cluster | Kubernetes control plane |
| GPU Node Group | Auto-scaling `g6.2xlarge` L4 spot instances (0–5 nodes) |
| NVIDIA Device Plugin | GPU scheduling in Kubernetes |
| IRSA Role | Workload identity for SIE pods (no static AWS credentials) |
| ECR Repositories | Created for optional custom images. The chart pulls public images from GHCR by default. |

---

## Helm Installation

Once the cluster is up, configure `kubectl` and install the `sie-cluster` chart. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to `install: false`. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and observability stack, add `--set keda.install=true --set autoscaling.enabled=true --set kube-prometheus-stack.install=true --set dcgm-exporter.install=true` to the install command.

```bash
# Configure kubectl from the terraform output
$(terraform output -raw kubectl_config_command)

# Install SIE (pulls the chart from GHCR, wires up IRSA from the terraform output)
# `workers.pools.l4.enabled=true` is required because the chart's pools default to enabled: false.
IRSA_ARN=$(terraform output -raw sie_irsa_role_arn)

helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$IRSA_ARN" \
  --set workers.pools.l4.enabled=true \
  --set workers.pools.l4.minReplicas=1 \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"

# Wait for rollout
kubectl -n sie get pods -w
```

Set `HF_TOKEN` beforehand if you need gated models. For the smoke test below (`BAAI/bge-m3`) it is optional; in that case, omit **both** `--set hfToken.create=true` and `--set hfToken.value=...` entirely (leaving `HF_TOKEN` unset with the flags present creates an empty-token secret that will fail later on any gated-model request).

`workers.pools.l4.minReplicas=1` keeps one L4 worker running at all times, which is the simplest path to a working smoke test without KEDA. For scale-from-zero, pass `--set keda.install=true --set autoscaling.enabled=true` as well and change the replica flag to `--set workers.pools.l4.minReplicas=0`.
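
The flag combinations described above (token flags present only when a token exists, KEDA flags only for scale-from-zero) are easy to get wrong by hand. The following sketch assembles the `helm` argument list in Python. The function name and its parameters are hypothetical, and every flag is copied from the install command on this page, so treat it as an illustration rather than a supported tool.

```python
import shlex

CHART = "oci://ghcr.io/superlinked/charts/sie-cluster"


def helm_install_args(irsa_arn, hf_token=None, scale_from_zero=False, version="0.3.4"):
    """Return the argv for `helm upgrade --install`, mirroring the command above."""
    args = [
        "helm", "upgrade", "--install", "sie", CHART,
        "--version", version,
        "-n", "sie", "--create-namespace",
        # Dots inside the annotation key are escaped so Helm does not split on them.
        "--set", f"serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn={irsa_arn}",
        "--set", "workers.pools.l4.enabled=true",
    ]
    min_replicas = 0 if scale_from_zero else 1
    args += ["--set", f"workers.pools.l4.minReplicas={min_replicas}"]
    if scale_from_zero:
        args += ["--set", "keda.install=true", "--set", "autoscaling.enabled=true"]
    if hf_token:
        # Emit both token flags together or neither, to avoid an empty-token secret.
        args += ["--set", "hfToken.create=true", "--set", f"hfToken.value={hf_token}"]
    return args


# Hypothetical role ARN; prints a shell-quoted command line for review.
print(shlex.join(helm_install_args("arn:aws:iam::123456789012:role/sie-irsa")))
```

Printing the command before running it makes it easy to confirm that an unset `HF_TOKEN` leaves both `hfToken` flags out.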

### Smoke Test

```bash
kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &

# Install the Python SDK (requires Python 3.12 — see the SDK README for newer/older Python notes)
pip install sie-sdk

python3 -c "
from sie_sdk import SIEClient

client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape)  # (1024,)
"
```

The first request after scale-from-zero takes ~5–10 minutes (node provisioning, image pull, and model loading), which is why the smoke test passes `provision_timeout_s=600`. See [Scale-from-Zero](/docs/deployment/autoscaling/) for the full flow.
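
Because `kubectl port-forward` is started in the background, the SDK call can race it and fail with a connection error before the tunnel is ready. A small wait loop avoids that. This is a sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, not part of `sie-sdk`.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Usage before creating the client:
#   if not wait_for_port("localhost", 8080):
#       raise SystemExit("port-forward to the gateway is not up")
```

Note that this only confirms the local tunnel accepts connections. Worker capacity is still handled by `wait_for_capacity=True` in the smoke test.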

### Cleanup

```bash
helm uninstall sie -n sie
terraform destroy
```

---

## Differences from GCP

| Feature | GCP (GKE) | AWS (EKS) |
|---------|-----------|-----------|
| GPU scheduling | Native GKE support | NVIDIA Device Plugin required |
| IAM for pods | Workload Identity | IRSA |
| Model cache storage | GCS (`gs://`) | S3 (`s3://`) |
| Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler |
| Spot instances | Spot VMs | Spot Instances |

### S3 for Model Cache

Configure the cluster cache to use S3:

```yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models
```

IRSA handles authentication automatically, so no access keys are needed in the pod.
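
If you point `clusterCache.url` at a bucket you manage yourself, the IRSA role needs permission on that bucket and prefix. The sketch below splits an `s3://` URL and prints a least-privilege policy document for it. The helper names are hypothetical, and the action list (`s3:ListBucket`, `s3:GetObject`, `s3:PutObject`) is an assumption about what a read/write cache needs, so compare it with the policy the Terraform module actually attaches.

```python
import json
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.strip("/")


def cache_policy(url: str) -> dict:
    """Assumed minimal IAM policy for a read/write cache under the given prefix."""
    bucket, prefix = split_s3_url(url)
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/{prefix}/*" if prefix else f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arn},
            {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"], "Resource": objects_arn},
        ],
    }


print(json.dumps(cache_policy("s3://my-bucket/models"), indent=2))
```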

---

## Security Considerations

The default Terraform configuration exposes the API endpoint publicly. For production:

- **Restrict ingress** to your VPC CIDR or specific IP ranges
- **Enable authentication** via oauth2-proxy or static tokens
- **Use a private load balancer** for internal-only access:

```yaml
ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```
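
Before tightening ingress to a VPC CIDR or a set of IP ranges, it helps to check that the clients you expect to reach the API actually fall inside the allowlist. The sketch below does that with the standard-library `ipaddress` module. The function name and the CIDR values are hypothetical, and it only validates a list offline; it is not how the chart or the load balancer enforces access.

```python
import ipaddress


def is_allowed(client_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical VPC CIDR and office range.
allowlist = ["10.0.0.0/16", "192.168.0.0/24"]
print(is_allowed("10.0.12.7", allowlist))    # True
print(is_allowed("203.0.113.9", allowlist))  # False
```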

---

## Docker on AWS (Alternative)

For simpler deployments, run SIE directly on a GPU EC2 instance:

```bash
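# On a g6.xlarge (NVIDIA L4) instance; assumes the NVIDIA driver and the
# NVIDIA Container Toolkit apt repository are already set up
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker before restarting it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker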

docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default
```

This setup has no autoscaling or multi-node scheduling, but it avoids running a cluster and is suitable for single-instance production workloads.

---

## What's Next

- [Upgrade Runbook](/docs/deployment/upgrades/) - pre-upgrade checklist, rolling updates, and rollback
- [Scale-from-Zero](/docs/deployment/autoscaling/) - understanding the 202 flow and cold starts
- [Monitoring](/docs/deployment/monitoring/) - metrics, alerts, and dashboards
- [Kubernetes in GCP](/docs/deployment/cloud-gcp/) - equivalent GKE deployment
