
Kubernetes in AWS

Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.

The architecture mirrors the GCP deployment, with a gateway/config/worker setup and KEDA autoscaling:

[Figure: EKS cluster architecture with Gateway, Config service, L4 and A100 worker pools, KEDA, and Prometheus]

Components:

  • EKS Cluster with managed node groups for GPU instances
  • NVIDIA Device Plugin for GPU scheduling
  • IRSA (IAM Roles for Service Accounts) for S3 access
  • KEDA for autoscaling based on queue depth metrics
  • Prometheus + Grafana + DCGM Exporter for observability

The examples/dev-g6-spot example in superlinked/terraform-aws-sie consumes the published superlinked/sie/aws Terraform registry module, the same module used in production deployments, pinned to a known-good version.

Prerequisites:

  1. AWS account with appropriate permissions.

  2. EC2 quota for g6.2xlarge (NVIDIA L4) in your target region (default: eu-central-1). AWS sets quotas for the G and VT instance families by total vCPU count, with separate limits for on-demand and spot. The dev-g6-spot example uses spot, so check All G and VT Spot Instance Requests (quota code L-3819A6DF):

    aws service-quotas list-service-quotas --service-code ec2 --region eu-central-1 \
    --query 'Quotas[?QuotaCode==`L-3819A6DF`].{Name:QuotaName,Value:Value}' \
    --output table

    Each g6.2xlarge node has 8 vCPUs and the example scales from 0 to 5 nodes, so a quota of 40 or more is sufficient (5 × 8 = 40).

  3. Terraform >= 1.14 and AWS CLI v2 configured.
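The quota arithmetic from step 2 can be sketched as a small check. This is only an illustration, not part of the module: `required_spot_vcpus` and `quota_is_sufficient` are hypothetical helper names, and the vCPU table covers just the two G6 sizes mentioned on this page.

```python
# vCPU counts for the G6 instance sizes mentioned on this page.
VCPUS_PER_NODE = {"g6.xlarge": 4, "g6.2xlarge": 8}


def required_spot_vcpus(instance_type: str, max_nodes: int) -> int:
    """vCPU quota needed to run max_nodes of instance_type at the same time."""
    return VCPUS_PER_NODE[instance_type] * max_nodes


def quota_is_sufficient(quota_value: float, instance_type: str, max_nodes: int) -> bool:
    """Compare the quota value reported by Service Quotas with the requirement."""
    return quota_value >= required_spot_vcpus(instance_type, max_nodes)


# The dev-g6-spot example: g6.2xlarge, scaling up to 5 nodes.
print(required_spot_vcpus("g6.2xlarge", 5))
```

The quota is shared by every G and VT spot instance in the region, so if other workloads already run on those families, the value you need is correspondingly higher.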

git clone https://github.com/superlinked/terraform-aws-sie.git
cd terraform-aws-sie/examples/dev-g6-spot
# Initialize and apply (creates an EKS cluster, ~15-20 min)
terraform init
terraform apply

The example main.tf pins the module version:

module "sie_eks" {
  source  = "superlinked/sie/aws"
  version = "0.3.4"

  aws_region   = var.aws_region
  project_name = var.project_name

  gpu_instance_type = "g6.2xlarge"
  gpu_capacity_type = "SPOT"
  gpu_min_size      = 0
  gpu_max_size      = 5
}

For multi-GPU production setups, use the gpu_node_groups list variable instead of the single-GPU gpu_* variables. See the module variables reference.

If your AWS account already manages SIE ECR repos from another stack (e.g. a shared CI account or a previous deployment), set create_ecr_repositories = false on the module call to skip ECR resource creation. The module still emits the ecr_*_repository_url outputs from caller identity + repo names, so IRSA / Helm wiring is unchanged either way.
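To illustrate why the repository URLs can be derived without creating the repositories, here is a minimal sketch of the standard private ECR URL format. How the module computes its outputs internally is not shown in this guide, so treat this as an assumption; the helper name, the account ID, and the repository name `sie-server` are hypothetical, and the format assumes the standard `aws` partition.

```python
def ecr_repository_url(account_id: str, region: str, repo_name: str) -> str:
    """Compose a private ECR repository URL from the account, region, and repo name."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}"


# Hypothetical account ID and repository name, for illustration only.
print(ecr_repository_url("123456789012", "eu-central-1", "sie-server"))
```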

The Terraform module provisions:

Resource | Purpose
-------- | -------
EKS Cluster | Kubernetes control plane
GPU Node Group | Auto-scaling g6.2xlarge L4 spot instances (0–5 nodes)
NVIDIA Device Plugin | GPU scheduling in Kubernetes
IRSA Role | Workload identity for SIE pods (no static AWS credentials)
ECR Repositories | Created for optional custom images. The chart pulls public images from GHCR by default.

Once the cluster is up, configure kubectl and install the sie-cluster chart. The chart packages KEDA, kube-prometheus-stack, DCGM Exporter, Loki, and Alloy as optional sub-charts; they default to install: false. The smoke test below works with just the core services (gateway, config, worker, NATS). To enable the KEDA-based autoscaling and observability stack, add --set keda.install=true --set autoscaling.enabled=true --set kube-prometheus-stack.install=true --set dcgm-exporter.install=true to the install command.

# Configure kubectl from the terraform output
$(terraform output -raw kubectl_config_command)
# Install SIE (pulls the chart from GHCR, wires up IRSA from the terraform output)
# `workers.pools.l4.enabled=true` is required — the chart's pools default to enabled: false.
IRSA_ARN=$(terraform output -raw sie_irsa_role_arn)
helm upgrade --install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --version 0.3.4 \
  -n sie --create-namespace \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$IRSA_ARN" \
  --set workers.pools.l4.enabled=true \
  --set workers.pools.l4.minReplicas=1 \
  --set hfToken.create=true \
  --set hfToken.value="$HF_TOKEN"
# Wait for rollout
kubectl -n sie get pods -w

Export HF_TOKEN before running the install if you need gated models. For the smoke test below (BAAI/bge-m3) the token is optional; if you skip it, omit both --set hfToken.create=true and --set hfToken.value=... entirely. Leaving the flags in place with HF_TOKEN unset creates an empty-token secret that fails later on any gated-model request.

minReplicas: 1 keeps one L4 worker running at all times, which is the simplest path to a working smoke test without KEDA. For scale-from-zero, additionally pass --set keda.install=true --set autoscaling.enabled=true and change the pool setting to --set workers.pools.l4.minReplicas=0.
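The flag rules above (drop the hfToken flags when no token is set, switch minReplicas and add the KEDA flags for scale-from-zero) can be captured in a small helper that assembles the command. This is a sketch, not an official tool: `helm_install_args` is a hypothetical name, it only builds the argument list without running anything, and it uses only the flags shown in this guide.

```python
def helm_install_args(irsa_role_arn, hf_token=None, scale_from_zero=False):
    """Build the argv list for the helm install shown above (nothing is executed)."""
    sets = [
        # Dots inside the annotation key are escaped for Helm's --set parser.
        f"serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn={irsa_role_arn}",
        "workers.pools.l4.enabled=true",
    ]
    if scale_from_zero:
        # Scale-from-zero needs KEDA and autoscaling, with no always-on worker.
        sets += [
            "keda.install=true",
            "autoscaling.enabled=true",
            "workers.pools.l4.minReplicas=0",
        ]
    else:
        sets.append("workers.pools.l4.minReplicas=1")
    if hf_token:
        # Only create the secret when a token is present; an unset or empty
        # token skips both flags and so avoids the empty-token secret.
        sets += ["hfToken.create=true", f"hfToken.value={hf_token}"]

    args = [
        "helm", "upgrade", "--install", "sie",
        "oci://ghcr.io/superlinked/charts/sie-cluster",
        "--version", "0.3.4",
        "-n", "sie", "--create-namespace",
    ]
    for value in sets:
        args += ["--set", value]
    return args
```

Passing `os.environ.get("HF_TOKEN")` as `hf_token` covers both the unset and the empty case, and the resulting list can be handed to `subprocess.run(args, check=True)` if you want to execute it.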

kubectl -n sie port-forward svc/sie-sie-cluster-gateway 8080:8080 &
# Install the Python SDK (requires Python 3.12 — see the SDK README for newer/older Python notes)
pip install sie-sdk
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'hello world'},
                       gpu='l4', wait_for_capacity=True, provision_timeout_s=600)
print(result['dense'].shape) # (1024,)
"

The first request after scale-from-zero takes ~5–10 minutes (node provisioning + image pull + model loading). See Scale-from-Zero for the full flow.
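The smoke test starts kubectl port-forward in the background, so the first SDK call can race the tunnel coming up. A minimal sketch of a guard, using only the Python standard library, is to wait until the local port accepts TCP connections before creating the client. `wait_for_port` is a hypothetical helper; a successful connection only shows that the forward is listening, not that a worker is ready, which is what wait_for_capacity handles.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the forward is not up yet.
            time.sleep(interval_s)
    return False
```

Calling `wait_for_port("localhost", 8080)` before `SIEClient('http://localhost:8080')` avoids a connection error on the first request.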

# Clean up: remove the Helm release first, then destroy the cluster and its AWS resources
helm uninstall sie -n sie
terraform destroy

Feature | GCP (GKE) | AWS (EKS)
------- | --------- | ---------
GPU scheduling | Native GKE support | NVIDIA Device Plugin required
IAM for pods | Workload Identity | IRSA
Model cache storage | GCS (gs://) | S3 (s3://)
Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler
Spot instances | Spot VMs | Spot Instances

Configure the cluster cache to use S3:

workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models

IRSA handles authentication automatically, so no access keys are needed in the pod.
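For readers who manage the bucket permissions themselves, the sketch below shows what a least-privilege policy for the cache prefix could look like. It is an assumption-laden illustration: which S3 actions SIE actually needs is not stated in this guide (read, write, and list are assumed here), the Terraform module may already attach an equivalent policy to the IRSA role, and both function names are hypothetical.

```python
import json
from urllib.parse import urlparse


def parse_s3_url(url: str):
    """Split an s3:// URL into (bucket, prefix); the prefix may be empty."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.strip("/")


def cache_policy(url: str) -> dict:
    """Build an IAM policy document scoped to the cache bucket and prefix."""
    bucket, prefix = parse_s3_url(url)
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/{prefix}/*" if prefix else f"{bucket_arn}/*"

    list_statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [bucket_arn],
    }
    if prefix:
        # Restrict listing to keys under the cache prefix.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}

    return {
        "Version": "2012-10-17",
        "Statement": [
            list_statement,
            {
                # Assumed actions: reading and writing cached model files.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [object_arn],
            },
        ],
    }


print(json.dumps(cache_policy("s3://my-bucket/models"), indent=2))
```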


The default Terraform configuration exposes the API endpoint publicly. For production:

  • Restrict ingress to your VPC CIDR or specific IP ranges
  • Enable authentication via oauth2-proxy or static tokens
  • Use a private load balancer for internal-only access:
    ingress:
      enabled: true
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-internal: "true"

For simpler deployments, run SIE directly on a GPU EC2 instance:

# On a g6.xlarge (NVIDIA L4) instance
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie-server:latest-cuda12-default

This avoids the operational overhead of an EKS cluster and is suitable for single-instance production workloads.

