How Do You Deploy an Embedding Model on AWS?
Deploying an embedding model on AWS with SIE takes four steps: provision a GPU cluster with the SIE Terraform module, deploy the inference server with Helm, verify the deployment, and send encode requests through the SIE SDK. The process takes under 30 minutes and produces a production-grade inference cluster in your own AWS account with automatic scaling and model management.
Why deploy on AWS rather than using a managed API?
Managed embedding APIs (OpenAI, Cohere, Voyage) charge per token. At production scale (encoding millions of documents or serving real-time queries), per-token costs become prohibitive. Self-hosting on your own AWS GPUs reduces inference costs by up to 50×, keeps data within your AWS account (critical for compliance), and gives you full control over model selection and configuration.
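The cost argument can be checked with a little arithmetic. The sketch below compares a per-token bill with a flat GPU-hour bill. Every price and throughput figure in it is a placeholder chosen for illustration, not a quoted rate or a benchmark, so substitute your own numbers before drawing conclusions.

```python
def managed_api_cost(tokens: float, usd_per_million_tokens: float) -> float:
    """Cost of encoding `tokens` tokens through a per-token API."""
    return tokens / 1_000_000 * usd_per_million_tokens


def self_hosted_cost(tokens: float, tokens_per_second: float, usd_per_gpu_hour: float) -> float:
    """Cost of encoding `tokens` tokens on one GPU billed by the hour."""
    gpu_hours = tokens / tokens_per_second / 3600
    return gpu_hours * usd_per_gpu_hour


# Placeholder inputs (assumptions, not real prices or measured throughput).
tokens = 10_000_000_000   # tokens encoded per month
api_price = 0.10          # USD per million tokens
gpu_price = 1.00          # USD per GPU hour
gpu_throughput = 50_000   # tokens per second on one GPU

api_bill = managed_api_cost(tokens, api_price)
gpu_bill = self_hosted_cost(tokens, gpu_throughput, gpu_price)
print(f"managed API: ${api_bill:,.2f}, self-hosted: ${gpu_bill:,.2f}")
```

This only counts busy GPU time. A cluster that sits idle still bills by the hour, so the real saving depends on how well autoscaling tracks your load.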
Prerequisites
Before deploying SIE on AWS, you need:
- An AWS account with permissions to create EKS clusters, EC2 instances, VPCs, and IAM roles
- Terraform installed locally (>= 1.3)
- Helm installed locally (>= 3.0)
- kubectl configured to connect to your cluster
- The sie-sdk Python package (pip install sie-sdk)
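The tool prerequisites above can be checked from a script before you start. This is a minimal sketch using only the Python standard library. The helper names (`parse_version`, `meets_minimum`, `missing_tools`) are made up for this example and are not part of any SIE tooling.

```python
import re
import shutil


def parse_version(text: str) -> tuple:
    """Extract the first dotted version from a string such as 'v3.14.2'.

    Always returns (major, minor, patch); a missing patch counts as 0.
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(found: str, minimum: str) -> bool:
    """True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that are not on PATH."""
    return [tool for tool in tools if which(tool) is None]


# Lists whichever of the three CLIs are absent on this machine.
print(missing_tools(["terraform", "helm", "kubectl"]))
```

To check versions, pass the text printed by `terraform version` or `helm version --short` to `meets_minimum` together with the minimum from the list above.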
Step 1: Configure the cluster with Terraform
SIE provides an official AWS Terraform module that provisions the GPU infrastructure:
```hcl
# main.tf
module "sie" {
  source  = "superlinked/sie/aws"
  version = "~> 1.0"

  region = "us-east-1"

  # GPU configuration: mix on-demand and spot for cost efficiency
  gpus = ["a100-40gb", "l4-spot"]

  # VPC configuration (use existing or let SIE create one)
  vpc_id     = "vpc-xxxxxxxx"  # optional: omit to create new
  subnet_ids = ["subnet-xxx"]  # optional

  # Cluster sizing
  min_nodes = 1
  max_nodes = 4
}

output "sie_endpoint" {
  value = module.sie.endpoint
}
```

```bash
terraform init
terraform plan   # review what will be created
terraform apply  # provision the cluster (~10 minutes)
```

This creates an EKS cluster with GPU node groups, auto-scaling policies, IAM roles, and a load balancer endpoint.
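Once the cluster is provisioned, later steps need the value of the `sie_endpoint` output. One way to read it from a script is to parse the JSON that `terraform output -json` prints. The sketch below assumes that standard output shape. The function name and the sample hostname are illustrative placeholders.

```python
import json


def endpoint_from_terraform_output(raw_json: str, name: str = "sie_endpoint") -> str:
    """Pull one output value out of the text printed by `terraform output -json`."""
    outputs = json.loads(raw_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found")
    return outputs[name]["value"]


# Sample payload in the shape Terraform prints; the hostname is made up.
sample = (
    '{"sie_endpoint": {"sensitive": false, "type": "string", '
    '"value": "https://sie.example.internal"}}'
)
print(endpoint_from_terraform_output(sample))
```

In practice the input would come from something like `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True, check=True).stdout`, run in the directory that holds `main.tf`.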
Step 2: Deploy the inference server with Helm
Once the cluster is provisioned, deploy the SIE inference server:
```bash
# Helm pulls OCI charts directly by reference (Helm 3.8+);
# `helm repo add` does not accept oci:// URLs.
# The chart path assumes sie-cluster is published under this registry.
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true \
  --set autoscaling.maxReplicas=8
```

The Helm chart deploys the SIE server pods, configures health checks, and sets up horizontal pod autoscaling based on GPU utilisation.
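To see how the autoscaling settings interact, it helps to look at the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: the desired count is the current count scaled by the ratio of observed to target metric, rounded up and clamped to the configured bounds. The sketch below reproduces that formula. The 60% utilisation target and the minimum of 2 replicas are assumptions for illustration; the chart's actual defaults are not stated here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,   # assumed to match replicaCount above
    max_replicas: int = 8,   # autoscaling.maxReplicas above
) -> int:
    """Replica count per the documented HPA formula, clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Two pods at 90% GPU utilisation against a hypothetical 60% target.
print(desired_replicas(2, current_metric=90, target_metric=60))
```

The real autoscaler also applies a tolerance band and stabilisation windows, so it reacts less abruptly than this bare formula suggests.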
Step 3: Verify the deployment
```bash
# Check pods are running
kubectl get pods -n sie

# Check the endpoint is healthy
curl https://<your-sie-endpoint>/health
# → {"status": "ok", "models_loaded": 0}
```

Step 4: Start encoding
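Before sending the first encode request from a script, it can be useful to wait until the health endpoint reports `"status": "ok"`, since pods may still be starting right after the Helm install. This is a minimal polling sketch using the standard library. The helper names, retry count, and delay are assumptions, and the only response field it relies on is the `status` key shown in the health check above.

```python
import json
import time
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(fetch, attempts: int = 30, delay_seconds: float = 10.0,
                       sleep=time.sleep) -> dict:
    """Call `fetch()` until it reports status 'ok' or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            payload = fetch()
            if payload.get("status") == "ok":
                return payload
            last_error = RuntimeError(f"unexpected health payload: {payload}")
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
        if attempt < attempts - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"endpoint not healthy after {attempts} attempts") from last_error


# Usage (replace the placeholder with your endpoint):
# wait_until_healthy(lambda: fetch_health("https://<your-sie-endpoint>"))
```

Passing the fetch function in as an argument keeps the retry loop testable without a live cluster.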
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

# Load a model on first use (cached on GPU after first request)
results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on AWS"),
        Item(text="reduce embedding API costs with your own GPU"),
    ],
)

print(f"Encoded {len(results)} documents, dim={len(results[0]['dense'])}")
```

Models are downloaded from Hugging Face on first use and cached on the GPU node. Subsequent requests are served from the cache with full GPU throughput.
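The dense vectors returned by an encode call are typically compared with cosine similarity. The sketch below shows that computation in plain Python. It assumes each `r["dense"]` is a sequence of floats, as the example above implies, and uses toy three-dimensional vectors in place of real embeddings.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for results[0]["dense"] and results[1]["dense"].
query = [1.0, 0.0, 1.0]
doc = [1.0, 1.0, 0.0]
print(round(cosine_similarity(query, doc), 3))
```

For large collections a vector database or a NumPy-based implementation would be the usual choice. The loop above is only meant to make the arithmetic visible.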
GPU selection guide for AWS
| GPU | Instance type | Best for |
|---|---|---|
| A100 40GB | p4d.24xlarge | Largest models, highest throughput |
| A100 80GB | p4de.24xlarge | Very large models (7B+) |
| L4 | g6.xlarge (spot) | Cost-efficient, medium throughput |
| T4 | g4dn.xlarge | Low-cost dev/staging |
For most embedding workloads (BGE-M3, E5-large), L4 spot instances offer the best cost-performance ratio. Use A100s for 7B+ instruction-following models.
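A rough memory estimate helps explain why 7B+ models land on A100s. Weights stored in 16-bit precision take about two bytes per parameter, so a 7B model needs roughly 14 GB before activations and batching are counted. The sketch below applies that rule of thumb. The headroom factor of 2 is an assumption, and the function only checks whether a model fits in memory; it says nothing about throughput or price, which is where L4 spot instances earn their recommendation.

```python
# On-board memory per GPU in the table above, in GB.
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A100 40GB": 40, "A100 80GB": 80}


def weight_memory_gb(parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory for the weights alone in GB (1 GB = 1e9 bytes); 2 bytes = fp16/bf16."""
    return parameters * bytes_per_parameter / 1e9


def smallest_gpu_that_fits(parameters: float, headroom: float = 2.0,
                           bytes_per_parameter: int = 2):
    """Smallest GPU whose memory covers weights times a headroom factor.

    Returns None when no GPU in the table is large enough.
    """
    needed = weight_memory_gb(parameters, bytes_per_parameter) * headroom
    for name, memory in sorted(GPU_MEMORY_GB.items(), key=lambda item: item[1]):
        if memory >= needed:
            return name
    return None


print(weight_memory_gb(7e9), smallest_gpu_that_fits(7e9))
```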
Cost optimisation tips
Use spot instances: L4 spot instances reduce GPU costs by 60-70% vs on-demand. SIE handles spot interruption gracefully.
Enable autoscaling: scale down to 1 node during off-peak hours, scale up for batch indexing jobs.
Batch encode: send large batches per request to maximise GPU utilisation:
```python
# Encode 1000 documents in one call: much more efficient than 1000 individual calls
from sie_sdk.types import Item

results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in document_batch_of_1000],
)
vectors = [r["dense"] for r in results]
```

Share the cluster across models: SIE’s multi-model support means one cluster serves all your embedding, reranking, and extraction needs.
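For the batch-encoding tip, a corpus larger than one batch has to be split before it is sent. A minimal sketch of that split is shown here. The `batched` helper is a hypothetical name, and the batch size of 1000 simply follows the figure used above; the best value depends on document length and GPU memory.

```python
from typing import Iterator, Sequence


def batched(documents: Sequence[str], batch_size: int = 1000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `documents`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


documents = [f"document {i}" for i in range(2500)]
sizes = [len(batch) for batch in batched(documents)]
print(sizes)  # → [1000, 1000, 500]
```

Each batch would then go through one `client.encode("BAAI/bge-m3", [Item(text=d) for d in batch])` call, as in the batch-encoding example.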
Frequently asked questions
How long does deployment take?
Terraform provisioning takes ~10 minutes. Helm deployment takes ~3 minutes. Total time to first encode request: under 20 minutes.
Is my data sent outside my AWS account?
No. SIE runs entirely within your AWS account. Encoding requests go from your application to your SIE cluster, never to Superlinked’s servers.
What AWS regions are supported?
SIE supports all AWS regions with GPU instance availability: us-east-1, us-west-2, eu-west-1, ap-southeast-1, and others. Specify region in the Terraform module.
Do I need Kubernetes experience?
Basic familiarity with kubectl and Helm is helpful. SIE’s Terraform module abstracts the EKS cluster setup, so you don’t need deep Kubernetes expertise.