---
title: How Do You Deploy an Embedding Model on AWS?
description: "Deploying an embedding model on AWS with SIE requires three steps: configure your GPU cluster using the SIE Terraform module, deploy the inference server with Helm, and connect using the SIE SDK. The process takes under 30 minutes and produces a production-grade inference cluster in your own AWS account with automat..."
canonical_url: https://superlinked.com/glossary/how-to-deploy-embedding-model-on-aws
last_updated: 2026-06-01
---

# How Do You Deploy an Embedding Model on AWS?

Deploying an embedding model on AWS with SIE takes three main steps: provision your GPU cluster with the SIE Terraform module, deploy the inference server with Helm, and connect using the SIE SDK. A quick health check before the first request confirms the cluster is ready. The process takes under 30 minutes and produces a production-grade inference cluster in your own AWS account with automatic scaling and model management.

---

## Why deploy on AWS rather than using a managed API?

Managed embedding APIs (OpenAI, Cohere, Voyage) charge per token. At production scale (encoding millions of documents or serving real-time queries), per-token costs become prohibitive. Self-hosting on your own AWS GPUs reduces inference costs by up to 50×, keeps data within your AWS account (critical for compliance), and gives you full control over model selection and configuration.
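
The trade-off between per-token pricing and a fixed GPU bill can be estimated with a short calculation. The sketch below is illustrative only: the function names are hypothetical and the prices are placeholders, not real quotes from any provider, so substitute your own numbers.

```python
def api_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a per-token API for a given token volume."""
    return tokens / 1_000_000 * usd_per_million_tokens


def self_hosted_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of running your own GPU nodes for a given number of hours."""
    return gpu_hours * usd_per_gpu_hour


def break_even_tokens(gpu_hours: float, usd_per_gpu_hour: float,
                      usd_per_million_tokens: float) -> float:
    """Token volume above which the self-hosted cluster is the cheaper option."""
    fixed_cost = self_hosted_cost(gpu_hours, usd_per_gpu_hour)
    return fixed_cost / usd_per_million_tokens * 1_000_000


# Placeholder inputs (assumptions, not real prices): one GPU node running
# for a 730-hour month at $1.00/hour versus an API at $0.10 per million tokens.
tokens = break_even_tokens(730, usd_per_gpu_hour=1.0, usd_per_million_tokens=0.10)
print(f"Break-even volume: {tokens:,.0f} tokens per month")
```

This ignores throughput limits (one node can only encode so many tokens per hour) and engineering time, so treat the result as a rough lower bound on the volume at which self-hosting pays off.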

---

## Prerequisites

Before deploying SIE on AWS, you need:

- An AWS account with permissions to create EC2 instances, VPCs, and IAM roles
- Terraform installed locally (`>= 1.3`)
- Helm installed locally (`>= 3.8`, the first release with stable support for charts hosted in OCI registries)
- `kubectl` installed locally (you point it at the new cluster once Step 1 has created it)
- The `sie-sdk` Python package (`pip install sie-sdk`)
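
A quick way to confirm the command-line prerequisites before starting is to check that each tool is on your `PATH`. This is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical, and it checks presence only, not versions.

```python
import shutil

# CLIs named in the prerequisites list.
REQUIRED_TOOLS = ["terraform", "helm", "kubectl"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required CLIs found on PATH.")
```

The `which` parameter exists only so the check can be exercised without the tools installed; in normal use, call `missing_tools()` with no arguments.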

---

## Step 1: Configure the cluster with Terraform

SIE provides an official AWS Terraform module that provisions the GPU infrastructure:

```hcl
# main.tf
module "sie" {
  source  = "superlinked/sie/aws"
  version = "~> 1.0"

  region = "us-east-1"

  # GPU configuration — mix on-demand and spot for cost efficiency
  gpus = ["a100-40gb", "l4-spot"]

  # VPC configuration (use existing or let SIE create one)
  vpc_id     = "vpc-xxxxxxxx"   # optional — omit to create new
  subnet_ids = ["subnet-xxx"]   # optional

  # Cluster sizing
  min_nodes = 1
  max_nodes = 4
}

output "sie_endpoint" {
  value = module.sie.endpoint
}
```

```bash
terraform init
terraform plan    # review what will be created
terraform apply   # provision the cluster (~10 minutes)
```

This creates an EKS cluster with GPU node groups, auto-scaling policies, IAM roles, and a load balancer endpoint.
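
To pass the endpoint to later steps without copying it by hand, you can read the `sie_endpoint` output programmatically. The sketch below relies on `terraform output -json`, which prints each output as an object with a `value` field; the helper names are hypothetical.

```python
import json
import subprocess


def parse_endpoint(output_json: str, name: str = "sie_endpoint") -> str:
    """Extract one output value from the text printed by `terraform output -json`."""
    outputs = json.loads(output_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found")
    return outputs[name]["value"]


def read_endpoint(workdir: str = ".") -> str:
    """Run `terraform output -json` in `workdir` and return the SIE endpoint."""
    completed = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,           # raise if terraform exits with an error
        capture_output=True,
        text=True,
    )
    return parse_endpoint(completed.stdout)
```

Call `read_endpoint()` from the directory that contains `main.tf` after `terraform apply` has finished.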

---

## Step 2: Deploy the inference server with Helm

Once the cluster is provisioned, deploy the SIE inference server:

```bash
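# Install the cluster chart straight from the OCI registry.
# OCI registries are not registered with `helm repo add`; the chart is
# referenced by its full oci:// URL (path assumed to be <registry>/sie-cluster).
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \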
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true \
  --set autoscaling.maxReplicas=8
```

The Helm chart deploys the SIE server pods, configures health checks, and sets up horizontal pod autoscaling based on GPU utilisation.
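
To build intuition for how the replica count moves, the sketch below reproduces the standard Kubernetes Horizontal Pod Autoscaler rule: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance of 1. The 60% target and the minimum of 2 replicas are assumptions for illustration; the chart's actual defaults may differ.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica count the standard HPA rule would request, clamped to the bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the autoscaler leaves the count alone.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Assumed target of 60% GPU utilisation, bounds of 2 and 8 replicas.
for utilisation in (30, 60, 90, 240):
    replicas = desired_replicas(2, utilisation, 60, min_replicas=2, max_replicas=8)
    print(f"GPU utilisation {utilisation}% -> {replicas} replicas")
```

Because the result is clamped, a sustained spike can never push the deployment past `autoscaling.maxReplicas`, which is what keeps the GPU bill bounded.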

---

## Step 3: Verify the deployment

```bash
# Check pods are running
kubectl get pods -n sie

# Check the endpoint is healthy
curl https://<your-sie-endpoint>/health
# → {"status": "ok", "models_loaded": 0}
```
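
In a deployment script, the same check can be repeated until the endpoint responds. This is a minimal sketch using only the standard library; it assumes the `/health` response shape shown above, and the helper names, attempt count, and delay are hypothetical.

```python
import json
import time
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(fetch, attempts: int = 30, delay: float = 10.0,
                       sleep=time.sleep) -> bool:
    """Call `fetch()` until it reports status "ok" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection refused, DNS not ready yet, or a non-JSON body: retry.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call is `wait_until_healthy(lambda: fetch_health("https://<your-sie-endpoint>"))`, which returns `False` if the endpoint never becomes healthy within the allotted attempts.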

---

## Step 4: Start encoding

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

# Load a model on first use (cached on GPU after first request)
results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on AWS"),
        Item(text="reduce embedding API costs with your own GPU"),
    ],
)

print(f"Encoded {len(results)} documents, dim={len(results[0]['dense'])}")
```

Models are downloaded from Hugging Face on first use and cached on the GPU node. Subsequent requests are served from the cache with full GPU throughput.
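
Once you have dense vectors, a common next step is to compare them. The sketch below computes cosine similarity in plain Python; it assumes each `dense` value is a sequence of floats, and the short vectors are stand-ins rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Stand-in vectors; in practice pass results[0]["dense"] and results[1]["dense"].
query_vector = [0.1, 0.3, 0.5]
document_vector = [0.2, 0.1, 0.4]
print(f"similarity = {cosine_similarity(query_vector, document_vector):.3f}")
```

For large collections, a vector database or a NumPy-based implementation is a better fit than this loop-based version.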

---

## GPU selection guide for AWS

| GPU | Instance type | Best for |
|---|---|---|
| A100 40GB | p4d.24xlarge | Large models, highest throughput |
| A100 80GB | p4de.24xlarge | Very large models (7B+) |
| L4 | g6.xlarge (spot) | Cost-efficient, medium throughput |
| T4 | g4dn.xlarge | Low-cost dev/staging |

For most embedding workloads (BGE-M3, E5-large), L4 spot instances offer the best cost-performance ratio. Use A100s for 7B+ instruction-following models.

---

## Cost optimisation tips

**Use spot instances**: L4 spot instances reduce GPU costs by 60-70% vs on-demand. SIE handles spot interruption gracefully.

**Enable autoscaling**: scale down to 1 node during off-peak hours, scale up for batch indexing jobs.

**Batch encode**: send large batches per request to maximise GPU utilisation:

```python
# Encode 1000 documents in one call: much more efficient than 1000 individual calls.
# Assumes `client` is the SIEClient from Step 4 and `document_batch_of_1000`
# is a list of 1000 strings.
from sie_sdk.types import Item

results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in document_batch_of_1000],
)
vectors = [r["dense"] for r in results]  # one dense vector per input document
```
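
When the corpus is larger than a single request should carry, split it into fixed-size batches first. The helper below is a hypothetical sketch built on the standard library; the batch size of 1000 mirrors the example above and is not a documented SIE limit.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in corpus; each batch would be wrapped in Item(...) and passed to
# client.encode(...) as in the example above.
documents = [f"document {i}" for i in range(2500)]
for batch in chunked(documents, 1000):
    print(f"would encode {len(batch)} documents")
```

Because `chunked` accepts any iterable, it also works with a generator that streams documents from disk, so the full corpus never has to sit in memory.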

**Share the cluster across models**: SIE's multi-model support means one cluster serves all your embedding, reranking, and extraction needs.

---

## Frequently asked questions

**How long does deployment take?**
Terraform provisioning takes ~10 minutes. Helm deployment takes ~3 minutes. Total time to first encode request: under 20 minutes.

**Is my data sent outside my AWS account?**
No. SIE runs entirely within your AWS account. Encoding requests go from your application to your SIE cluster, never to Superlinked's servers.

**What AWS regions are supported?**
SIE supports any AWS region with GPU instance availability, including us-east-1, us-west-2, eu-west-1, and ap-southeast-1. Specify `region` in the Terraform module.

**Do I need Kubernetes experience?**
Basic familiarity with `kubectl` and Helm is helpful. SIE's Terraform module abstracts the EKS cluster setup, so you don't need deep Kubernetes expertise.

---

## Related resources

- [SIE deployment documentation](/docs/deployment)
- [SIE vs TEI vs OpenAI benchmark](/docs/examples/benchmark)
- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
- [How do you deploy on GCP?](/glossary/how-to-deploy-embedding-model-on-gcp)
- [What is GPU utilisation in inference?](/glossary/what-is-gpu-utilisation-in-inference)
