
How Do You Deploy an Embedding Model on AWS?

Deploying an embedding model on AWS with SIE takes four steps: provision a GPU cluster with the SIE Terraform module, deploy the inference server with Helm, verify the deployment, and start encoding with the SIE SDK. The process typically takes under 20 minutes and produces a production-grade inference cluster in your own AWS account with automatic scaling and model management.

Why deploy on AWS rather than using a managed API?

Managed embedding APIs (OpenAI, Cohere, Voyage) charge per token, so the bill grows linearly with volume. At production scale (encoding millions of documents or serving real-time queries), that adds up quickly. Self-hosting on your own AWS GPUs can reduce inference costs by up to 50×, keeps data within your AWS account (critical for compliance), and gives you full control over model selection and configuration.
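
To see where self-hosting starts to pay off for your own workload, a rough break-even estimate helps. The sketch below is illustrative only: the function names are hypothetical, and every price in the usage example is a placeholder rather than a quote from any provider. It also assumes the GPUs you pay for have enough throughput to serve the volume.

```python
import math

HOURS_PER_MONTH = 730.0  # average hours in a month


def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of a per-token API for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_gpu_cost(gpu_count: int, usd_per_gpu_hour: float,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running gpu_count GPUs continuously for `hours` hours."""
    return gpu_count * usd_per_gpu_hour * hours


def break_even_tokens(gpu_count: int, usd_per_gpu_hour: float,
                      usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the self-hosted GPUs are cheaper."""
    gpu_cost = monthly_gpu_cost(gpu_count, usd_per_gpu_hour)
    return gpu_cost / usd_per_million_tokens * 1_000_000


# Placeholder prices, for illustration only.
threshold = break_even_tokens(gpu_count=1, usd_per_gpu_hour=1.00,
                              usd_per_million_tokens=0.10)
print(f"Self-hosting is cheaper above {threshold:,.0f} tokens per month")
```

Substitute your actual API price and the on-demand or spot price of the instance type you plan to run.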

Prerequisites

Before deploying SIE on AWS, you need:

An AWS account with permissions to create EKS clusters, EC2 instances, VPCs, IAM roles, and load balancers
Terraform installed locally (>= 1.3)
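Helm installed locally (>= 3.8, the first release with stable support for the OCI registries used in Step 2)
kubectl installed locally (you point it at the new cluster once Step 1 has created it)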
The sie-sdk Python package (pip install sie-sdk)
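
A quick way to confirm the command-line prerequisites are on your PATH before you start is a small check like the one below. It is a minimal sketch: the `missing_tools` helper is hypothetical, and it checks only that the binaries exist, not their versions.

```python
import shutil

# Command-line tools named in the prerequisites list.
REQUIRED_TOOLS = ("terraform", "helm", "kubectl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```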

Step 1: Configure the cluster with Terraform

SIE provides an official AWS Terraform module that provisions the GPU infrastructure:

# main.tf
module "sie" {
  source  = "superlinked/sie/aws"
  version = "~> 1.0"

  region = "us-east-1"

  # GPU configuration — mix on-demand and spot for cost efficiency
  gpus = ["a100-40gb", "l4-spot"]

  # VPC configuration (use existing or let SIE create one)
  vpc_id     = "vpc-xxxxxxxx"   # optional — omit to create new
  subnet_ids = ["subnet-xxx"]   # optional

  # Cluster sizing
  min_nodes = 1
  max_nodes = 4
}

output "sie_endpoint" {
  value = module.sie.endpoint
}

terraform init
terraform plan    # review what will be created
terraform apply   # provision the cluster (~10 minutes)

This creates an EKS cluster with GPU node groups, auto-scaling policies, IAM roles, and a load balancer endpoint.
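
If you want to pick up the endpoint programmatically rather than copying it from the terminal, one option is to read it from `terraform output -json`. The sketch below assumes it runs in the directory containing main.tf and that the output is named sie_endpoint as in the example; the helper names are hypothetical.

```python
import json
import subprocess


def parse_terraform_outputs(raw_json: str) -> dict:
    """Flatten `terraform output -json` into a {name: value} mapping."""
    outputs = json.loads(raw_json)
    return {name: entry["value"] for name, entry in outputs.items()}


def read_terraform_outputs(workdir: str = ".") -> dict:
    """Run `terraform output -json` in workdir and return the flattened outputs."""
    completed = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,            # raise if terraform exits with an error
        capture_output=True,
        text=True,
    )
    return parse_terraform_outputs(completed.stdout)


# Usage, after `terraform apply` has finished:
#   endpoint = read_terraform_outputs()["sie_endpoint"]
```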

Step 2: Deploy the inference server with Helm

Once the cluster is provisioned, deploy the SIE inference server:

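# Install the cluster chart directly from the OCI registry.
# `helm repo add` does not accept oci:// URLs; OCI charts are referenced by their full path.
# The path below assumes the sie-cluster chart is published under this registry prefix.
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true \
  --set autoscaling.maxReplicas=8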

The Helm chart deploys the SIE server pods, configures health checks, and sets up horizontal pod autoscaling based on GPU utilisation.
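
To build intuition for how the autoscaler reacts to load, the sketch below reproduces the standard Kubernetes HorizontalPodAutoscaler formula, desired = ceil(current × currentMetric / targetMetric), clamped to the replica bounds. The target utilisation and the lower bound of 2 are assumptions for illustration; the chart's actual defaults may differ, and the real controller also applies a tolerance band and stabilisation windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilisation: float,
                     target_utilisation: float,
                     min_replicas: int = 2,    # assumed floor (matches replicaCount=2)
                     max_replicas: int = 8) -> int:  # matches autoscaling.maxReplicas=8
    """Replica count the HPA formula would ask for, clamped to the bounds."""
    raw = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, raw))


# Example: 2 replicas at 90% GPU utilisation against a hypothetical 60% target.
print(desired_replicas(2, 90, 60))
```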

Step 3: Verify the deployment

# Check pods are running
kubectl get pods -n sie

# Check the endpoint is healthy
curl https://<your-sie-endpoint>/health
# → {"status": "ok", "models_loaded": 0}
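
Pods can take a little while to become ready, so in a script you may prefer to poll the health endpoint rather than check it once. This is a minimal sketch using only the standard library; it assumes the /health response shape shown above, and the helper names, attempt count, and delay are hypothetical.

```python
import json
import time
import urllib.request


def fetch_health(url: str, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_healthy(url: str, attempts: int = 30, delay: float = 10.0,
                       fetch=fetch_health, sleep=time.sleep) -> bool:
    """Poll until the endpoint reports status "ok" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection errors (OSError) or a non-JSON body (ValueError): retry.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage:
#   wait_until_healthy("https://<your-sie-endpoint>/health")
```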

Step 4: Start encoding

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

# Load a model on first use (cached on GPU after first request)
results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on AWS"),
        Item(text="reduce embedding API costs with your own GPU"),
    ],
)

print(f"Encoded {len(results)} documents, dim={len(results[0]['dense'])}")

Models are downloaded from Hugging Face on first use and cached on the GPU node. Subsequent requests are served from the cache with full GPU throughput.
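
One way to observe the caching behaviour is to time the first request against a later one. The `timed` helper below is a hypothetical convenience wrapper, not part of the SIE SDK; it works with any callable.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage with the client from the example above:
#   items = [Item(text="self-hosted inference on AWS")]
#   _, cold = timed(client.encode, "BAAI/bge-m3", items)  # includes model download
#   _, warm = timed(client.encode, "BAAI/bge-m3", items)  # served from the GPU cache
#   print(f"cold={cold:.1f}s warm={warm:.3f}s")
```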

GPU selection guide for AWS

GPU	Instance type	Best for
A100 40GB	p4d.24xlarge	Large models, highest throughput
A100 80GB	p4de.24xlarge	Very large models (7B+)
L4	g6.xlarge (spot)	Cost-efficient, medium throughput
T4	g4dn.xlarge	Low-cost dev/staging

For most embedding workloads (BGE-M3, E5-large), L4 spot instances offer the best cost-performance ratio. Use A100s for 7B+ instruction-following models.
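
The guidance above can be written down as a small lookup, which is handy if you generate configuration from a script. This sketch transcribes the table and the 7B rule of thumb; the dictionary keys and the `suggest_gpu` helper are illustrative labels, not identifiers that SIE or its Terraform module defines.

```python
# GPU label -> AWS instance type, transcribed from the table above.
AWS_GPU_OPTIONS = {
    "A100 40GB": "p4d.24xlarge",
    "A100 80GB": "p4de.24xlarge",
    "L4": "g6.xlarge",
    "T4": "g4dn.xlarge",
}


def suggest_gpu(model_params_billions: float, production: bool = True) -> str:
    """Rough starting point based on model size and environment."""
    if model_params_billions >= 7:
        return "A100 80GB"   # very large models (7B+)
    if production:
        return "L4"          # best cost-performance for most embedding models
    return "T4"              # low-cost dev/staging


choice = suggest_gpu(0.6)
print(choice, AWS_GPU_OPTIONS[choice])
```

Treat the result as a first guess and confirm it against your own throughput and latency measurements.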

Cost optimisation tips

Use spot instances: L4 spot instances reduce GPU costs by 60-70% vs on-demand. SIE handles spot interruption gracefully.

Enable autoscaling: scale down to 1 node during off-peak hours, scale up for batch indexing jobs.

Batch encode: send large batches per request to maximise GPU utilisation:

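# Encode 1000 documents in one call, which is much more efficient than 1000 individual calls.
# Reuses the `client` created in Step 4; `document_batch_of_1000` is your own list of strings.
from sie_sdk.types import Item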

results = client.encode(
    "BAAI/bge-m3",
    [Item(text=d) for d in document_batch_of_1000],
)
vectors = [r["dense"] for r in results]
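
For a corpus larger than one batch, you can split it into fixed-size chunks and encode them one after another. The `chunked` helper below is a hypothetical utility, and 1000 is simply the batch size used in the example above; the best value depends on document length and GPU memory.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Usage with the client and Item type from the examples above:
#   vectors = []
#   for batch in chunked(all_documents, size=1000):
#       results = client.encode("BAAI/bge-m3", [Item(text=d) for d in batch])
#       vectors.extend(r["dense"] for r in results)
```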

Share the cluster across models: SIE’s multi-model support means one cluster serves all your embedding, reranking, and extraction needs.

Frequently asked questions

How long does deployment take? Terraform provisioning takes ~10 minutes. Helm deployment takes ~3 minutes. Total time to first encode request: under 20 minutes.

Is my data sent outside my AWS account? No. SIE runs entirely within your AWS account. Encoding requests go from your application to your SIE cluster, never to Superlinked’s servers.

What AWS regions are supported? SIE supports all AWS regions with GPU instance availability, including us-east-1, us-west-2, eu-west-1, and ap-southeast-1. Set the region variable in the Terraform module to choose one.

Do I need Kubernetes experience? Basic familiarity with kubectl and Helm is helpful. SIE’s Terraform module abstracts the EKS cluster setup, so you don’t need deep Kubernetes expertise.