How Do You Deploy an Embedding Model on GCP?
Deploying an embedding model on GCP with SIE requires configuring a GPU cluster using the SIE Terraform module for Google Cloud, deploying the inference server with Helm, and connecting via the SIE SDK. Your data stays within your GCP project, costs are up to 50× lower than managed API pricing, and the full deployment takes under 30 minutes.
Why deploy on GCP?
Google Cloud is a strong choice for self-hosted embedding inference when:
- Your application stack already runs on GCP (GKE, Cloud Run, BigQuery)
- You need to keep data in a specific region for compliance (GDPR, data residency)
- You want to run on the NVIDIA A100 or L4 GPUs available on Google Cloud
- You’re optimising costs with GCP committed use discounts or spot VMs
SIE’s GCP Terraform module provisions everything needed: a GKE cluster, GPU node pools, autoscaling, and networking, all within your own GCP project.
Prerequisites
- A GCP project with billing enabled
- GPU quota approved for your target region (request via GCP console if needed)
- Terraform installed (>= 1.3)
- Helm installed (>= 3.0)
- gcloud CLI authenticated (gcloud auth application-default login)
- kubectl installed
Step 1: Configure with Terraform

    # main.tf
    module "sie" {
      source  = "superlinked/sie/google"
      version = "~> 1.0"

      project_id = "your-gcp-project-id"
      region     = "us-central1"

      # GPU configuration
      gpus = ["a100-40gb", "l4-spot"]

      # Optional: use existing network
      network    = "default"
      subnetwork = "default"

      # Autoscaling bounds
      min_nodes = 1
      max_nodes = 6
    }

    output "sie_endpoint" {
      value = module.sie.endpoint
    }

Then initialise and apply:

    terraform init
    terraform plan
    terraform apply  # ~12 minutes to provision GKE cluster + GPU nodes

Terraform creates a GKE Autopilot cluster with GPU node pools, a Cloud Load Balancer endpoint, Workload Identity bindings, and autoscaling policies.
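Once the apply finishes, the endpoint can be read back with `terraform output -json`. The sketch below parses that JSON in Python so a script can pick up the URL without copy-pasting. It assumes the output is named `sie_endpoint`, as in main.tf above; the helper name `read_endpoint` and the sample URL are illustrative only.

```python
import json


def read_endpoint(terraform_output_json: str, name: str = "sie_endpoint") -> str:
    """Extract one output value from the text printed by `terraform output -json`."""
    outputs = json.loads(terraform_output_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found; available: {sorted(outputs)}")
    # Each output is an object with "value", "type" and "sensitive" keys.
    return outputs[name]["value"]


if __name__ == "__main__":
    # Sample payload in the shape Terraform prints; the URL is a placeholder.
    sample = (
        '{"sie_endpoint": {"sensitive": false, "type": "string", '
        '"value": "https://sie.example.com"}}'
    )
    print(read_endpoint(sample))
```

In practice the JSON text would come from running `terraform output -json` in the directory that holds main.tf, for example via `subprocess.run(..., capture_output=True, text=True)`.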
Step 2: Deploy the inference server

    # Authenticate kubectl with your new GKE cluster
    gcloud container clusters get-credentials sie-cluster --region us-central1

    # Deploy SIE via Helm
    helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
      --namespace sie \
      --create-namespace \
      --set replicaCount=2 \
      --set autoscaling.enabled=true

Step 3: Verify

    kubectl get pods -n sie
    # NAME                  READY   STATUS    RESTARTS   AGE
    # sie-7d9f4b8c6-xk2p9   1/1     Running   0          2m
    # sie-7d9f4b8c6-w8lmq   1/1     Running   0          2m

    curl https://<your-sie-endpoint>/health
    # {"status": "ok"}

Step 4: Start encoding
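Before sending encode requests from a script, it can be useful to repeat the health check above programmatically and wait until the endpoint reports ok. This is a minimal sketch using only the Python standard library. It assumes the `/health` route returns the `{"status": "ok"}` payload shown above; the function names, attempt count, and delay are illustrative choices, not part of the SIE SDK.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(
    fetch: Callable[[], dict],
    attempts: int = 10,
    delay_seconds: float = 3.0,
) -> bool:
    """Poll until the health payload reports status "ok" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection errors and malformed JSON are treated as "not ready yet".
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

A call such as `wait_until_healthy(lambda: fetch_health("https://<your-sie-endpoint>"))` returns True once the server answers, or False after the last attempt. Passing the fetch function in as an argument keeps the polling logic easy to test without a live cluster.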

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("https://<your-sie-endpoint>")

    results = client.encode(
        "BAAI/bge-m3",
        [
            Item(text="self-hosted inference on GCP"),
            Item(text="keep embeddings within your Google Cloud project"),
        ],
    )

    print(f"Encoded {len(results)} docs, dim={len(results[0]['dense'])}")

GCP GPU selection guide
| GPU | Machine type | Best for |
|---|---|---|
| A100 40GB | a2-highgpu-1g | High-throughput, large models |
| A100 80GB | a2-ultragpu-1g | Very large models (7B+) |
| L4 | g2-standard-4 (spot) | Cost-efficient standard workloads |
| T4 | n1-standard-4 + T4 | Dev/staging, cost-sensitive |
GCP L4 spot VMs offer excellent price-performance for BGE-M3 and similar models. SIE’s cluster handles spot preemption gracefully, rerouting requests to healthy nodes.
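In case a request still fails in flight while a spot node is being preempted, a client can add its own safety net. The sketch below shows a generic retry with exponential backoff. It is not part of the SIE SDK: `with_retries` is a hypothetical helper, and because the exception types the SDK raises are not documented here, the caller supplies the tuple of exceptions to retry on.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 4,
    base_delay: float = 0.5,
) -> T:
    """Run `call`, retrying on the given exception types with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each time: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

With the client from Step 4, usage could look like `with_retries(lambda: client.encode("BAAI/bge-m3", items))`, with `retry_on` adjusted to whatever errors the SDK actually surfaces.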
Enabling GPU quota on GCP
GCP requires explicit quota approval for GPU instances. If you see quota errors during terraform apply:
- Go to IAM & Admin → Quotas in the GCP Console
- Search for NVIDIA_A100_GPUS or NVIDIA_L4_GPUS in your target region
- Click Edit Quotas and request the number you need (start with 4-8)
- Approval typically takes a few hours for small quotas
Cost comparison: GCP self-hosted vs managed APIs
Encoding 100M documents (avg 256 tokens each), using BGE-M3 for the self-hosted option:
| Option | Cost estimate |
|---|---|
| OpenAI text-embedding-3-small | ~$1,300 |
| Voyage AI | ~$800 |
| SIE on GCP L4 spot | ~$25-40 |
The self-hosted cost varies by GPU utilisation and run time, but the savings are substantial at scale.
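The figures in the table can be sanity-checked with a few lines of arithmetic. The sketch below uses only the table's own estimates (it does not look up current API prices) to compute the total token count and the implied savings range; the function name `savings_range` is illustrative.

```python
DOCS = 100_000_000
AVG_TOKENS_PER_DOC = 256

# Cost estimates in USD, copied from the table above.
MANAGED = {"OpenAI text-embedding-3-small": 1300, "Voyage AI": 800}
SELF_HOSTED_LOW, SELF_HOSTED_HIGH = 25, 40  # SIE on GCP L4 spot


def savings_range(managed_cost: float, low: float, high: float) -> tuple:
    """Return (conservative, optimistic) cost ratios versus self-hosting."""
    # Dividing by the high self-hosted estimate gives the conservative ratio.
    return managed_cost / high, managed_cost / low


total_tokens = DOCS * AVG_TOKENS_PER_DOC
print(f"Total tokens: {total_tokens / 1e9:.1f}B")

for name, cost in MANAGED.items():
    worst, best = savings_range(cost, SELF_HOSTED_LOW, SELF_HOSTED_HIGH)
    print(f"{name}: {worst:.1f}x to {best:.1f}x the self-hosted cost")
```

On these numbers the workload is 25.6B tokens, and the managed options come out at roughly 32x to 52x (OpenAI) and 20x to 32x (Voyage AI) the self-hosted estimate, which is where the "up to 50×" figure comes from.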
Frequently asked questions
Does SIE support GKE Autopilot?
SIE works with both GKE Standard and GKE Autopilot. Autopilot simplifies node management but has more constraints on GPU configuration. Standard mode gives more control for GPU workloads.
Can I use Workload Identity with SIE?
Yes. The Terraform module configures Workload Identity by default, so SIE pods can access GCS buckets or Secret Manager without service account key files.
What regions have good GPU availability on GCP?
us-central1 and us-east1 typically have the best A100 and L4 availability. europe-west4 is the best European option. Check GCP’s GPU availability dashboard before choosing a region.