
How Do You Deploy an Embedding Model on GCP?

Deploying an embedding model on GCP with SIE involves provisioning a GPU cluster with the SIE Terraform module for Google Cloud, deploying the inference server with Helm, and connecting through the SIE SDK. Your data stays within your GCP project, costs are up to 50× lower than managed API pricing, and the full deployment takes under 30 minutes.


Why deploy on GCP?

Google Cloud is a strong choice for self-hosted embedding inference when:

  • Your application stack already runs on GCP (GKE, Cloud Run, BigQuery)
  • You need to keep data in a specific region for compliance (GDPR, data residency)
  • You want to use Google’s A100 or L4 GPU fleet
  • You’re optimising costs with GCP committed use discounts or spot VMs

SIE’s GCP Terraform module provisions everything needed: a GKE cluster, GPU node pools, autoscaling, and networking, all within your own GCP project.


Prerequisites

  • A GCP project with billing enabled
  • GPU quota approved for your target region (request via GCP console if needed)
  • Terraform installed (>= 1.3)
  • Helm installed (>= 3.0)
  • gcloud CLI authenticated (gcloud auth application-default login)
  • kubectl installed

Step 1: Configure with Terraform

# main.tf
module "sie" {
  source  = "superlinked/sie/google"
  version = "~> 1.0"

  project_id = "your-gcp-project-id"
  region     = "us-central1"

  # GPU configuration
  gpus = ["a100-40gb", "l4-spot"]

  # Optional: use existing network
  network    = "default"
  subnetwork = "default"

  # Autoscaling bounds
  min_nodes = 1
  max_nodes = 6
}

output "sie_endpoint" {
  value = module.sie.endpoint
}

Then initialise and apply the configuration:

terraform init
terraform plan
terraform apply # ~12 minutes to provision GKE cluster + GPU nodes

Terraform creates a GKE cluster with GPU node pools, a Cloud Load Balancing endpoint, Workload Identity bindings, and autoscaling policies.


Step 2: Deploy the inference server

# Authenticate kubectl with your new GKE cluster
gcloud container clusters get-credentials sie-cluster --region us-central1

# Deploy SIE via Helm
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true

Step 3: Verify

kubectl get pods -n sie
# NAME                 READY   STATUS    RESTARTS   AGE
# sie-7d9f4b8c6-xk2p9  1/1     Running   0          2m
# sie-7d9f4b8c6-w8lmq  1/1     Running   0          2m

curl https://<your-sie-endpoint>/health
# {"status": "ok"}
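For scripted rollouts, the manual curl check can be wrapped in a small polling loop. The following is a minimal sketch using only the Python standard library. It assumes the /health endpoint returns the {"status": "ok"} body shown above; the function names, retry count, and delay are illustrative choices and not part of SIE.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_body(url: str, timeout: float = 5.0) -> str:
    """Return the response body of a GET request as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_healthy(body: str) -> bool:
    """True if the body is a JSON object with "status" equal to "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_s: float = 3.0,
    fetch: Callable[[str], str] = fetch_body,
) -> bool:
    """Poll `url` until it reports healthy or the attempts run out."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch(url)):
                return True
        except OSError:
            # Connection refused, DNS failure, timeout, or HTTP error:
            # treat the server as not ready yet and try again.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A deployment script could call wait_until_healthy("https://<your-sie-endpoint>/health") and abort if it returns False. The `fetch` parameter exists only so the loop can be exercised without a live endpoint.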

Step 4: Start encoding

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on GCP"),
        Item(text="keep embeddings within your Google Cloud project"),
    ],
)

print(f"Encoded {len(results)} docs, dim={len(results[0]['dense'])}")
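Once you have dense vectors, a typical next step is comparing them. Below is a small sketch of cosine similarity in plain Python. The toy vectors stand in for real embeddings, and the helper name is an illustrative choice rather than anything provided by the SIE SDK.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
doc = [1.0, 1.0, 0.0]
print(round(cosine_similarity(query, doc), 3))  # prints 0.5
```

With the output of the encoding step, the equivalent call would be cosine_similarity(results[0]["dense"], results[1]["dense"]), assuming each "dense" entry is a flat sequence of floats.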

GCP GPU selection guide

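| GPU       | Machine type          | Best for                          |
|-----------|-----------------------|-----------------------------------|
| A100 40GB | a2-highgpu-1g         | High-throughput, large models     |
| A100 80GB | a2-ultragpu-1g        | Very large models (7B+)           |
| L4        | g2-standard-4 (spot)  | Cost-efficient standard workloads |
| T4        | n1-standard-4 + T4    | Dev/staging, cost-sensitive       |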

GCP L4 spot VMs offer excellent price-performance for BGE-M3 and similar models. SIE’s cluster handles spot preemption gracefully, rerouting requests to healthy nodes.
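Even with server-side rerouting, a client may occasionally see a failed request while a spot node is being reclaimed. That is an assumption about client-visible behaviour, not something stated here about SIE. One common way to absorb such blips is a retry wrapper with exponential backoff. The helper below is hypothetical and not part of the SIE SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")
```

Usage would look like call_with_retry(lambda: client.encode("BAAI/bge-m3", items)). Catching every exception keeps the sketch short; in practice it is safer to retry only on the transport or timeout errors your client actually raises.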


Enabling GPU quota on GCP

GCP requires explicit quota approval for GPU instances. If you see quota errors during terraform apply:

  1. Go to IAM & Admin → Quotas in the GCP Console
  2. Search for NVIDIA_A100_GPUS or NVIDIA_L4_GPUS in your target region
  3. Click Edit Quotas, request the number you need (start with 4-8)
  4. Approval typically takes a few hours for small quotas
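As a rough sizing aid, the quota to request can be estimated from the autoscaling ceiling. This sketch assumes the max_nodes = 6 bound from Step 1 applies to a single GPU node pool, which the module's documentation should confirm, and uses the fact that a2-highgpu-1g and g2-standard-4 each carry one GPU. The function name is illustrative.

```python
# GPUs attached per VM for the single-GPU machine types in the selection guide.
GPUS_PER_NODE = {
    "a2-highgpu-1g": 1,  # one A100 40GB
    "g2-standard-4": 1,  # one L4
}


def gpu_quota_needed(machine_type: str, max_nodes: int) -> int:
    """GPUs required if the node pool scales out to `max_nodes`."""
    return GPUS_PER_NODE[machine_type] * max_nodes


print(gpu_quota_needed("g2-standard-4", max_nodes=6))  # prints 6
```

Under those assumptions, a request of 6 GPUs falls inside the suggested starting range of 4-8.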

Cost comparison: GCP self-hosted vs managed APIs

Encoding 100M documents with BGE-M3 (avg 256 tokens each):

| Option                        | Cost estimate |
|-------------------------------|---------------|
| OpenAI text-embedding-3-small | ~$1,300       |
| Voyage AI                     | ~$800         |
| SIE on GCP L4 spot VMs        | ~$25-40       |

The self-hosted cost varies by GPU utilisation and run time, but the savings are substantial at scale.
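To make the comparison easier to check, the workload size and the per-token rate implied by each estimate can be computed directly. The sketch below uses only the figures from the table above. The implied rates are derived from those estimates and are not taken from any provider's price list.

```python
DOCS = 100_000_000        # documents to encode
TOKENS_PER_DOC = 256      # average tokens per document

total_tokens = DOCS * TOKENS_PER_DOC  # 25.6 billion tokens


def implied_rate_per_million(total_cost_usd: float, tokens: int = total_tokens) -> float:
    """USD per one million tokens implied by a total cost estimate."""
    return total_cost_usd / (tokens / 1_000_000)


estimates_usd = {
    "OpenAI text-embedding-3-small": 1300,
    "Voyage AI": 800,
    "SIE on GCP L4 spot (low)": 25,
    "SIE on GCP L4 spot (high)": 40,
}

for name, cost in estimates_usd.items():
    print(f"{name}: ${implied_rate_per_million(cost):.4f} per 1M tokens")

# Ratio between the highest API estimate and the two ends of the SIE range.
print(f"{1300 / 40:.1f}x to {1300 / 25:.1f}x cheaper")
```

The last line works out to 32.5x at the high end of the SIE estimate and 52.0x at the low end, which is where the "up to 50×" figure comes from.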


Frequently asked questions

Does SIE support GKE Autopilot?

SIE works with both GKE Standard and GKE Autopilot. Autopilot simplifies node management but places more constraints on GPU configuration. Standard mode gives more control for GPU workloads.

Can I use Workload Identity with SIE?

Yes. The Terraform module configures Workload Identity by default, so SIE pods can access GCS buckets or Secret Manager without service account key files.

What regions have good GPU availability on GCP?

us-central1 and us-east1 typically have the best A100 and L4 availability. europe-west4 is the best European option. Check GCP’s GPU regions and zones documentation before choosing a region.

