
How Do You Deploy an Embedding Model on GCP?

Deploying an embedding model on GCP with SIE involves provisioning a GPU cluster with the SIE Terraform module for Google Cloud, deploying the inference server with Helm, and connecting via the SIE SDK. Your data stays within your GCP project, costs can be up to 50× lower than managed API pricing (see the cost comparison below), and the full deployment takes under 30 minutes.

Why deploy on GCP?

Google Cloud is a strong choice for self-hosted embedding inference when:

Your application stack already runs on GCP (GKE, Cloud Run, BigQuery)
You need to keep data in a specific region for compliance (GDPR, data residency)
You want to use NVIDIA A100 or L4 GPUs on Google Cloud
You’re optimising costs with GCP committed use discounts or spot VMs

SIE’s GCP Terraform module provisions everything needed: a GKE cluster, GPU node pools, autoscaling, and networking, all within your own GCP project.

Prerequisites

A GCP project with billing enabled
GPU quota approved for your target region (request via GCP console if needed)
Terraform installed (>= 1.3)
Helm installed (>= 3.8, the first release with stable support for oci:// chart references)
gcloud CLI authenticated (gcloud auth login for gcloud commands, gcloud auth application-default login for Terraform)
kubectl installed

Step 1: Configure with Terraform

# main.tf
module "sie" {
  source  = "superlinked/sie/google"
  version = "~> 1.0"

  project_id = "your-gcp-project-id"
  region     = "us-central1"

  # GPU configuration
  gpus = ["a100-40gb", "l4-spot"]

  # Optional: use existing network
  network    = "default"
  subnetwork = "default"

  # Autoscaling bounds
  min_nodes = 1
  max_nodes = 6
}

output "sie_endpoint" {
  value = module.sie.endpoint
}

terraform init
terraform plan
terraform apply   # ~12 minutes to provision GKE cluster + GPU nodes

Terraform creates a GKE cluster with GPU node pools, a Cloud Load Balancer endpoint, Workload Identity bindings, and autoscaling policies.
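Once the apply finishes, the sie_endpoint output declared in main.tf can be read from a script instead of being copied by hand. The following is a minimal sketch, assuming Terraform is on your PATH and the script runs from the directory containing main.tf; the helper names parse_endpoint and read_endpoint are illustrative and not part of SIE.

```python
import json
import subprocess


def parse_endpoint(terraform_output_json: str, name: str = "sie_endpoint") -> str:
    """Extract one output value from the text printed by `terraform output -json`."""
    outputs = json.loads(terraform_output_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found; has `terraform apply` finished?")
    # Each output is an object with "sensitive", "type" and "value" keys.
    return outputs[name]["value"]


def read_endpoint(workdir: str = ".") -> str:
    """Run `terraform output -json` in workdir and return the SIE endpoint."""
    completed = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,           # raise if Terraform exits with an error
        capture_output=True,
        text=True,
    )
    return parse_endpoint(completed.stdout)
```

Calling read_endpoint() after terraform apply returns the value you would otherwise paste into the curl and SDK calls in the later steps.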

Step 2: Deploy the inference server

# Authenticate kubectl with your new GKE cluster
gcloud container clusters get-credentials sie-cluster --region us-central1

# Deploy SIE via Helm
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true

Step 3: Verify

kubectl get pods -n sie
# NAME                    READY   STATUS    RESTARTS   AGE
# sie-7d9f4b8c6-xk2p9    1/1     Running   0          2m
# sie-7d9f4b8c6-w8lmq    1/1     Running   0          2m

curl https://<your-sie-endpoint>/health
# {"status": "ok"}
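A freshly provisioned load balancer can take a few minutes to start answering, so a one-off curl may fail even when the deployment is fine. Below is a hedged sketch of a polling check using only the Python standard library. It assumes the /health route and the {"status": "ok"} body shown above; the function names, attempt count, and delay are illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str, timeout: float = 5.0) -> str:
    """GET <base_url>/health and return the response body as text."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_healthy(body: str) -> bool:
    """True when the body is a JSON object whose "status" field is "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_until_healthy(fetch: Callable[[], str], attempts: int = 30, delay: float = 10.0) -> bool:
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib's URLError/HTTPError.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be wait_until_healthy(lambda: fetch_health("https://<your-sie-endpoint>")), which returns False if the endpoint never reports ok within the allotted attempts.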

Step 4: Start encoding

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on GCP"),
        Item(text="keep embeddings within your Google Cloud project"),
    ],
)

print(f"Encoded {len(results)} docs, dim={len(results[0]['dense'])}")
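
For corpora larger than a handful of documents, you will probably want to send texts in fixed-size batches rather than in one call. The sketch below shows one way to do that. It is deliberately independent of the SDK: encode_batch is a hypothetical callable you supply, and the default batch size of 64 is an arbitrary starting point, not an SIE recommendation.

```python
from typing import Callable, Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def encode_in_batches(
    encode_batch: Callable[[List[str]], list],
    texts: Sequence[str],
    batch_size: int = 64,
) -> list:
    """Encode `texts` batch by batch and return the results in input order."""
    results: list = []
    for batch in batched(texts, batch_size):
        results.extend(encode_batch(batch))
    return results
```

With the client from this step, encode_batch could be lambda batch: client.encode("BAAI/bge-m3", [Item(text=t) for t in batch]), assuming encode returns one result per input item in the same order.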

GCP GPU selection guide

GPU	Machine type	Best for
A100 40GB	a2-highgpu-1g	High-throughput, large models
A100 80GB	a2-ultragpu-1g	Very large models (7B+)
L4	g2-standard-4 (spot)	Cost-efficient standard workloads
T4	n1-standard-4 + T4	Dev/staging, cost-sensitive

GCP L4 spot VMs offer excellent price-performance for BGE-M3 and similar models. SIE's cluster handles spot preemption gracefully, rerouting requests to healthy nodes.
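
If a request does fail while a preempted node is being replaced, a client-side retry is one way to cover the gap. The following is a minimal sketch of exponential backoff with jitter. It assumes transient failures surface as ConnectionError or TimeoutError, which may not match what the SIE SDK actually raises, so adjust the exception types to your client. The name call_with_retry and its defaults are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying transient network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay cap doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not all retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

Wrapping an encode call then looks like call_with_retry(lambda: client.encode(...)). The sleep parameter exists only so the delay can be stubbed out in tests.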

Enabling GPU quota on GCP

GCP requires explicit quota approval for GPU instances. If you see quota errors during terraform apply:

Go to IAM & Admin → Quotas in the GCP Console
Search for NVIDIA_A100_GPUS or NVIDIA_L4_GPUS in your target region
Click Edit Quotas, request the number you need (start with 4-8)
Approval typically takes a few hours for small quotas

Cost comparison: GCP self-hosted vs managed APIs

Encoding 100M documents with BGE-M3 (avg 256 tokens each):

Option	Cost estimate
OpenAI text-embedding-3-small	~$1,300
Voyage AI	~$800
SIE on GCP L4 spot VMs	~$25-40

The self-hosted cost varies by GPU utilisation and run time, but the savings are substantial at scale.
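
To see how the table relates to the "up to 50×" figure, the arithmetic can be worked through directly. The sketch below uses only the numbers stated above (100M documents, 256 tokens each, and the listed cost estimates). It derives the per-million-token price each managed estimate implies rather than quoting any provider's price list; the function names are illustrative.

```python
DOCS = 100_000_000
AVG_TOKENS_PER_DOC = 256

# Total workload size: 25.6 billion tokens.
total_tokens = DOCS * AVG_TOKENS_PER_DOC

# Cost estimates in USD, taken from the table above.
MANAGED_COSTS = {"OpenAI text-embedding-3-small": 1300, "Voyage AI": 800}
SELF_HOSTED_LOW, SELF_HOSTED_HIGH = 25, 40


def implied_price_per_million(cost_usd: float, tokens: int) -> float:
    """USD per 1M tokens implied by a total cost estimate."""
    return cost_usd / (tokens / 1_000_000)


def savings_range(managed_cost: float, low: float, high: float) -> tuple:
    """(worst-case, best-case) cost ratio of managed vs self-hosted."""
    return (managed_cost / high, managed_cost / low)


for name, cost in MANAGED_COSTS.items():
    price = implied_price_per_million(cost, total_tokens)
    worst, best = savings_range(cost, SELF_HOSTED_LOW, SELF_HOSTED_HIGH)
    print(f"{name}: ${price:.4f} per 1M tokens implied, {worst:.1f}x to {best:.1f}x cheaper self-hosted")
```

On these estimates the self-hosted option comes out roughly 32× to 52× cheaper than the OpenAI row and 20× to 32× cheaper than the Voyage AI row, which is where the headline figure of about 50× comes from.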

Frequently asked questions

Does SIE support GKE Autopilot? SIE works with both GKE Standard and GKE Autopilot. Autopilot simplifies node management but has more constraints on GPU configuration. Standard mode gives more control for GPU workloads.

Can I use Workload Identity with SIE? Yes. The Terraform module configures Workload Identity by default, so SIE pods can access GCS buckets or Secret Manager without service account key files.

What regions have good GPU availability on GCP? us-central1 and us-east1 typically have the best A100 and L4 availability. europe-west4 is the best European option. Check GCP's GPU regions and zones documentation before choosing a region, since not every zone offers every GPU type.