---
title: How Do You Deploy an Embedding Model on GCP?
description: Deploying an embedding model on GCP with SIE requires configuring a GPU cluster using the SIE Terraform module for Google Cloud, deploying the inference server with Helm, and connecting via the SIE SDK. Your data stays within your GCP project, costs are up to 50× lower than managed API pricing, and the full deployme...
canonical_url: https://superlinked.com/glossary/how-to-deploy-embedding-model-on-gcp
last_updated: 2026-06-01
---

# How Do You Deploy an Embedding Model on GCP?

Deploying an embedding model on GCP with SIE requires configuring a GPU cluster using the SIE Terraform module for Google Cloud, deploying the inference server with Helm, and connecting via the SIE SDK. Your data stays within your GCP project, costs are up to 50× lower than managed API pricing, and the full deployment takes under 30 minutes.

---

## Why deploy on GCP?

Google Cloud is a strong choice for self-hosted embedding inference when:

- Your application stack already runs on GCP (GKE, Cloud Run, BigQuery)
- You need to keep data in a specific region for compliance (GDPR, data residency)
- You want to use the NVIDIA A100 or L4 GPUs that Google Cloud offers
- You're optimising costs with GCP committed use discounts or spot VMs

SIE's GCP Terraform module provisions everything needed: a GKE cluster, GPU node pools, autoscaling, and networking, all within your own GCP project.

---

## Prerequisites

- A GCP project with billing enabled
- GPU quota approved for your target region (request via GCP console if needed)
- Terraform installed (`>= 1.3`)
- Helm installed (`>= 3.0`)
- `gcloud` CLI authenticated (`gcloud auth application-default login`)
- `kubectl` installed

---

## Step 1: Configure with Terraform

```hcl
# main.tf
module "sie" {
  source  = "superlinked/sie/google"
  version = "~> 1.0"

  project_id = "your-gcp-project-id"
  region     = "us-central1"

  # GPU configuration
  gpus = ["a100-40gb", "l4-spot"]

  # Optional: use existing network
  network    = "default"
  subnetwork = "default"

  # Autoscaling bounds
  min_nodes = 1
  max_nodes = 6
}

output "sie_endpoint" {
  value = module.sie.endpoint
}
```

```bash
terraform init
terraform plan
terraform apply   # ~12 minutes to provision GKE cluster + GPU nodes
```

Terraform creates a GKE cluster with GPU node pools, a Cloud Load Balancer endpoint, Workload Identity bindings, and autoscaling policies.
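
If you want to pick up the endpoint programmatically rather than copying it by hand, the `sie_endpoint` output declared above can be read from `terraform output -json`. The following is a minimal sketch; the helper names `parse_endpoint` and `read_endpoint` are illustrative and not part of SIE or Terraform.

```python
import json
import subprocess


def parse_endpoint(output_json: str, name: str = "sie_endpoint") -> str:
    """Extract one output value from the text printed by `terraform output -json`."""
    outputs = json.loads(output_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found; has `terraform apply` finished?")
    # Each output is an object with "value", "type", and "sensitive" keys.
    return outputs[name]["value"]


def read_endpoint(workdir: str = ".") -> str:
    """Run Terraform in `workdir` and return the SIE endpoint (hypothetical helper)."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_endpoint(result.stdout)
```

Calling `read_endpoint()` from the directory that holds `main.tf` should return the same value that `terraform apply` prints for `sie_endpoint`.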

---

## Step 2: Deploy the inference server

```bash
# Authenticate kubectl with your new GKE cluster
gcloud container clusters get-credentials sie-cluster --region us-central1

# Deploy SIE via Helm
helm install sie oci://ghcr.io/superlinked/charts/sie-cluster \
  --namespace sie \
  --create-namespace \
  --set replicaCount=2 \
  --set autoscaling.enabled=true
```

---

## Step 3: Verify

```bash
kubectl get pods -n sie
# NAME                    READY   STATUS    RESTARTS   AGE
# sie-7d9f4b8c6-xk2p9    1/1     Running   0          2m
# sie-7d9f4b8c6-w8lmq    1/1     Running   0          2m

curl https://<your-sie-endpoint>/health
# {"status": "ok"}
```
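
Pods can take a few minutes to pull images and load models, so a single `curl` may fail right after the install. The sketch below polls the same `/health` endpoint until it returns `{"status": "ok"}`. The function names, the attempt count, and the delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], dict] = fetch_health,
    attempts: int = 30,
    delay_seconds: float = 10.0,
) -> bool:
    """Poll until the server reports {"status": "ok"} or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

For example, `wait_until_healthy("https://<your-sie-endpoint>/health")` returns `True` once the server is ready and `False` if it never becomes ready within the allotted attempts.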

---

## Step 4: Start encoding

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("https://<your-sie-endpoint>")

results = client.encode(
    "BAAI/bge-m3",
    [
        Item(text="self-hosted inference on GCP"),
        Item(text="keep embeddings within your Google Cloud project"),
    ],
)

print(f"Encoded {len(results)} docs, dim={len(results[0]['dense'])}")
```
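
For larger corpora you will probably want to send documents in batches rather than in one call. The sketch below wraps any encode function with the same shape as `client.encode` above. It assumes the call returns one result per item in input order, which the example above suggests but which you should confirm against the SDK documentation. The batch size of 64 is an arbitrary starting point, not a recommended value.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    encode: Callable[[str, list], list],
    model: str,
    items: Sequence[T],
    batch_size: int = 64,
) -> List:
    """Call `encode(model, batch)` per batch and concatenate the results in order."""
    results: List = []
    for batch in batched(items, batch_size):
        results.extend(encode(model, list(batch)))
    return results
```

With the client from the example above, usage would look like `encode_in_batches(client.encode, "BAAI/bge-m3", items)`.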

---

## GCP GPU selection guide

| GPU | Machine type | Best for |
|---|---|---|
| A100 40GB | a2-highgpu-1g | High-throughput, large models |
| A100 80GB | a2-ultragpu-1g | Very large models (7B+) |
| L4 | g2-standard-4 (spot) | Cost-efficient standard workloads |
| T4 | n1-standard-4 + T4 | Dev/staging, cost-sensitive |

GCP L4 spot VMs offer excellent price-performance for BGE-M3 and similar models. SIE's cluster handles spot preemption gracefully, rerouting requests to healthy nodes.
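
Even with server-side rerouting, a request that is in flight when a spot node disappears may still fail on the client. A small retry wrapper is a common complement. This is a generic sketch, not SIE's own behaviour: the function name is hypothetical, and the exception types the SIE SDK actually raises are an assumption you should adjust.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the wait each time and add jitter to avoid synchronised retries.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise ValueError("attempts must be at least 1")
```

A call such as `call_with_retry(lambda: client.encode("BAAI/bge-m3", items))` would then retry up to three times before giving up.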

---

## Enabling GPU quota on GCP

GCP requires explicit quota approval for GPU instances. If you see quota errors during `terraform apply`:

1. Go to **IAM & Admin → Quotas** in the GCP Console
2. Search for `NVIDIA_A100_GPUS` or `NVIDIA_L4_GPUS` in your target region
3. Click **Edit Quotas**, request the number you need (start with 4-8)
4. Approval typically takes a few hours for small quotas
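
You can also check remaining quota before running `terraform apply`. The sketch below assumes you have saved the output of `gcloud compute regions describe us-central1 --format=json`, which includes a `quotas` list whose entries carry `metric`, `limit`, and `usage` fields. The function name is illustrative.

```python
import json


def gpu_quota_headroom(region_json: str, metric: str) -> float:
    """Return limit minus usage for one quota metric in a region description."""
    region = json.loads(region_json)
    for quota in region.get("quotas", []):
        if quota["metric"] == metric:
            return quota["limit"] - quota["usage"]
    raise KeyError(f"Quota metric {metric!r} not listed for this region")
```

For example, after `gcloud compute regions describe us-central1 --format=json > region.json`, calling `gpu_quota_headroom(open("region.json").read(), "NVIDIA_L4_GPUS")` shows how many more L4 GPUs the project can allocate in that region.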

---

## Cost comparison: GCP self-hosted vs managed APIs

Encoding 100M documents with BGE-M3 (avg 256 tokens each):

| Option | Cost estimate |
|---|---|
| OpenAI text-embedding-3-small | ~$1,300 |
| Voyage AI | ~$800 |
| SIE on GCP L4 spot | ~$25-40 |

The self-hosted cost varies by GPU utilisation and run time, but the savings are substantial at scale.
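
To compare the options on a common basis, it helps to turn each estimate into an implied price per million tokens. The sketch below uses only the figures from the table above. The per-token rates it derives are arithmetic consequences of those estimates, not quoted vendor prices.

```python
DOCS = 100_000_000
TOKENS_PER_DOC = 256
TOTAL_TOKENS = DOCS * TOKENS_PER_DOC  # 25.6 billion tokens


def implied_price_per_million(total_cost_usd: float, tokens: int = TOTAL_TOKENS) -> float:
    """Convert a total cost estimate into USD per one million tokens."""
    return total_cost_usd / (tokens / 1_000_000)


def savings_ratio(managed_cost_usd: float, self_hosted_cost_usd: float) -> float:
    """How many times cheaper the self-hosted estimate is than the managed one."""
    return managed_cost_usd / self_hosted_cost_usd


# Cost estimates copied from the table above (USD).
estimates = {
    "OpenAI text-embedding-3-small": 1300,
    "Voyage AI": 800,
    "SIE on GCP L4 spot (low)": 25,
    "SIE on GCP L4 spot (high)": 40,
}

if __name__ == "__main__":
    for name, cost in estimates.items():
        print(f"{name}: ${implied_price_per_million(cost):.4f} per 1M tokens")
    print(f"Best case vs OpenAI: {savings_ratio(1300, 25):.1f}x cheaper")
    print(f"Worst case vs Voyage AI: {savings_ratio(800, 40):.1f}x cheaper")
```

On these figures the self-hosted option ranges from 20× cheaper (high estimate against Voyage AI) to 52× cheaper (low estimate against OpenAI).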

---

## Frequently asked questions

**Does SIE support GKE Autopilot?**
SIE works with both GKE Standard and GKE Autopilot. Autopilot simplifies node management but has more constraints on GPU configuration. Standard mode gives more control for GPU workloads.

**Can I use Workload Identity with SIE?**
Yes. The Terraform module configures Workload Identity by default, so SIE pods can access GCS buckets or Secret Manager without service account key files.

**What regions have good GPU availability on GCP?**
`us-central1` and `us-east1` typically have the best A100 and L4 availability. `europe-west4` is the best European option. Check GCP's GPU regions and zones documentation before choosing a region, since not every zone offers every GPU type.

---

## Related resources

- [SIE deployment documentation](/docs/deployment)
- [How do you deploy on AWS?](/glossary/how-to-deploy-embedding-model-on-aws)
- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
- [SIE vs TEI vs OpenAI benchmark](/docs/examples/benchmark)
- [What is GPU utilisation in inference?](/glossary/what-is-gpu-utilisation-in-inference)