---
title: Hardware & Capacity
description: GPU selection, memory planning, and capacity estimation for SIE deployments.
canonical_url: https://superlinked.com/docs/deployment/resources
last_updated: 2026-05-20
---

Choosing the right hardware impacts cost, latency, and throughput. This guide covers GPU selection, memory planning, and capacity estimation.

## GPU Selection Guide

Source: [deploy/terraform/gcp/infra/node_pools.tf](https://github.com/superlinked/sie/blob/main/deploy/terraform/gcp/infra/node_pools.tf)

SIE supports NVIDIA GPUs via CUDA. Choose based on your model size and throughput requirements.

### Recommended GPUs

| GPU | VRAM | Best For | GCP Machine Type |
|-----|------|----------|-----------------|
| NVIDIA L4 | 24 GB | Most embedding models, cost-effective inference | `g2-standard-8` (1x), `g2-standard-24` (2x) |
| NVIDIA A100 40GB | 40 GB | Large models, high throughput | `a2-highgpu-1g` |
| NVIDIA A100 80GB | 80 GB | Very large models (7B+), multi-model serving | `a2-ultragpu-1g` |
| NVIDIA H100 | 80 GB | Highest throughput, latest generation | `a3-highgpu-1g` |

### Budget Options

| GPU | VRAM | Best For | GCP Machine Type |
|-----|------|----------|-----------------|
| NVIDIA T4 | 16 GB | Small models, development, testing | `n1-standard-8` + T4 |

**L4 is recommended for most production workloads.** It offers the best price-performance ratio for embedding models under 4B parameters.
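As a rough illustration of VRAM-based selection, the sketch below picks the smallest GPU from the tables above that can hold a given memory requirement. The helper name `smallest_gpu_for` is hypothetical and not part of SIE. It considers VRAM only, not price or throughput, and it omits the H100 because its VRAM matches the A100 80GB.

```python
from typing import Optional

# VRAM capacities in GB, taken from the tables above, smallest first.
GPU_VRAM_GB = [
    ("NVIDIA T4", 16),
    ("NVIDIA L4", 24),
    ("NVIDIA A100 40GB", 40),
    ("NVIDIA A100 80GB", 80),
]


def smallest_gpu_for(required_vram_gb: float) -> Optional[str]:
    """Return the first listed GPU with enough VRAM, or None if none fits."""
    for name, vram_gb in GPU_VRAM_GB:
        if vram_gb >= required_vram_gb:
            return name
    return None


print(smallest_gpu_for(20))  # NVIDIA L4
```

A result of `None` means the requirement exceeds a single 80 GB GPU.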

## Memory Planning

Source: [packages/sie_server/src/sie_server/core/memory.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/memory.py)

GPU memory usage depends on model size, batch size, and sequence length.

### Model Size Categories

| Category | Parameters | Approximate VRAM | Example Models |
|----------|------------|------------------|----------------|
| Small | < 100M | 0.5-1 GB | all-MiniLM-L6-v2 |
| Medium | 100M-500M | 1-3 GB | bge-m3, e5-large-v2, multilingual-e5-large |
| Large | 500M-2B | 3-8 GB | gte-Qwen2-1.5B-instruct, stella_en_1.5B_v5 |
| XLarge | 2B-8B | 8-20 GB | Qwen3-Embedding-4B, e5-mistral-7b-instruct, NV-Embed-v2 |

### Batch Memory Overhead

Beyond model weights, inference requires memory for:

- **Activations**: Proportional to batch size and sequence length
- **KV cache**: For transformer attention (significant for long sequences)
- **CUDA context**: ~500MB-1GB fixed overhead per GPU

**Rule of thumb**: Reserve 2-3x the model weight size for safe operation with batching.
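The rule of thumb can be expressed as a small calculation. This is a minimal sketch; `vram_budget_range` is a hypothetical helper, and the result is a planning range rather than a measured value.

```python
from typing import Tuple


def vram_budget_range(weights_gb: float) -> Tuple[float, float]:
    """Return the (low, high) VRAM budget in GB using the 2-3x rule of thumb."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return (2.0 * weights_gb, 3.0 * weights_gb)


# A model with ~2 GB of weights, such as bge-m3 in this guide.
print(vram_budget_range(2.0))  # (4.0, 6.0)
```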

### Multi-Model Serving

SIE loads models on-demand and uses LRU eviction when memory pressure exceeds 85%:

```python
# From memory.py - default eviction threshold
pressure_threshold: float = 0.85  # Evict LRU model above 85%
```

For multi-model deployments, provision VRAM for:
- Your largest model (always loaded)
- 1-2 additional frequently-used models
- Headroom for batch processing

**Example**: Serving bge-m3 (~2 GB) and e5-mistral-7b (~15 GB) together needs roughly 17 GB for weights alone, so provision at least 24 GB of VRAM to leave room for batching.
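To make the eviction behavior concrete, here is a toy model of LRU eviction under a pressure threshold. It is not the code in `memory.py`: the `ModelCache` class, its methods, and the 4 GB `model-c` are hypothetical, and only the 0.85 default comes from the source above. It also assumes pressure is measured as loaded model size divided by total VRAM.

```python
from collections import OrderedDict
from typing import List


class ModelCache:
    """Toy LRU cache that evicts models once VRAM pressure passes a threshold."""

    def __init__(self, total_vram_gb: float, pressure_threshold: float = 0.85):
        self.total_vram_gb = total_vram_gb
        self.pressure_threshold = pressure_threshold
        self._models = OrderedDict()  # model name -> size in GB, oldest first

    def pressure(self) -> float:
        """Fraction of total VRAM occupied by loaded models."""
        return sum(self._models.values()) / self.total_vram_gb

    def use(self, name: str, size_gb: float) -> List[str]:
        """Load or touch a model and return the names evicted as a result."""
        self._models[name] = size_gb
        self._models.move_to_end(name)  # mark as most recently used
        evicted = []
        # Evict least recently used models, never the one just requested.
        while self.pressure() > self.pressure_threshold and len(self._models) > 1:
            victim, _ = self._models.popitem(last=False)
            evicted.append(victim)
        return evicted


cache = ModelCache(total_vram_gb=24.0)
cache.use("bge-m3", 2.0)
cache.use("e5-mistral-7b", 15.0)
print(cache.use("model-c", 4.0))  # ['bge-m3']
```

In this run, loading the third model pushes pressure to 21/24 (87.5%), so the least recently used model is evicted and pressure drops to 19/24.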

## Capacity Planning

Throughput varies by model architecture, sequence length, and hardware. Use these estimates as starting points.

### Throughput by Model Type (L4 GPU)

Based on actual measurements with 16 concurrent requests:

| Model Type | Example | Corpus Throughput | Query Throughput |
|------------|---------|-------------------|------------------|
| Small encoder | all-MiniLM-L6-v2 | ~50,000 tokens/sec | ~5,000 tokens/sec |
| Medium encoder | bge-m3 | ~30,000 tokens/sec | ~3,000 tokens/sec |
| Large LLM-based | Qwen3-Embedding-4B | ~5,000 tokens/sec | ~700 tokens/sec |
| XLarge LLM-based | e5-mistral-7b | ~3,000 tokens/sec | ~400 tokens/sec |

**Corpus vs Query**: Corpus encoding uses longer sequences (documents). Query encoding uses shorter sequences (search queries).
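The corpus figures translate directly into a rough indexing time. The sketch below assumes a hypothetical corpus of one billion tokens and a constant throughput on a single GPU; `encoding_hours` is an illustrative helper, not an SIE function.

```python
def encoding_hours(total_tokens: int, tokens_per_sec: float) -> float:
    """Estimate hours needed to encode a corpus at a steady throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return total_tokens / tokens_per_sec / 3600


# Hypothetical 1B-token corpus at the ~30,000 tokens/sec listed for bge-m3.
print(round(encoding_hours(1_000_000_000, 30_000), 1))  # 9.3
```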

### Scaling Estimates

For horizontal scaling, estimate required replicas:

```
replicas = ceil((target_throughput / single_gpu_throughput) * safety_factor)
```

Use a safety factor of 1.3-1.5 to account for traffic spikes and variance.

**Example**: To achieve 100,000 tokens/sec with bge-m3:
- Single L4 throughput: ~30,000 tokens/sec
- Replicas needed: (100,000 / 30,000) * 1.4 ≈ 4.67, rounded up to 5 replicas
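The same estimate as a short function, with the result rounded up to a whole number of replicas. `required_replicas` is a hypothetical helper name, and the default safety factor of 1.4 is simply the midpoint of the suggested range.

```python
import math


def required_replicas(
    target_tps: float, single_gpu_tps: float, safety_factor: float = 1.4
) -> int:
    """Estimate replicas needed to reach a target throughput in tokens/sec."""
    if single_gpu_tps <= 0:
        raise ValueError("single_gpu_tps must be positive")
    return math.ceil(target_tps / single_gpu_tps * safety_factor)


# bge-m3 on L4: 100,000 tokens/sec target, ~30,000 tokens/sec per GPU.
print(required_replicas(100_000, 30_000))  # 5
```

Measured throughput for your own model and sequence lengths should replace the table value before relying on the result.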

## Cost Optimization

### Spot/Preemptible Instances

Source: [deploy/terraform/gcp/infra/node_pools.tf](https://github.com/superlinked/sie/blob/main/deploy/terraform/gcp/infra/node_pools.tf)

The Terraform configuration supports spot instances for GPU node pools:

```hcl
# From infra/node_pools.tf
spot = each.value.spot  # Enable for 60-90% cost savings
```

**Recommended for**:
- Batch processing workloads
- Non-latency-critical embedding jobs
- Development and testing

**Not recommended for**:
- Low-latency serving with strict SLAs
- Single-replica deployments

### Scale-to-Zero

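For variable traffic, let idle deployments scale down to zero replicas. A standard Kubernetes HPA does not scale below one replica unless the alpha `HPAScaleToZero` feature gate is enabled, so combine it with:

- KEDA for event-driven scaling, including scaling to and from zero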
- GKE Autopilot for automatic node provisioning
- Preemptible node pools for cost savings during scale-up

**Cold start latency**: Model loading adds 10-60 seconds depending on model size. Consider keeping at least one warm replica for latency-sensitive workloads.

### Right-Sizing Checklist

1. **Start with L4** - Upgrade to A100 only if your models and batch headroom need more than 24 GB of VRAM
2. **Use spot instances** - Enable for batch workloads and non-critical paths
3. **Measure actual throughput** - Run performance evals before capacity planning
4. **Monitor memory pressure** - High eviction rates indicate undersized VRAM

## GCP GPU Quotas

Before deploying, request sufficient GPU quota in your target region.

### Checking Quotas

```bash
# List GPU quotas in a region
gcloud compute regions describe us-central1 \
  --format="table(quotas.filter(metric ~ GPU))"
```

### Common Quota Types

| Quota Name | GPU Type | Notes |
|------------|----------|-------|
| `NVIDIA_L4_GPUS` | L4 | Most available, recommended |
| `NVIDIA_A100_GPUS` | A100 40GB | Limited availability |
| `NVIDIA_A100_80GB_GPUS` | A100 80GB | Very limited |
| `NVIDIA_H100_GPUS` | H100 | Newest, limited availability |
| `NVIDIA_T4_GPUS` | T4 | Widely available |

### Requesting Quota Increases

1. Go to [IAM & Admin > Quotas](https://console.cloud.google.com/iam-admin/quotas)
2. Filter by service "Compute Engine API" and metric containing "GPU"
3. Select the quota and click "Edit Quotas"
4. Provide justification and submit

**Tip**: Request quota in multiple regions. GPU availability varies significantly by zone.

### Zone Availability

GPU availability varies by zone. Check before provisioning:

```bash
# List zones with L4 GPUs
gcloud compute accelerator-types list --filter="name=nvidia-l4"
```

## What's Next

- [Request Lifecycle](/docs/engine/) - How SIE processes requests through batching and inference
- [CLI Reference](/docs/reference/cli/) - Server configuration options including device selection
