---
title: What is GPU Utilisation in Inference?
description: "GPU utilisation in inference refers to the percentage of a GPU's compute capacity actively used when serving model predictions. High GPU utilisation (70-95%) means you're getting maximum throughput per dollar. Low utilisation (under 30%) means you're paying for idle hardware. Efficient batching, model loading strate..."
canonical_url: https://superlinked.com/glossary/what-is-gpu-utilisation-in-inference
last_updated: 2026-06-01
---

# What is GPU Utilisation in Inference?

GPU utilisation in inference refers to the percentage of a GPU's compute capacity actively used when serving model predictions. High GPU utilisation (70-95%) means you're getting maximum throughput per dollar. Low utilisation (under 30%) means you're paying for idle hardware. Efficient batching, model loading strategy, and request scheduling are the primary levers for improving GPU utilisation in embedding and reranking workloads.

---

## Why does GPU utilisation matter?

GPU instances are the primary cost driver in self-hosted inference. An A100 GPU running at 20% utilisation costs the same per hour as one running at 90%, but delivers roughly 4.5× less throughput per dollar (90% ÷ 20%), assuming throughput scales with utilisation.
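
The arithmetic behind that ratio can be sketched as follows. This is a rough illustration only: the hourly price and peak throughput are hypothetical placeholders, and throughput is assumed to scale linearly with utilisation.

```python
def cost_per_million_docs(hourly_cost_usd: float,
                          peak_docs_per_sec: float,
                          utilisation: float) -> float:
    """Cost in USD to encode one million documents at a given utilisation (0-1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Assumption: effective throughput scales linearly with utilisation.
    effective_docs_per_hour = peak_docs_per_sec * utilisation * 3600
    return hourly_cost_usd / effective_docs_per_hour * 1_000_000


# Hypothetical inputs: $3.00/hour instance, 4,000 docs/sec at full utilisation.
low = cost_per_million_docs(3.0, 4000, 0.20)
high = cost_per_million_docs(3.0, 4000, 0.90)
ratio = low / high  # how much more each document costs at 20% than at 90%
print(f"20%: ${low:.2f}/M docs, 90%: ${high:.2f}/M docs, ratio: {ratio:.1f}x")
```

The ratio depends only on the two utilisation figures, so it holds whatever the actual instance price is.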

For embedding workloads, the gap between well-optimised and naive deployments is often 10-20× in effective throughput. This directly affects:

- **Cost**: lower utilisation = more GPUs needed for the same throughput
- **Latency**: poorly batched requests waste cycles between jobs
- **Model ROI**: expensive hardware sits idle instead of earning its keep

---

## Why is GPU utilisation low by default?

GPUs are designed for massive parallelism. They shine when processing thousands of operations simultaneously. Naive inference deployments fail to exploit this because:

**Single-request processing**: handling one encoding request at a time leaves most of the GPU idle between requests. The GPU runs the computation in milliseconds, then waits for the next request.

**Small batch sizes**: a single sentence encodes in about 5ms on an A100, while a batch of 128 sentences takes ~15ms. That is roughly 43× the throughput (128 sentences per 15ms versus 1 per 5ms) for only 3× the latency.

**One model per GPU**: dedicating a GPU instance to each model wastes capacity when several small models could share the same GPU memory.
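
As a quick check on the batching figures quoted above (about 5ms for one sentence, ~15ms for a batch of 128), the throughput and latency ratios can be worked out directly. The timings are the illustrative ones from this page, not measurements.

```python
def throughput_per_sec(batch_size: int, batch_latency_ms: float) -> float:
    """Items encoded per second when batches run back to back."""
    return batch_size / (batch_latency_ms / 1000.0)


single = throughput_per_sec(1, 5.0)      # one sentence per 5 ms
batched = throughput_per_sec(128, 15.0)  # 128 sentences per 15 ms

speedup = batched / single   # throughput gain from batching
latency_ratio = 15.0 / 5.0   # how much longer each request waits
print(f"{single:.0f} vs {batched:.0f} docs/sec, "
      f"{speedup:.1f}x throughput for {latency_ratio:.0f}x latency")
```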

---

## How does batching improve GPU utilisation?

Batching is the most impactful optimisation. Instead of processing one request at a time, batch multiple inputs together:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

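# Placeholder corpus; replace with your own documents (e.g. 1000 texts)
documents = ["first document", "second document", "third document"]

# Inefficient: one request per document (1000 documents = 1000 requests)
vectors = []
for doc in documents:
    vectors.append(client.encode("BAAI/bge-m3", Item(text=doc))["dense"])

# Efficient: one batched request covering every document
results = client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])
vectors = [r["dense"] for r in results]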
```

SIE implements **dynamic batching**: it collects incoming requests over a short window (e.g. 10-50ms) and processes them together, even if they come from different API callers. This improves GPU utilisation without requiring the caller to batch requests manually.
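
To make the idea concrete, here is a minimal sketch of a dynamic batcher built on `asyncio`. It is not SIE's implementation: `DynamicBatcher`, `fake_encode`, and the parameter names are hypothetical, and the encoder is a stand-in that returns the text length instead of a real embedding.

```python
import asyncio
from typing import Callable, List


class DynamicBatcher:
    """Groups concurrent requests into one batch per collection window."""

    def __init__(self, encode_batch: Callable[[List[str]], List[list]],
                 max_batch_size: int = 32, max_wait_s: float = 0.01):
        self.encode_batch = encode_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection

    async def submit(self, text: str) -> list:
        """Called once per request; resolves when its batch has been encoded."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        """Worker loop: wait for a first request, then collect more for one window."""
        while True:
            batch = [await self.queue.get()]
            await asyncio.sleep(self.max_wait_s)  # collection window
            while len(batch) < self.max_batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            outputs = self.encode_batch([text for text, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)


def fake_encode(texts: List[str]) -> List[list]:
    # Stand-in for a real model call: one 1-d "vector" per text.
    return [[float(len(t))] for t in texts]


async def demo():
    batcher = DynamicBatcher(fake_encode, max_batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Five independent callers arrive at roughly the same time.
    results = await asyncio.gather(*(batcher.submit(f"doc {i}") for i in range(5)))
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results, batcher.batch_sizes
```

Running `asyncio.run(demo())` shows the five separate `submit` calls being served by a single batch. For simplicity this sketch always waits the full window after the first request arrives; a production batcher would typically dispatch early once the batch is full.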

---

## Key GPU utilisation metrics

| Metric | What it measures | Target |
|---|---|---|
| GPU utilisation (%) | Fraction of time the GPU is executing at least one kernel | 70-95% during load |
| GPU memory utilisation (%) | Fraction of VRAM in use | 60-90% |
| Throughput (docs/sec) | Documents encoded per second | Model and GPU dependent |
| Latency P99 (ms) | 99th percentile request latency | <100ms for real-time |
| Batch efficiency | Average batch size processed | Close to `max_batch_size` |

Monitor these with NVIDIA's `nvidia-smi`, Prometheus GPU metrics, or SIE's built-in monitoring dashboard.
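
As a small example of the `nvidia-smi` route, the sketch below polls compute and memory utilisation from Python. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`; the function names and output keys are illustrative choices, and `sample_gpus` only works on a machine with an NVIDIA driver installed.

```python
import subprocess
from typing import Dict, List

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_nvidia_smi_csv(output: str) -> List[Dict[str, float]]:
    """Parse 'util, used_mib, total_mib' rows, one row per GPU."""
    rows = []
    for line in output.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        rows.append({
            "gpu_util_pct": util,                  # % of time a kernel was running
            "mem_util_pct": 100.0 * used / total,  # % of VRAM in use
        })
    return rows


def sample_gpus() -> List[Dict[str, float]]:
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_nvidia_smi_csv(result.stdout)


# Parsing a captured sample, so the example runs without a GPU present.
sample = "45, 20480, 40960\n90, 10240, 40960\n"
parsed = parse_nvidia_smi_csv(sample)
print(parsed)
```

Note that `utilization.gpu` reports the share of the sampling period during which any kernel was running, so a GPU can show a high value while its compute units are only partly occupied. Batch-level metrics are a useful complement.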

---

## How SIE optimises GPU utilisation

SIE is specifically engineered for high GPU utilisation in embedding and reranking workloads:

**Dynamic batching**: accumulates concurrent requests and processes them together automatically.

**Multi-model GPU sharing**: multiple models (e.g. BGE-M3 + a reranker) share the same GPU memory pool, eliminating idle GPUs for secondary models.

**Async request queuing**: incoming requests queue efficiently so the GPU is never starved between batches.

**LoRA adapter hot-loading**: swap between domain adapters without reloading the base model, maintaining GPU occupancy.

**Quantisation support**: INT8 and INT4 quantisation reduces model memory footprint, allowing larger batch sizes on the same GPU.
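
To illustrate why quantisation frees memory for larger batches, the following back-of-the-envelope sketch estimates weight memory at different precisions. It assumes a model with roughly 7 billion parameters and counts weights only; activations, quantisation scales, and framework overhead are ignored, so real footprints will be somewhat higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params_7b = 7e9  # assumption: a ~7B-parameter model
footprints = {dtype: weight_memory_gb(params_7b, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in footprints.items():
    print(f"{dtype}: ~{gb:.1f} GB of weights")
```

Whatever memory the weights no longer occupy is available for activations, which is what allows larger batches on the same GPU.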

---

## GPU utilisation vs throughput vs latency trade-offs

There is an inherent tension between these three:

- **Maximum throughput**: large batches, high utilisation, but higher latency per request
- **Minimum latency**: process requests immediately (small or no batching), but lower throughput and utilisation
- **Production sweet spot**: dynamic batching with a latency budget (e.g. max 50ms batch wait) balances all three

For offline indexing jobs (encoding millions of documents), maximise throughput with large batches. For real-time search queries, bound batch wait time to meet latency SLAs.
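
A simple way to reason about the batch-wait budget is to estimate how many requests arrive during the window. The sketch below uses a deliberately crude model: arrivals are assumed to be evenly spaced, and batch compute time is treated as fixed regardless of batch size. The function and parameter names are hypothetical.

```python
from typing import Tuple


def plan_batch_window(arrival_rate_per_sec: float, max_wait_ms: float,
                      max_batch_size: int, batch_compute_ms: float) -> Tuple[float, float]:
    """Return (expected batch size, worst-case request latency in ms)."""
    # The request that opens the window, plus those arriving while it is open.
    expected = min(max_batch_size, 1 + arrival_rate_per_sec * max_wait_ms / 1000.0)
    # The first request waits the whole window, then for the batch to compute.
    worst_case_latency_ms = max_wait_ms + batch_compute_ms
    return expected, worst_case_latency_ms


busy = plan_batch_window(1000, 50, 128, 15)   # heavy traffic
quiet = plan_batch_window(20, 50, 128, 15)    # light traffic
print(busy, quiet)
```

Under light traffic the same 50ms window adds latency without producing much of a batch, which is one reason to tune the wait per workload instead of using a single global value.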

---

## Practical GPU sizing for common workloads

| Workload | Recommended GPU | Expected throughput |
|---|---|---|
| BGE-M3, real-time queries | L4 | ~500-800 docs/sec |
| BGE-M3, batch indexing | A100 40GB | ~2,000-4,000 docs/sec |
| BGE-M3 + reranker, production | A100 40GB | ~800-1,500 combined |
| E5-mistral-7B (instruct) | A100 80GB | ~150-300 docs/sec |

Throughput varies significantly with document length and batch size. Run the [SIE benchmark example](/docs/examples/benchmark) on your specific hardware and document distribution to get accurate numbers.
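
Once you have a measured per-GPU throughput, a rough GPU count follows from simple division. The sketch below uses the batch-indexing range from the table above as example inputs. The 10,000 docs/sec target and the 70% headroom factor are hypothetical planning assumptions, not recommendations.

```python
import math


def gpus_needed(target_docs_per_sec: float, per_gpu_docs_per_sec: float,
                headroom: float = 0.7) -> int:
    """GPUs required if each one is planned at `headroom` of its measured throughput."""
    usable_per_gpu = per_gpu_docs_per_sec * headroom
    return math.ceil(target_docs_per_sec / usable_per_gpu)


# Example: 10,000 docs/sec target, using the low and high ends of the
# ~2,000-4,000 docs/sec range quoted for BGE-M3 batch indexing on an A100 40GB.
pessimistic = gpus_needed(10_000, 2_000)
optimistic = gpus_needed(10_000, 4_000)
print(f"Plan for between {optimistic} and {pessimistic} GPUs")
```

The spread between the two estimates shows why benchmarking on your own document distribution matters before committing to hardware.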

---

## Frequently asked questions

**How do I monitor GPU utilisation in my SIE deployment?**
SIE exposes Prometheus metrics including GPU utilisation, memory usage, and throughput per model. Pair with Grafana for dashboards. The `/metrics` endpoint is available on the SIE server.

**What is the difference between GPU utilisation and GPU memory utilisation?**
GPU utilisation measures compute activity (are the CUDA cores busy?). GPU memory utilisation measures VRAM usage (how full is the memory?). Both matter: a model might fit in memory but run at low compute utilisation due to poor batching.

**Should I use multiple small GPUs or one large GPU?**
For a single large model (e.g. E5-mistral-7B), one large GPU (A100 80GB) is often better than multiple small ones. For multiple smaller models, GPU sharing on a single large GPU or multi-GPU clusters both work, and SIE handles either configuration.

---

## Related resources

- [SIE vs TEI vs OpenAI benchmark](/docs/examples/benchmark)
- [How to deploy on AWS](/glossary/how-to-deploy-embedding-model-on-aws)
- [How to deploy on GCP](/glossary/how-to-deploy-embedding-model-on-gcp)
- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
- [SIE deployment documentation](/docs/deployment)
