What is GPU Utilisation in Inference?

GPU utilisation in inference refers to the percentage of a GPU’s compute capacity actively used when serving model predictions. High GPU utilisation (70-95%) means you’re getting close to the maximum throughput per dollar the hardware can deliver. Low utilisation (under 30%) means you’re paying for idle hardware. Efficient batching, model loading strategy, and request scheduling are the primary levers for improving GPU utilisation in embedding and reranking workloads.

Why does GPU utilisation matter?

GPU instances are the primary cost driver in self-hosted inference. An A100 GPU running at 20% utilisation costs the same per hour as one running at 90% but, if throughput scales with utilisation, delivers about 4.5× less throughput per dollar spent.
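The arithmetic behind that 4.5× figure can be sketched in a few lines. The hourly price and peak throughput below are made-up placeholders, not quoted prices or measurements, and the sketch assumes throughput scales linearly with utilisation.

```python
def docs_per_dollar(peak_docs_per_sec: float, utilisation: float, price_per_hour: float) -> float:
    """Documents processed per dollar, assuming throughput scales linearly with utilisation."""
    docs_per_hour = peak_docs_per_sec * utilisation * 3600
    return docs_per_hour / price_per_hour


# Hypothetical values, for illustration only.
PEAK_DOCS_PER_SEC = 3000.0
PRICE_PER_HOUR = 3.0

low = docs_per_dollar(PEAK_DOCS_PER_SEC, 0.20, PRICE_PER_HOUR)
high = docs_per_dollar(PEAK_DOCS_PER_SEC, 0.90, PRICE_PER_HOUR)
ratio = high / low

print(f"20% utilisation: {low:,.0f} docs per dollar")
print(f"90% utilisation: {high:,.0f} docs per dollar")
print(f"ratio: {ratio:.1f}x")
```

The ratio depends only on the two utilisation levels, so it holds whatever price and peak throughput you plug in.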

For embedding workloads, the gap between well-optimised and naive deployments is often 10-20× in effective throughput. This directly affects:

Cost: lower utilisation = more GPUs needed for the same throughput
Latency: poorly batched requests waste cycles between jobs
Model ROI: expensive hardware sits idle instead of earning its keep

Why is GPU utilisation low by default?

GPUs are designed for massive parallelism. They shine when processing thousands of operations simultaneously. Naive inference deployments fail to exploit this because:

Single-request processing: handling one encoding request at a time leaves most of the GPU idle between requests. The GPU runs the computation in milliseconds, then waits for the next request.

Small batch sizes: a single sentence encodes in about 5ms on an A100, while a batch of 128 sentences takes ~15ms. That is roughly 43× the throughput (128 sentences per 15ms versus 1 sentence per 5ms) for only 3× the latency.

One model per GPU: dedicating a GPU instance to each model wastes capacity when several small models could share the same GPU memory.
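Using the timings quoted above for small batch sizes (5ms for one sentence, ~15ms for a batch of 128), the throughput gain can be worked out directly. This is a back-of-the-envelope sketch that assumes batches run back to back with no gaps between them.

```python
def throughput(batch_size: int, batch_latency_ms: float) -> float:
    """Sentences encoded per second when batches run back to back."""
    return batch_size / (batch_latency_ms / 1000.0)


single = throughput(1, 5.0)       # one sentence per 5 ms
batched = throughput(128, 15.0)   # 128 sentences per 15 ms

speedup = batched / single
latency_factor = 15.0 / 5.0

print(f"single:  {single:,.0f} sentences/sec")
print(f"batched: {batched:,.0f} sentences/sec")
print(f"throughput gain: {speedup:.1f}x for {latency_factor:.0f}x the latency")
```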

How does batching improve GPU utilisation?

Batching is the most impactful optimisation. Instead of processing one request at a time, batch multiple inputs together:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

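# Placeholder corpus; substitute your own documents
documents = ["first document", "second document"]

# Inefficient: one request per document (1000 requests for 1000 documents)
for doc in documents:
    vector = client.encode("BAAI/bge-m3", Item(text=doc))["dense"]

# Efficient: one batched request covering every document
results = client.encode("BAAI/bge-m3", [Item(text=d) for d in documents])
vectors = [r["dense"] for r in results]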

SIE implements dynamic batching: it collects incoming requests over a short window (e.g. 10-50ms) and processes them together, even if they come from different API callers. This improves GPU utilisation without requiring the caller to batch requests manually.
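To make the idea concrete, here is a minimal sketch of a dynamic batcher built on asyncio. It is not SIE's implementation: the `DynamicBatcher` class, its parameters, and the `fake_encode` stand-in are all hypothetical names introduced for illustration. It only shows the general pattern of waiting for a first request, filling a batch until a time window closes or the batch is full, then running everything together.

```python
import asyncio
from typing import Callable, List


class DynamicBatcher:
    """Collects concurrent requests for up to max_wait_s, then runs them as one batch."""

    def __init__(self, encode_batch: Callable, max_batch_size: int = 32, max_wait_s: float = 0.02):
        self.encode_batch = encode_batch      # callable: list of texts -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []      # recorded so batch efficiency can be inspected

    async def submit(self, text: str):
        """Called once per request; returns when the batch containing it has finished."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self):
        """Worker loop: wait for a first item, then fill the batch until the window closes."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.encode_batch([text for text, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), result in zip(batch, results):
                future.set_result(result)


def fake_encode(texts: List[str]) -> List[int]:
    """Stand-in for a model call: returns text lengths instead of vectors."""
    return [len(t) for t in texts]


async def demo():
    batcher = DynamicBatcher(fake_encode, max_batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Five independent callers submit at the same time.
    results = await asyncio.gather(*(batcher.submit("doc " * i) for i in range(1, 6)))
    worker.cancel()
    return results, batcher.batch_sizes
```

Calling `asyncio.run(demo())` submits five requests concurrently and returns each caller's own result, even though the stand-in encoder sees the inputs batched. A production batcher would also need error handling (for example, propagating an encoder exception to every waiting future) and backpressure, which this sketch omits.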

Key GPU utilisation metrics

Metric	What it measures	Target
GPU utilisation (%)	Fraction of SM time actively computing	70-95% during load
GPU memory utilisation (%)	Fraction of VRAM in use	60-90%
Throughput (docs/sec)	Documents encoded per second	Model and GPU dependent
Latency P99 (ms)	99th percentile request latency	<100ms for real-time
Batch efficiency	Average batch size processed	Closer to max_batch_size

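Monitor these with NVIDIA’s nvidia-smi, Prometheus GPU metrics, or SIE’s built-in monitoring dashboard. Note that the utilisation figure nvidia-smi reports is the share of the sampling period in which at least one kernel was running, so a GPU can show high utilisation while its SMs are only partly occupied.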
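As a rough sketch of the nvidia-smi route, the snippet below polls compute and memory utilisation through nvidia-smi's CSV query mode and parses the result. The helper names and the sample line are made up for illustration; `read_gpu_stats` needs an NVIDIA driver on the host, so the parsing step is kept separate and can be tried on plain text.

```python
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> List[Dict[str, float]]:
    """Parse 'util, mem_used, mem_total' CSV rows (one row per GPU, no header, no units)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total = (float(field) for field in line.split(","))
        stats.append({
            "gpu_util_pct": util,
            "mem_util_pct": 100.0 * mem_used / mem_total,
        })
    return stats


def read_gpu_stats() -> List[Dict[str, float]]:
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver on the host)."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(output)


# Made-up sample row: 87% compute utilisation, 30000 of 40960 MiB in use.
sample_stats = parse_gpu_stats("87, 30000, 40960\n")
print(sample_stats)
```

Polling `read_gpu_stats()` every few seconds during a load test gives a quick view of whether a deployment sits in the target ranges from the table above.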

How SIE optimises GPU utilisation

SIE is specifically engineered for high GPU utilisation in embedding and reranking workloads:

Dynamic batching: accumulates concurrent requests and processes them together automatically.

Multi-model GPU sharing: multiple models (e.g. BGE-M3 + a reranker) share the same GPU memory pool, eliminating idle GPUs for secondary models.

Async request queuing: incoming requests queue efficiently so the GPU is never starved between batches.

LoRA adapter hot-loading: swap between domain adapters without reloading the base model, maintaining GPU occupancy.

Quantisation support: INT8 and INT4 quantisation reduces model memory footprint, allowing larger batch sizes on the same GPU.
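The memory effect of quantisation can be estimated from parameter count and bytes per weight. The sketch below uses an illustrative 7-billion-parameter model and counts weights only. Real quantised formats also store scales and other metadata, and activations and framework overhead come on top, so treat the numbers as lower bounds.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); ignores activations and overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


NUM_PARAMS = 7e9  # illustrative 7B-parameter model, not a measured figure

footprints = {p: weight_memory_gb(NUM_PARAMS, p) for p in BYTES_PER_PARAM}
for precision, gigabytes in footprints.items():
    print(f"{precision}: ~{gigabytes:.1f} GB of weights")
```

Whatever memory the weights no longer occupy becomes available for larger batches or for additional models on the same GPU.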

GPU utilisation vs throughput vs latency trade-offs

There is an inherent tension between these three:

Maximum throughput: large batches, high utilisation, but higher latency per request
Minimum latency: process requests immediately (small or no batching), but lower throughput and utilisation
Production sweet spot: dynamic batching with a latency budget (e.g. max 50ms batch wait) balances all three

For offline indexing jobs (encoding millions of documents), maximise throughput with large batches. For real-time search queries, bound batch wait time to meet latency SLAs.
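One way to reason about the batch wait budget is a simple steady-state estimate: the first request opens the window, and later arrivals fill it until the wait expires or the batch is full. The functions and traffic numbers below are hypothetical and assume evenly spaced arrivals, so they give a rough guide rather than a prediction.

```python
def expected_batch_size(requests_per_sec: float, max_wait_ms: float, max_batch_size: int) -> float:
    """Rough estimate: one request opens the window, arrivals during the wait join it."""
    arrivals_in_window = requests_per_sec * max_wait_ms / 1000.0
    return min(float(max_batch_size), 1.0 + arrivals_in_window)


def worst_case_latency_ms(max_wait_ms: float, batch_compute_ms: float) -> float:
    """Latency for the request that opened the window: full wait plus batch compute time."""
    return max_wait_ms + batch_compute_ms


# Hypothetical traffic: 400 requests/sec, batches capped at 64, ~15 ms compute per batch.
estimates = []
for wait_ms in (10.0, 50.0):
    size = expected_batch_size(400.0, wait_ms, 64)
    latency = worst_case_latency_ms(wait_ms, 15.0)
    estimates.append((wait_ms, size, latency))
    print(f"wait {wait_ms:.0f} ms -> ~{size:.0f} per batch, worst case ~{latency:.0f} ms")
```

A longer wait yields larger batches and higher utilisation at the price of added latency, which is why real-time paths cap the wait while offline jobs do not need to.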

Practical GPU sizing for common workloads

Workload	Recommended GPU	Expected throughput
BGE-M3, real-time queries	L4	~500-800 docs/sec
BGE-M3, batch indexing	A100 40GB	~2,000-4,000 docs/sec
BGE-M3 + reranker, production	A100 40GB	~800-1,500 combined
E5-mistral-7B (instruct)	A100 80GB	~150-300 docs/sec

Throughput varies significantly with document length and batch size. Run the SIE benchmark example on your specific hardware and document distribution to get accurate numbers.
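Once you have a measured throughput, sizing is simple arithmetic. The sketch below uses the A100 batch-indexing range from the table as example inputs; the corpus sizes and deadline are hypothetical, and the calculation assumes throughput scales linearly across GPUs.

```python
import math


def indexing_hours(num_docs: int, docs_per_sec: float, num_gpus: int = 1) -> float:
    """Hours needed to encode num_docs, assuming linear scaling across GPUs."""
    return num_docs / (docs_per_sec * num_gpus) / 3600.0


def gpus_needed(num_docs: int, docs_per_sec: float, deadline_hours: float) -> int:
    """Smallest number of GPUs that finishes within the deadline."""
    docs_per_gpu = docs_per_sec * deadline_hours * 3600.0
    return math.ceil(num_docs / docs_per_gpu)


NUM_DOCS = 10_000_000  # hypothetical corpus size

slow = indexing_hours(NUM_DOCS, 2000.0)
fast = indexing_hours(NUM_DOCS, 4000.0)
print(f"10M docs on one GPU: {fast:.2f} to {slow:.2f} hours")
print(f"GPUs for 100M docs in 4 hours at 2,000 docs/sec: {gpus_needed(100_000_000, 2000.0, 4.0)}")
```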

Frequently asked questions

How do I monitor GPU utilisation in my SIE deployment? SIE exposes Prometheus metrics including GPU utilisation, memory usage, and throughput per model. Pair with Grafana for dashboards. The /metrics endpoint is available on the SIE server.
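For a quick look without a full Prometheus setup, the endpoint can be scraped directly. The sketch below parses the Prometheus text exposition format. The metric names in the sample are hypothetical, not SIE's actual metric names, and the default URL assumes /metrics is served on the same host and port as the API, which you should check against your deployment.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Skips comment lines; assumes samples carry no trailing timestamp.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics


def fetch_metrics(url: str = "http://localhost:8080/metrics") -> Dict[str, float]:
    """Fetch and parse a /metrics endpoint (URL is an assumption; adjust to your server)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical metric names, for illustration only.
SAMPLE = """\
# HELP example_gpu_utilization_percent GPU compute utilisation
# TYPE example_gpu_utilization_percent gauge
example_gpu_utilization_percent{gpu="0"} 83
example_requests_total{model="BAAI/bge-m3"} 1200
"""

parsed = parse_metrics(SAMPLE)
print(parsed)
```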

What is the difference between GPU utilisation and GPU memory utilisation? GPU utilisation measures compute activity (are the CUDA cores busy?). GPU memory utilisation measures VRAM usage (how full is the memory?). Both matter: a model might fit in memory but run at low compute utilisation due to poor batching.

Should I use multiple small GPUs or one large GPU? For a single large model (e.g. E5-mistral-7B), one large GPU (A100 80GB) is often better than multiple small ones. For multiple smaller models, GPU sharing on a single large GPU or multi-GPU clusters both work, and SIE handles either configuration.