google/siglip2-base-patch16-224

Primitive: /encode · Encode · SigLIP

SigLIP 2 extends SigLIP's pretraining objective by combining it with earlier, independently developed techniques in a single unified recipe, which improves semantic understanding, localization, and dense features.

Multimodal · Dense


Overview

Hardware: — drives latency, throughput & cost

Size	375M params
Tasks	/encode
License	apache-2.0
Latency	69 ms
Throughput	1.6K tok/s
Cost	$0.140 /1M tok

Cost is approximate: it is computed from list GPU prices, and your actual price depends on the provider you deploy SIE with.
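
To make the link between throughput and cost concrete, here is a minimal sketch. It assumes cost is simply the GPU's hourly list price divided by the number of tokens processed in an hour; the page does not state the exact formula, and both function names are hypothetical. The hourly price is backed out from the figures in the table above, not taken from any price list.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of processing one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


def implied_gpu_price_per_hour(cost_per_million: float, tokens_per_second: float) -> float:
    """Invert the formula: the hourly price implied by a listed cost and throughput."""
    return cost_per_million * tokens_per_second * 3600 / 1_000_000


# Figures from the overview table: 1.6K tok/s and $0.140 per 1M tokens.
hourly = implied_gpu_price_per_hour(0.140, 1600)
print(f"implied GPU price: ${hourly:.3f}/hour")
print(f"round trip: ${cost_per_million_tokens(hourly, 1600):.3f} per 1M tokens")
```

Under that assumption the listed figures imply a GPU price of roughly $0.81 per hour. A provider charging a different hourly rate scales the per-token cost proportionally.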

Embedding

Output types	Dense
Dimensions	dense: 768
Max sequence length	64
Inputs	text · image

Benchmarks

Flickr30kI2TRetrieval

general retrieval en

Image-to-text retrieval: retrieve captions from images

Corpus: 31,783 · Queries: 1,000
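
Image-to-text retrieval with dense embeddings can be sketched as follows: embed the query image, embed every caption, and rank captions by similarity. This is an illustrative sketch only. It assumes cosine similarity as the scoring function, uses toy 4-dimensional vectors in place of the model's 768-dimensional output, and the helper names `cosine` and `rank_captions` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs, k=10):
    """Return the k captions most similar to the image, best first."""
    scored = sorted(
        caption_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in scored[:k]]


# Toy stand-ins for embeddings; real vectors from this model have 768 dimensions.
image_vec = [0.9, 0.1, 0.0, 0.2]
caption_vecs = {
    "a dog running on grass": [0.8, 0.2, 0.1, 0.1],
    "a plate of pasta": [0.0, 0.9, 0.3, 0.0],
    "two dogs playing outside": [0.7, 0.0, 0.1, 0.4],
}
print(rank_captions(image_vec, caption_vecs, k=2))
```

With these toy vectors the two dog captions rank ahead of the pasta caption. At the scale of the full corpus, a vector index would normally replace the exhaustive sort.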

Quality

NDCG@10	0.8157
MAP@10	0.7255
MRR@10	0.9302
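
The three quality metrics above (NDCG, MAP, and MRR, each cut off at rank 10) can be computed per query as sketched below and then averaged over all queries. This sketch assumes binary relevance and one common normalisation for average precision; evaluation toolkits differ in these details, so it is not necessarily the exact procedure behind the reported scores. The function names are hypothetical.

```python
import math


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked, relevant, k=10):
    """Average precision at k, normalised by min(len(relevant), k)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this hit
    return total / min(len(relevant), k) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG at k with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# One toy query: captions c1 and c2 are relevant and appear at ranks 2 and 4.
ranked = ["c3", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2"}
print(mrr_at_k(ranked, relevant), ap_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

MRR only rewards the first relevant hit, which helps explain why it can sit well above MAP and NDCG when each image has several matching captions.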

Performance L4 b1 c8

Corpus	1.6K tok/s
Corpus p50	68.5 ms
Query	13.0 mpix/s
Query p50	99.0 ms
