
nvidia/llama-nemoretriever-colembed-3b-v1

nvidia/llama-nemoretriever-colembed-3b-v1 is a late-interaction embedding model fine-tuned for query-document retrieval. It accepts two kinds of input: `queries`, which are text, and `documents`, which are page images.
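
To make "late interaction" concrete, here is a minimal sketch of how multi-vector embeddings are typically compared. It assumes ColBERT-style MaxSim scoring (each query token vector is matched to its most similar document vector, and the maxima are summed); this page does not spell out the scoring function, so treat that as an assumption. The function names `maxsim_score` and `rank_documents` are hypothetical, and the toy vectors are 2-dimensional for readability where the model emits 128-dimensional ones.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score (assumed MaxSim): for every query vector,
    take its best-matching document vector, then sum those maxima."""
    if not query_vecs or not doc_vecs:
        raise ValueError("both inputs need at least one vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices ordered from highest to lowest score."""
    scores = [maxsim_score(query_vecs, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy data: one query with two token vectors, two "pages" with patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.0]]

print(maxsim_score(query, page_a))
print(maxsim_score(query, page_b))
print(rank_documents(query, [page_a, page_b]))
```

Because document vectors do not depend on the query, they can be computed once at indexing time; only the cheap max-and-sum step runs per query.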

Overview

Architecture: llama_nemoretrievercolembed
Parameters: 4.4B
Tasks: Encode
Outputs: Multi-Vec
Dimensions: 128 (Multi-Vec)
Max Sequence Length: 8,192 tokens
License: other
Languages: multilingual

Benchmarks

Vidore3ComputerScienceRetrieval

technology retrieval en

Visual document retrieval on computer science papers and slides

Performance (L4 b1 c4)
Corpus: 0.6 img/s
Corpus p50: 6.2s
Query: 400 tok/s
Query p50: 184.7ms

Vidore3FinanceEnRetrieval

finance retrieval en

Visual document retrieval on financial reports

Performance (L4 b1 c4)
Corpus: 0.6 img/s
Corpus p50: 6.1s
Query: 502 tok/s
Query p50: 152.7ms

Vidore3HrRetrieval

general retrieval en

Visual document retrieval on HR-related documents

Quality
nDCG@10: 0.6513
MAP@10: 0.5053
MRR@10: 0.7844
Performance (L4 b1 c16)
Corpus: 0.9 img/s
Corpus p50: 17.9s
Query: 689 tok/s
Query p50: 740.7ms
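
For readers unfamiliar with the three quality metrics, the sketch below computes them for a single query under binary relevance. It is an illustration of the usual definitions, not the benchmark's evaluation code: the function names are hypothetical, average precision here divides by the total number of relevant documents (other conventions exist), and reported scores like the ones above are normally means over all queries.

```python
import math
from typing import List, Set


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document in the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision over the top k (assumed denominator: all relevant docs)."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Normalized discounted cumulative gain with binary gains."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query: relevant pages sit at ranks 2 and 4.
ranked_pages = ["p3", "p1", "p7", "p2"]
relevant_pages = {"p1", "p2"}

print(mrr_at_k(ranked_pages, relevant_pages))   # 0.5
print(ap_at_k(ranked_pages, relevant_pages))    # 0.5
print(round(ndcg_at_k(ranked_pages, relevant_pages), 4))
```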

Vidore3PharmaceuticalsRetrieval

medical retrieval en

Visual document retrieval on pharmaceutical documents

Performance (L4 b1 c4)
Corpus: 0.7 img/s
Corpus p50: 6.0s
Query: 420 tok/s
Query p50: 185.5ms
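
One way to sanity-check the corpus figures across the four benchmarks is Little's law, which relates throughput to the number of requests in flight divided by latency. The sketch below assumes that the "c" value in each performance label is the number of concurrent requests and that the p50 latency is close to the mean; neither is stated on this page, so both are assumptions. The helper `estimated_throughput` is a hypothetical name.

```python
def estimated_throughput(concurrency: int, p50_latency_s: float) -> float:
    """Little's law estimate: items per second = in-flight requests / latency."""
    return concurrency / p50_latency_s


# (benchmark, assumed concurrency, corpus p50 in seconds, reported img/s)
rows = [
    ("Vidore3ComputerScienceRetrieval", 4, 6.2, 0.6),
    ("Vidore3FinanceEnRetrieval", 4, 6.1, 0.6),
    ("Vidore3HrRetrieval", 16, 17.9, 0.9),
    ("Vidore3PharmaceuticalsRetrieval", 4, 6.0, 0.7),
]

for name, concurrency, p50, reported in rows:
    estimate = estimated_throughput(concurrency, p50)
    print(f"{name}: estimated {estimate:.2f} img/s, reported {reported} img/s")
```

Under those assumptions each estimate lands within 0.1 img/s of the reported value, which suggests the higher-concurrency run gains throughput at the cost of a much longer per-request latency.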
