naver-clova-ix/donut-base-finetuned-docvqa

Primitive: /extract · Extract · Encoder-Decoder

Donut model fine-tuned on DocVQA. It was introduced in the paper "OCR-free Document Understanding Transformer" by Geewook Kim et al. and first released in this repository.

Multimodal · Text regions

Overview

Hardware: — drives latency, throughput & cost

Size	110M params
Tasks	/extract
License	mit
Latency	6.9 s
Throughput	87 tok/s
Cost	$2.56 /1M tok

Cost is approximate: it is computed from list GPU prices, so your actual price depends on the provider you deploy SIE with.
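
As a rough illustration of how a per-token cost can follow from a GPU list price and a measured throughput, here is a minimal sketch. The formula (hourly price divided by tokens generated per hour) and the $0.80/hour price are assumptions made for illustration; the page does not state the exact calculation or the price it used.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Estimate the USD cost of generating one million tokens.

    Assumes a single GPU billed by the hour and running at a steady
    throughput; real deployments will differ (idle time, batching, discounts).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical list price of $0.80/hour combined with the 87 tok/s listed above.
estimate = cost_per_million_tokens(gpu_price_per_hour=0.80, tokens_per_second=87)
print(f"${estimate:.2f} per 1M tokens")
```

Under these assumptions the estimate lands close to the listed figure. Doubling the throughput or halving the hourly price halves the cost, which is why the hardware choice drives all three numbers.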

Extraction

Output kinds	text_regions
Inputs	image
Max sequence length	—

Benchmarks

DocVQA

general · kie · en

Visual question answering on document images

Corpus: 5,188 Queries: 5,188

Quality

anls 0.6350
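
For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) scores each predicted answer by its edit-distance similarity to the closest ground-truth answer and zeroes out scores for answers that are too far off. The sketch below follows the commonly used definition with a 0.5 threshold and case-insensitive matching; it is an illustration under those assumptions, not the evaluation code that produced the score above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best similarity to any reference answer.

    predictions: list of predicted answer strings.
    references:  list of lists of acceptable answers, one list per question.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One exact match (ignoring case) and one answer with a single wrong digit.
print(anls(["Invoice", "12345"], [["invoice"], ["12346", "n/a"]]))
```

The threshold keeps near-misses such as OCR-style character errors partially credited while giving no credit to answers that are mostly wrong.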

Performance L4-SPOT b1 c4

Performance L4 b1 c16
