
Vision Tasks

Florence-2, Donut, GroundingDINO, and OWL-v2 models extract structured data from images: captions, OCR text, detected objects, and document-understanding results. Object detection models return DetectedObject results with bounding boxes.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)
for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR>"},
)
for entity in result["entities"]:
    print(entity["text"])
# Extracted text from the document image

To get text with bounding box positions (the default task):

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR_WITH_REGION>"},
)
for entity in result["entities"]:
    print(f"{entity['text']} at {entity['bbox']}")

For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):

result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?",
)
for entity in result["entities"]:
    print(entity["text"])
# "$42.50"

GroundingDINO and OWL-v2 models detect objects matching arbitrary text labels (zero-shot detection). Results are returned as DetectedObject instances with bounding boxes.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "IDEA-Research/grounding-dino-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    labels=["car", "person", "traffic light"],
)
for obj in result["objects"]:
    print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
# car: score=0.92, bbox=[120, 200, 450, 380]
# person: score=0.88, bbox=[50, 100, 150, 350]
Field | Type  | Description
------|-------|------------
label | str   | Object class
score | float | Confidence score (0-1)
bbox  | list  | Bounding box [x1, y1, x2, y2]
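A common post-processing step is dropping low-confidence detections. A minimal sketch over the DetectedObject shape documented above (the threshold and sample detections are illustrative, not SDK output):

```python
def filter_detections(objects, min_score=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [obj for obj in objects if obj["score"] >= min_score]

# Illustrative sample in the DetectedObject shape documented above.
sample = [
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
    {"label": "bicycle", "score": 0.31, "bbox": [300, 220, 360, 300]},
]

kept = filter_detections(sample, min_score=0.5)
print([obj["label"] for obj in kept])
# ['car', 'person']
```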

Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>.

Task                 | Task Token                      | Output
---------------------|---------------------------------|-------------------------------------
OCR                  | `<OCR>`                         | Extracted text
OCR with regions     | `<OCR_WITH_REGION>`             | Text with bounding boxes (default)
Caption              | `<CAPTION>`                     | Image description
Detailed caption     | `<DETAILED_CAPTION>`            | Extended description
Object detection     | `<OD>`                          | Bounding boxes and labels
Dense region caption | `<DENSE_REGION_CAPTION>`        | Region descriptions
Phrase grounding     | `<CAPTION_TO_PHRASE_GROUNDING>` | Match labels to regions
Document QA          | `<DocVQA>`                      | Answer to question
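Different tasks produce different result shapes: caption, OCR, and QA tasks return text entities (with a bbox for region tasks), while detection tasks return objects. A small helper to print either shape uniformly; the sample result dicts are illustrative, patterned on the examples earlier in this page:

```python
def summarize(result):
    """Summarize an extract() result, whichever shape the task produced."""
    lines = []
    for entity in result.get("entities", []):
        # Region tasks (e.g. <OCR_WITH_REGION>) attach a bbox to each entity.
        if "bbox" in entity:
            lines.append(f"{entity['text']} at {entity['bbox']}")
        else:
            lines.append(entity["text"])
    for obj in result.get("objects", []):
        # Detection tasks return DetectedObject dicts (see the field table above).
        lines.append(f"{obj['label']} ({obj['score']:.2f}) at {obj['bbox']}")
    return lines

# Illustrative results in the shapes shown in the examples above.
caption_result = {"entities": [{"text": "A dog in a park."}]}
od_result = {"objects": [{"label": "dog", "score": 0.9, "bbox": [10, 20, 30, 40]}]}
print(summarize(caption_result))  # ['A dog in a park.']
print(summarize(od_result))       # ['dog (0.90) at [10, 20, 30, 40]']
```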

Donut models parse structured documents without OCR pre-processing:

  • naver-clova-ix/donut-base-finetuned-cord-v2 - Receipt parsing with key-value extraction (totals, line items, dates)
  • naver-clova-ix/donut-base-finetuned-rvlcdip - Document classification into document types (letter, invoice, memo, etc.)
  • naver-clova-ix/donut-base-finetuned-docvqa - Document question answering (ask natural language questions about a document image)
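Receipt parsers like cord-v2 emit nested key-value structures (line items inside lists, totals inside sub-objects). The exact schema the SDK returns is not specified here; assuming the parsed receipt arrives as a nested dict (illustrative), a helper to flatten it into dot-keyed pairs might look like:

```python
def flatten(parsed, prefix=""):
    """Flatten a nested key-value dict (e.g. a parsed receipt) into (path, value) pairs."""
    rows = []
    for key, value in parsed.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            rows.extend(flatten(value, prefix=path + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                rows.extend(flatten(item, prefix=f"{path}[{i}]."))
        else:
            rows.append((path, value))
    return rows

# Illustrative CORD-style receipt structure (not the SDK's exact schema).
receipt = {
    "menu": [{"nm": "Latte", "price": "4.50"}, {"nm": "Bagel", "price": "3.00"}],
    "total": {"total_price": "7.50"},
}
for path, value in flatten(receipt):
    print(f"{path} = {value}")
# menu[0].nm = Latte
# menu[0].price = 4.50
# menu[1].nm = Bagel
# menu[1].price = 3.00
# total.total_price = 7.50
```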
Model                                       | Tasks
--------------------------------------------|------------------------------------------------------
microsoft/Florence-2-base                   | Caption, OCR, detection
microsoft/Florence-2-large                  | Higher-quality Florence-2
IDEA-Research/grounding-dino-base           | Zero-shot object detection (returns DetectedObject)
google/owlv2-base-patch16-ensemble          | Zero-shot object detection (returns DetectedObject)
naver-clova-ix/donut-base-finetuned-docvqa  | Document question answering
naver-clova-ix/donut-base-finetuned-cord-v2 | Receipt parsing

See Full model catalog for the complete list.
