Vision Tasks

SIE supports three families of vision extraction beyond OCR:

  • Image captioning (microsoft/Florence-2-base). Describe images with the <CAPTION> and <DETAILED_CAPTION> task tokens.
  • Object detection (IDEA-Research/grounding-dino-base, google/owlv2-base-patch16-ensemble). Return DetectedObject results with zero-shot labels and bounding boxes.
  • Visual QA (naver-clova-ix/donut-base-finetuned-docvqa, Florence-2 <DocVQA>). Answer natural-language questions about an image or document.

For converting document images or PDFs to Markdown (including the four dedicated OCR adapters and Florence-2’s <OCR> task), see OCR.

Pick by what you need to extract: a description of an image → image captioning; specific objects → detection; an answer to a question about an image → visual QA.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# image_bytes holds the raw JPEG bytes of the image to caption
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)
for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."

Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>. For Florence-2 OCR usage and code samples, see OCR.

| Task                 | Task Token                    | Output                             |
|----------------------|-------------------------------|------------------------------------|
| OCR                  | <OCR>                         | Extracted text                     |
| OCR with regions     | <OCR_WITH_REGION>             | Text with bounding boxes (default) |
| Caption              | <CAPTION>                     | Image description                  |
| Detailed caption     | <DETAILED_CAPTION>            | Extended description               |
| Object detection     | <OD>                          | Bounding boxes and labels          |
| Dense region caption | <DENSE_REGION_CAPTION>        | Region descriptions                |
| Phrase grounding     | <CAPTION_TO_PHRASE_GROUNDING> | Match labels to regions            |
| Document QA          | <DocVQA>                      | Answer to question                 |
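
The task table can be mirrored in a small lookup so that a mistyped task token fails before a request is sent. This is a minimal sketch: `FLORENCE_TASKS` and `florence_options` are hypothetical helper names, not part of sie_sdk. Only the task tokens and the `options={"task": ...}` shape come from this page.

```python
# Hypothetical helper, not part of sie_sdk: maps readable names to the
# Florence-2 task tokens listed in the table above.
FLORENCE_TASKS = {
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "document_qa": "<DocVQA>",
}


def florence_options(task: str = "ocr_with_region") -> dict:
    """Build the options dict for client.extract(), rejecting unknown task names."""
    try:
        return {"task": FLORENCE_TASKS[task]}
    except KeyError:
        known = ", ".join(sorted(FLORENCE_TASKS))
        raise ValueError(
            f"unknown Florence-2 task {task!r}; expected one of: {known}"
        ) from None
```

The result can then be passed straight through, for example `client.extract("microsoft/Florence-2-base", item, options=florence_options("detailed_caption"))`. The helper's default mirrors the documented default task, <OCR_WITH_REGION>.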

GroundingDINO and OWL-v2 models detect objects zero-shot: pass the class names to look for via the labels parameter. Results are returned as DetectedObject instances with bounding boxes.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# image_bytes holds the raw JPEG bytes of the image to search
result = client.extract(
    "IDEA-Research/grounding-dino-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    labels=["car", "person", "traffic light"],
)
for obj in result["objects"]:
    print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
# car: score=0.92, bbox=[120, 200, 450, 380]
# person: score=0.88, bbox=[50, 100, 150, 350]

| Field | Type  | Description                   |
|-------|-------|-------------------------------|
| label | str   | Object class                  |
| score | float | Confidence score (0-1)        |
| bbox  | list  | Bounding box [x1, y1, x2, y2] |
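
Detections usually need some post-processing before use. The sketch below filters by confidence and label and computes box dimensions, using only the three fields documented above. `filter_detections` and `bbox_size` are hypothetical helpers, not part of sie_sdk. The 0.5 default threshold is an arbitrary illustration, and the code assumes x2 >= x1 and y2 >= y1.

```python
def filter_detections(objects, min_score=0.5, labels=None):
    """Keep detections at or above min_score, optionally restricted to some labels."""
    wanted = set(labels) if labels is not None else None
    return [
        obj
        for obj in objects
        if obj["score"] >= min_score and (wanted is None or obj["label"] in wanted)
    ]


def bbox_size(bbox):
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


# Sample data copied from the example output above.
objects = [
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
]

confident = filter_detections(objects, min_score=0.9)
for obj in confident:
    width, height = bbox_size(obj["bbox"])
    print(f"{obj['label']}: {width}x{height}")
# car: 330x180
```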

For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):

# Reuses the client from the examples above; receipt_image holds raw JPEG bytes
result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?",
)
for entity in result["entities"]:
    print(entity["text"])
# "$42.50"
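
When several questions apply to the same document, the call above can be wrapped in a loop. The following is a minimal sketch: `ask_document` is a hypothetical helper, not part of sie_sdk. It assumes only what the example above shows, namely that `client.extract()` accepts `instruction=` and returns a mapping with an "entities" list whose items carry a "text" field.

```python
def ask_document(client, model, item, questions):
    """Ask each question about one document and collect the answer texts.

    Sends one extract() request per question and returns a dict mapping
    each question to the list of entity texts in that response.
    """
    answers = {}
    for question in questions:
        result = client.extract(model, item, instruction=question)
        answers[question] = [entity["text"] for entity in result["entities"]]
    return answers
```

With the client and Item from the example above, `ask_document(client, "naver-clova-ix/donut-base-finetuned-docvqa", item, ["What is the total amount?", "What is the date?"])` would issue two requests. Taking the client as a parameter keeps the helper easy to exercise against a stub.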

Donut models parse structured documents without OCR pre-processing:

  • naver-clova-ix/donut-base-finetuned-cord-v2 - Receipt parsing with key-value extraction (totals, line items, dates)
  • naver-clova-ix/donut-base-finetuned-rvlcdip - Document classification into document types (letter, invoice, memo, etc.)
  • naver-clova-ix/donut-base-finetuned-docvqa - Document question answering (ask natural language questions about a document image)

See the model catalog (Extract) for the complete list of vision and extraction models.
