# Vision Tasks
Florence-2, Donut, GroundingDINO, and OWL-v2 models extract structured data from images: captions, OCR text, detected objects, and document-understanding results. Object detection models return `DetectedObject` results with bounding boxes.
## Image Captioning

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)

for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."
```

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [imageBytes] }, // Uint8Array of JPEG/PNG data
  { options: { task: "<CAPTION>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();
```
## OCR (Text from Images)

```python
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR>"},
)

for entity in result["entities"]:
    print(entity["text"])
# Extracted text from the document image
```

```typescript
const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] }, // Uint8Array of PNG data
  { options: { task: "<OCR>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}
```
## OCR with Regions

To get text with bounding box positions (the default task):
```python
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR_WITH_REGION>"},
)

for entity in result["entities"]:
    print(f"{entity['text']} at {entity['bbox']}")
```

```typescript
const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] },
  { options: { task: "<OCR_WITH_REGION>" } }
);

for (const entity of result.entities) {
  console.log(`${entity.text} at ${JSON.stringify(entity.bbox)}`);
}
```
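`<OCR_WITH_REGION>` entities are returned in model order, which is not always reading order. A minimal sketch of re-sorting them top-to-bottom, then left-to-right, assuming `bbox` is `[x1, y1, x2, y2]` in pixels (the `reading_order` helper and its line tolerance are illustrative, not part of the SDK):

```python
def reading_order(entities, line_tolerance=10):
    """Sort OCR entities top-to-bottom, then left-to-right.

    Entities whose top edges (y1) fall within the same
    `line_tolerance`-pixel band are treated as one line
    and ordered by their left edge (x1).
    """
    def key(entity):
        x1, y1, _, _ = entity["bbox"]
        return (y1 // line_tolerance, x1)
    return sorted(entities, key=key)

entities = [
    {"text": "World", "bbox": [200, 12, 300, 40]},
    {"text": "Hello", "bbox": [10, 10, 120, 40]},
    {"text": "Footer", "bbox": [10, 500, 200, 530]},
]
print(" ".join(e["text"] for e in reading_order(entities)))
# Hello World Footer
```

The banding trick is crude but works for cleanly scanned documents; skewed pages need proper line clustering.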
## Document Understanding

For Donut models, the question is passed via the `instruction` parameter (free text appended to the task prompt):

```python
result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?",
)

for entity in result["entities"]:
    print(entity["text"])
# "$42.50"
```

```typescript
const result = await client.extract(
  "naver-clova-ix/donut-base-finetuned-docvqa",
  { images: [receiptImage] },
  { instruction: "What is the total amount?" }
);

for (const entity of result.entities) {
  console.log(entity.text);
}
```
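DocVQA answers come back as plain strings such as `"$42.50"`, while downstream code usually wants a number. A small illustrative sketch (not part of the SDK) for pulling a monetary amount out of an answer string:

```python
import re
from decimal import Decimal

def parse_amount(text: str) -> Decimal:
    """Extract the first monetary amount from an answer like "$42.50"."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No amount found in {text!r}")
    # Strip thousands separators before converting.
    return Decimal(match.group().replace(",", ""))

print(parse_amount("$42.50"))                # 42.50
print(parse_amount("Total: 1,299.00 USD"))   # 1299.00
```

`Decimal` avoids the float rounding issues that matter for money; locales that use comma decimals would need a different pattern.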
Section titled “Object Detection”GroundingDINO and OWL-v2 models detect objects in images with zero-shot label support. Results are returned as DetectedObject instances with bounding boxes.
```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "IDEA-Research/grounding-dino-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    labels=["car", "person", "traffic light"],
)

for obj in result["objects"]:
    print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
# car: score=0.92, bbox=[120, 200, 450, 380]
# person: score=0.88, bbox=[50, 100, 150, 350]
```

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "IDEA-Research/grounding-dino-base",
  { images: [imageBytes] },
  { labels: ["car", "person", "traffic light"] }
);

for (const obj of result.objects) {
  console.log(`${obj.label}: score=${obj.score.toFixed(2)}, bbox=${JSON.stringify(obj.bbox)}`);
}
// car: score=0.92, bbox=[120,200,450,380]
// person: score=0.88, bbox=[50,100,150,350]

await client.close();
```
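Zero-shot detectors often return low-confidence hits alongside the good ones. A minimal sketch of post-filtering the `objects` list by score and label (the `filter_detections` helper is illustrative, not part of the SDK):

```python
def filter_detections(objects, min_score=0.5, labels=None):
    """Keep detections at or above min_score, optionally restricted to labels."""
    keep = []
    for obj in objects:
        if obj["score"] < min_score:
            continue
        if labels is not None and obj["label"] not in labels:
            continue
        keep.append(obj)
    return keep

objects = [
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
    {"label": "traffic light", "score": 0.31, "bbox": [300, 20, 330, 90]},
]
print([o["label"] for o in filter_detections(objects, min_score=0.5)])
# ['car', 'person']
```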
## DetectedObject Fields

| Field | Type | Description |
|---|---|---|
| `label` | `str` | Object class |
| `score` | `float` | Confidence score (0-1) |
| `bbox` | `list` | Bounding box `[x1, y1, x2, y2]` |
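Because `bbox` uses the `[x1, y1, x2, y2]` corner convention, standard box math applies directly. A sketch of intersection-over-union, useful for spotting duplicate detections of the same object (illustrative helper, not part of the SDK):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # ≈ 0.333
```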
## Florence-2 Task Prompts

Florence-2 tasks are selected via `options={"task": "<TASK_TOKEN>"}`. The default task is `<OCR_WITH_REGION>`.
| Task | Task Token | Output |
|---|---|---|
| OCR | `<OCR>` | Extracted text |
| OCR with regions | `<OCR_WITH_REGION>` | Text with bounding boxes (default) |
| Caption | `<CAPTION>` | Image description |
| Detailed caption | `<DETAILED_CAPTION>` | Extended description |
| Object detection | `<OD>` | Bounding boxes and labels |
| Dense region caption | `<DENSE_REGION_CAPTION>` | Region descriptions |
| Phrase grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | Match labels to regions |
| Document QA | `<DocVQA>` | Answer to question |
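The task tokens are easy to mistype, and a typo silently selects the wrong behavior. One way to guard against that is a small lookup from friendly names to tokens; the `FLORENCE2_TASKS` dict and `task_options` helper below are a hypothetical convenience, not part of the SDK (the token values come from the table above):

```python
FLORENCE2_TASKS = {
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "document_qa": "<DocVQA>",
}

def task_options(name: str) -> dict:
    """Build the options payload for a Florence-2 task, raising on typos."""
    try:
        return {"task": FLORENCE2_TASKS[name]}
    except KeyError:
        raise ValueError(f"Unknown Florence-2 task: {name!r}") from None

print(task_options("caption"))
# {'task': '<CAPTION>'}
```

The result plugs straight into `options=` in `client.extract(...)`.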
## Donut Models

Donut models parse structured documents without OCR pre-processing:

- `naver-clova-ix/donut-base-finetuned-cord-v2` - receipt parsing with key-value extraction (totals, line items, dates)
- `naver-clova-ix/donut-base-finetuned-rvlcdip` - document classification into document types (letter, invoice, memo, etc.)
- `naver-clova-ix/donut-base-finetuned-docvqa` - document question answering (ask natural-language questions about a document image)
## Vision Models

| Model | Tasks |
|---|---|
| `microsoft/Florence-2-base` | Caption, OCR, detection |
| `microsoft/Florence-2-large` | Higher-quality Florence-2 |
| `IDEA-Research/grounding-dino-base` | Zero-shot object detection (returns `DetectedObject`) |
| `google/owlv2-base-patch16-ensemble` | Zero-shot object detection (returns `DetectedObject`) |
| `naver-clova-ix/donut-base-finetuned-docvqa` | Document question answering |
| `naver-clova-ix/donut-base-finetuned-cord-v2` | Receipt parsing |
See Full model catalog for the complete list.
## What’s Next

- NER & Entity Extraction - named entity recognition
- Relations & Classification - relation extraction and text classification
- Full model catalog - all supported models