Vision Tasks
SIE supports three families of vision extraction beyond OCR:
- Image captioning (microsoft/Florence-2-base). Describe images with the <CAPTION> and <DETAILED_CAPTION> task tokens.
- Object detection (IDEA-Research/grounding-dino-base, google/owlv2-base-patch16-ensemble). Return DetectedObject results with zero-shot labels and bounding boxes.
- Visual QA (naver-clova-ix/donut-base-finetuned-docvqa, Florence-2 <DocVQA>). Answer natural-language questions about an image or document.
For converting document images or PDFs to Markdown (including the four dedicated OCR adapters and Florence-2’s <OCR> task), see OCR.
Pick by what you need to extract: a description of an image → image captioning; specific objects → detection; an answer to a question about an image → visual QA.
Image Captioning
Python:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    # image_bytes: raw JPEG bytes loaded by the caller
    result = client.extract(
        "microsoft/Florence-2-base",
        Item(images=[{"data": image_bytes, "format": "jpeg"}]),
        options={"task": "<CAPTION>"},
    )

    for entity in result["entities"]:
        print(entity["text"])
    # "A golden retriever playing fetch in a park on a sunny day."

TypeScript:

    import { SIEClient } from "@superlinked/sie-sdk";

    const client = new SIEClient("http://localhost:8080");

    const result = await client.extract(
      "microsoft/Florence-2-base",
      { images: [imageBytes] }, // Uint8Array of JPEG/PNG data
      { options: { task: "<CAPTION>" } }
    );

    for (const entity of result.entities) {
      console.log(entity.text);
    }

    await client.close();

Florence-2 Task Prompts
Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>. For Florence-2 OCR usage and code samples, see OCR.
| Task | Task Token | Output |
|---|---|---|
| OCR | <OCR> | Extracted text |
| OCR with regions | <OCR_WITH_REGION> | Text with bounding boxes (default) |
| Caption | <CAPTION> | Image description |
| Detailed caption | <DETAILED_CAPTION> | Extended description |
| Object detection | <OD> | Bounding boxes and labels |
| Dense region caption | <DENSE_REGION_CAPTION> | Region descriptions |
| Phrase grounding | <CAPTION_TO_PHRASE_GROUNDING> | Match labels to regions |
| Document QA | <DocVQA> | Answer to question |
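
The table above can be wrapped in a small lookup so that calling code refers to tasks by name instead of by raw token. This is only a sketch: `FLORENCE_TASKS` and `florence_options` are hypothetical helper names, not part of the SIE SDK. The only things taken from this page are the token strings and the `{"task": ...}` options shape.

```python
# Hypothetical helper (not part of the SIE SDK): maps readable task names
# to the Florence-2 task tokens listed in the table above.
FLORENCE_TASKS = {
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "document_qa": "<DocVQA>",
}


def florence_options(task_name: str) -> dict:
    """Build the options dict for a named Florence-2 task."""
    try:
        token = FLORENCE_TASKS[task_name]
    except KeyError:
        known = ", ".join(sorted(FLORENCE_TASKS))
        raise ValueError(
            f"unknown task {task_name!r}; expected one of: {known}"
        ) from None
    return {"task": token}


print(florence_options("detailed_caption"))
```

The returned dict would then be passed as the `options` argument to `client.extract`, as in the captioning example. Failing early on an unknown name keeps a typo from being sent to the server as a task token.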
Object Detection
GroundingDINO and OWL-v2 models detect objects in images with zero-shot label support. Results are returned as DetectedObject instances with bounding boxes.
Python:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    # labels are the zero-shot classes to look for
    result = client.extract(
        "IDEA-Research/grounding-dino-base",
        Item(images=[{"data": image_bytes, "format": "jpeg"}]),
        labels=["car", "person", "traffic light"],
    )

    for obj in result["objects"]:
        print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
    # car: score=0.92, bbox=[120, 200, 450, 380]
    # person: score=0.88, bbox=[50, 100, 150, 350]

TypeScript:

    import { SIEClient } from "@superlinked/sie-sdk";

    const client = new SIEClient("http://localhost:8080");

    const result = await client.extract(
      "IDEA-Research/grounding-dino-base",
      { images: [imageBytes] },
      { labels: ["car", "person", "traffic light"] }
    );

    for (const obj of result.objects) {
      console.log(`${obj.label}: score=${obj.score.toFixed(2)}, bbox=${JSON.stringify(obj.bbox)}`);
    }
    // car: score=0.92, bbox=[120,200,450,380]
    // person: score=0.88, bbox=[50,100,150,350]

    await client.close();

DetectedObject Fields
| Field | Type | Description |
|---|---|---|
| label | str | Object class |
| score | float | Confidence score (0-1) |
| bbox | list | Bounding box [x1, y1, x2, y2] |
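
A common follow-up step is to drop low-confidence detections and derive box sizes from the `[x1, y1, x2, y2]` coordinates. The sketch below works on plain dicts shaped like `result["objects"]` in the Python example. The helper names `bbox_area` and `filter_detections`, the 0.5 threshold, and the third sample row are illustrative assumptions, as is the reading of (x1, y1) and (x2, y2) as opposite corners with x2 >= x1 and y2 >= y1.

```python
def bbox_area(bbox):
    """Area of an [x1, y1, x2, y2] box; 0 if the box is degenerate."""
    x1, y1, x2, y2 = bbox
    return max(0, x2 - x1) * max(0, y2 - y1)


def filter_detections(objects, min_score=0.5):
    """Keep detections at or above min_score, highest score first."""
    kept = [obj for obj in objects if obj["score"] >= min_score]
    return sorted(kept, key=lambda obj: obj["score"], reverse=True)


# Sample data shaped like result["objects"]; the last row is made up
# to show a detection falling below the threshold.
objects = [
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "traffic light", "score": 0.31, "bbox": [400, 20, 420, 70]},
]

for obj in filter_detections(objects, min_score=0.5):
    print(f"{obj['label']}: area={bbox_area(obj['bbox'])}")
# car: area=59400
# person: area=25000
```

The right threshold depends on the model and the labels, so treat 0.5 as a starting point rather than a recommendation.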
Visual QA
For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):
Python:

    # Reuses client and Item from the examples above
    result = client.extract(
        "naver-clova-ix/donut-base-finetuned-docvqa",
        Item(images=[{"data": receipt_image, "format": "jpeg"}]),
        instruction="What is the total amount?",
    )

    for entity in result["entities"]:
        print(entity["text"])
    # "$42.50"

TypeScript:

    const result = await client.extract(
      "naver-clova-ix/donut-base-finetuned-docvqa",
      { images: [receiptImage] },
      { instruction: "What is the total amount?" }
    );

    for (const entity of result.entities) {
      console.log(entity.text);
    }

Donut Models
Donut models parse structured documents without OCR pre-processing:
- naver-clova-ix/donut-base-finetuned-cord-v2 - Receipt parsing with key-value extraction (totals, line items, dates)
- naver-clova-ix/donut-base-finetuned-rvlcdip - Document classification into document types (letter, invoice, memo, etc.)
- naver-clova-ix/donut-base-finetuned-docvqa - Document question answering (ask natural language questions about a document image)
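
For the DocVQA checkpoint, several questions about the same document can be asked in a loop, one `extract` call per question. The sketch below assumes only what the Visual QA example shows: `client.extract(model, item, instruction=...)` returning a result with an `entities` list whose entries carry `text`. `ask_document` is a hypothetical helper, and `FakeClient` is a stand-in so the snippet runs without a server; whether the SDK offers a batched alternative is not covered here.

```python
DOCVQA_MODEL = "naver-clova-ix/donut-base-finetuned-docvqa"


def ask_document(client, item, questions, model=DOCVQA_MODEL):
    """Ask each question about the same document.

    Returns a dict mapping each question to the list of answer strings.
    """
    answers = {}
    for question in questions:
        result = client.extract(model, item, instruction=question)
        answers[question] = [entity["text"] for entity in result["entities"]]
    return answers


class FakeClient:
    """Stand-in for SIEClient that returns canned answers (illustration only)."""

    def extract(self, model, item, instruction=None):
        canned = {"What is the total amount?": "$42.50"}
        text = canned.get(instruction)
        return {"entities": [{"text": text}] if text is not None else []}


answers = ask_document(
    FakeClient(),
    item=None,  # with the real SDK this would be an Item holding the image
    questions=["What is the total amount?", "Who is the vendor?"],
)
print(answers)
```

With a real client, `item` would be built as in the Visual QA example. Returning a list per question keeps the empty-answer case explicit instead of raising on a missing entity.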
See the model catalog (Extract) for the complete list of vision and extraction models.
What’s Next
- OCR - convert document images and PDFs to Markdown
- NER & Entity Extraction - named entity recognition
- Relations & Classification - relation extraction and text classification
- Full model catalog - all supported models