Vision Tasks

SIE supports three families of vision extraction beyond OCR:

Image captioning (microsoft/Florence-2-base). Describes an image using the <CAPTION> or <DETAILED_CAPTION> task token.
Object detection (IDEA-Research/grounding-dino-base, google/owlv2-base-patch16-ensemble). Returns DetectedObject results with a bounding box for each zero-shot label found.
Visual QA (naver-clova-ix/donut-base-finetuned-docvqa, Florence-2 <DocVQA>). Answers natural-language questions about an image or document.

For converting document images or PDFs to Markdown (including the four dedicated OCR adapters and Florence-2’s <OCR> task), see OCR.

Pick by what you need to extract: a description of an image → image captioning; specific objects → detection; an answer to a question about an image → visual QA.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# image_bytes holds the raw bytes of a JPEG image
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)

for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."

import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [imageBytes] },  // Uint8Array of JPEG/PNG data
  { options: { task: "<CAPTION>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();

Florence-2 Task Prompts

Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>. For Florence-2 OCR usage and code samples, see OCR.

| Task | Task Token | Output |
| --- | --- | --- |
| OCR | `<OCR>` | Extracted text |
| OCR with regions | `<OCR_WITH_REGION>` | Text with bounding boxes (default) |
| Caption | `<CAPTION>` | Image description |
| Detailed caption | `<DETAILED_CAPTION>` | Extended description |
| Object detection | `<OD>` | Bounding boxes and labels |
| Dense region caption | `<DENSE_REGION_CAPTION>` | Region descriptions |
| Phrase grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | Match labels to regions |
| Document QA | `<DocVQA>` | Answer to question |
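A mistyped task token is easy to miss, so it can help to check the token before sending a request. The sketch below is illustrative only: `FLORENCE2_TASKS`, `DEFAULT_TASK`, and `florence2_options` are hypothetical names that are not part of the SIE SDK, and the dictionary simply mirrors the task table above.

```python
# Hypothetical helper, not part of sie_sdk: the entries mirror the task table above.
FLORENCE2_TASKS = {
    "<OCR>": "Extracted text",
    "<OCR_WITH_REGION>": "Text with bounding boxes",
    "<CAPTION>": "Image description",
    "<DETAILED_CAPTION>": "Extended description",
    "<OD>": "Bounding boxes and labels",
    "<DENSE_REGION_CAPTION>": "Region descriptions",
    "<CAPTION_TO_PHRASE_GROUNDING>": "Match labels to regions",
    "<DocVQA>": "Answer to question",
}

# The documented default when no task is given.
DEFAULT_TASK = "<OCR_WITH_REGION>"


def florence2_options(task: str = DEFAULT_TASK) -> dict:
    """Build the options dict for a Florence-2 request, rejecting unknown tokens."""
    if task not in FLORENCE2_TASKS:
        valid = ", ".join(sorted(FLORENCE2_TASKS))
        raise ValueError(f"Unknown Florence-2 task {task!r}; expected one of: {valid}")
    return {"task": task}
```

The returned dict can be passed as the `options` argument shown earlier, for example `options=florence2_options("<CAPTION>")`. Tokens are matched case-sensitively, so `"<caption>"` is rejected.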

Object Detection

GroundingDINO and OWL-v2 are zero-shot detectors: they locate objects matching the text labels you pass with the request, without being fine-tuned on those classes. Results are returned as DetectedObject instances with bounding boxes.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "IDEA-Research/grounding-dino-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    labels=["car", "person", "traffic light"]
)

for obj in result["objects"]:
    print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
# car: score=0.92, bbox=[120, 200, 450, 380]
# person: score=0.88, bbox=[50, 100, 150, 350]

import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "IDEA-Research/grounding-dino-base",
  { images: [imageBytes] },
  { labels: ["car", "person", "traffic light"] }
);

for (const obj of result.objects) {
  console.log(`${obj.label}: score=${obj.score.toFixed(2)}, bbox=${JSON.stringify(obj.bbox)}`);
}
// car: score=0.92, bbox=[120,200,450,380]
// person: score=0.88, bbox=[50,100,150,350]

await client.close();

DetectedObject Fields

| Field | Type | Description |
| --- | --- | --- |
| `label` | `str` | Object class |
| `score` | `float` | Confidence score (0-1) |
| `bbox` | `list` | Bounding box `[x1, y1, x2, y2]` |
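A common next step is to drop low-confidence detections and turn each box into a width and height. The following is a minimal sketch that works on plain dicts shaped like the fields above; `filter_detections` and `bbox_size` are hypothetical helpers, the 0.5 threshold is arbitrary, and it assumes `bbox` holds pixel corner coordinates with `(x1, y1)` as the top-left corner. The third sample detection is invented for illustration.

```python
def filter_detections(objects, min_score=0.5):
    """Keep only detections whose confidence is at least min_score."""
    return [obj for obj in objects if obj["score"] >= min_score]


def bbox_size(bbox):
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


# Sample data in the documented shape; the last entry is made up.
objects = [
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
    {"label": "traffic light", "score": 0.31, "bbox": [300, 20, 330, 90]},
]

for obj in filter_detections(objects, min_score=0.5):
    width, height = bbox_size(obj["bbox"])
    print(f"{obj['label']}: {width}x{height}")
# car: 330x180
# person: 100x250
```

With a live server, the list in `result["objects"]` from the detection example would take the place of the sample data.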

Visual QA

For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):

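from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# receipt_image holds the raw JPEG bytes of the scanned receipt
result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?"
)

for entity in result["entities"]:
    print(entity["text"])
# "$42.50"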

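import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "naver-clova-ix/donut-base-finetuned-docvqa",
  { images: [receiptImage] },  // Uint8Array of JPEG/PNG data
  { instruction: "What is the total amount?" }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();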
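Each request carries a single question, so asking several questions about one document means one call per question. The sketch below wraps that loop. `ask_questions` is a hypothetical helper, not part of the SIE SDK; it assumes only the call shape and the `result["entities"]` response shape shown in the Python example above, and it takes the extract function as an argument so it can be tried without a running server.

```python
from typing import Any, Callable

DOCVQA_MODEL = "naver-clova-ix/donut-base-finetuned-docvqa"


def ask_questions(
    extract: Callable[..., dict],
    item: Any,
    questions: list[str],
    model: str = DOCVQA_MODEL,
) -> dict[str, list[str]]:
    """Send one request per question and collect the answer texts.

    `extract` is expected to behave like client.extract in the example above:
    extract(model, item, instruction=...) returning a dict with an "entities" list.
    """
    answers: dict[str, list[str]] = {}
    for question in questions:
        result = extract(model, item, instruction=question)
        answers[question] = [entity["text"] for entity in result["entities"]]
    return answers
```

With a real client this would be called as `ask_questions(client.extract, Item(images=[...]), ["What is the total amount?", "What is the date?"])`. The requests run sequentially, which keeps the sketch simple at the cost of latency when there are many questions.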

Donut Models

Donut models parse structured documents without OCR pre-processing:

naver-clova-ix/donut-base-finetuned-cord-v2 - Receipt parsing with key-value extraction (totals, line items, dates)
naver-clova-ix/donut-base-finetuned-rvlcdip - Document classification into document types (letter, invoice, memo, etc.)
naver-clova-ix/donut-base-finetuned-docvqa - Document question answering (ask natural language questions about a document image)

See the model catalog (Extract) for the complete list of vision and extraction models.

What’s Next

OCR - convert document images and PDFs to Markdown
NER & Entity Extraction - named entity recognition
Relations & Classification - relation extraction and text classification
Full model catalog - all supported models