---
title: Vision Tasks
description: Caption images, detect objects, and answer visual questions with vision-language models.
canonical_url: https://superlinked.com/docs/extract/vision
last_updated: 2026-05-18
---

SIE supports three families of vision extraction beyond OCR:

- **Image captioning** (`microsoft/Florence-2-base`). Describe images with the `<CAPTION>` and `<DETAILED_CAPTION>` task tokens.
- **Object detection** (`IDEA-Research/grounding-dino-base`, `google/owlv2-base-patch16-ensemble`). Return `DetectedObject` results with zero-shot labels and bounding boxes.
- **Visual QA** (`naver-clova-ix/donut-base-finetuned-docvqa`, Florence-2 `<DocVQA>`). Answer natural-language questions about an image or document.

For converting document images or PDFs to Markdown (including the four dedicated OCR adapters and Florence-2's `<OCR>` task), see [OCR](/docs/extract/ocr/).

Pick by what you need to extract: a description of an image → image captioning; specific objects → detection; an answer to a question about an image → visual QA.

## Image Captioning

Source: [packages/sie_server/src/sie_server/adapters/florence2/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/florence2/__init__.py)

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

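client = SIEClient("http://localhost:8080")

# Placeholder path: load the raw bytes of any JPEG file.
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()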

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"}
)

for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [imageBytes] },  // Uint8Array of JPEG/PNG data
  { options: { task: "<CAPTION>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();
```

### Florence-2 Task Prompts

Florence-2 tasks are selected via `options={"task": "<TASK_TOKEN>"}`. The default task is `<OCR_WITH_REGION>`. For Florence-2 OCR usage and code samples, see [OCR](/docs/extract/ocr/#ocr-text-from-images).

| Task | Task Token | Output |
|------|-----------|--------|
| OCR | `<OCR>` | Extracted text |
| OCR with regions | `<OCR_WITH_REGION>` | Text with bounding boxes (default) |
| Caption | `<CAPTION>` | Image description |
| Detailed caption | `<DETAILED_CAPTION>` | Extended description |
| Object detection | `<OD>` | Bounding boxes and labels |
| Dense region caption | `<DENSE_REGION_CAPTION>` | Region descriptions |
| Phrase grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | Match labels to regions |
| Document QA | `<DocVQA>` | Answer to question |

:::caution
Do **not** pass task tokens like `<OCR>` via the `instruction` parameter. The `instruction` parameter appends free text to the task prompt, so a task token passed there produces an invalid prompt such as `<OCR_WITH_REGION><OCR>`. Use `options={"task": "<OCR>"}` instead.
:::
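
One way to catch this mistake before a request is sent is to validate tokens on the client side. The sketch below is not part of the SIE SDK: `FLORENCE2_TASKS`, `florence2_options`, and `check_instruction` are hypothetical names, and the token set is simply copied from the table above.

```python
from typing import Dict, Optional

# Task tokens copied from the Florence-2 task table (assumed complete).
FLORENCE2_TASKS = {
    "<OCR>",
    "<OCR_WITH_REGION>",
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<OD>",
    "<DENSE_REGION_CAPTION>",
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<DocVQA>",
}


def florence2_options(task: str) -> Dict[str, str]:
    """Build the `options` dict for a Florence-2 call (hypothetical helper)."""
    if task not in FLORENCE2_TASKS:
        raise ValueError(f"unknown Florence-2 task token: {task!r}")
    return {"task": task}


def check_instruction(instruction: Optional[str]) -> Optional[str]:
    """Reject an `instruction` value that is really a task token."""
    if instruction is not None and instruction.strip() in FLORENCE2_TASKS:
        raise ValueError(
            f"{instruction!r} is a task token; pass it via options={{'task': ...}}"
        )
    return instruction


print(florence2_options("<DETAILED_CAPTION>"))
# {'task': '<DETAILED_CAPTION>'}
```

The returned dict can be passed as the `options` argument of `client.extract(...)`. Token matching is case-sensitive, so `<docvqa>` is rejected while `<DocVQA>` is accepted.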

## Object Detection

Grounding DINO and OWLv2 detect objects that match the text labels you pass in the request, with no fine-tuning required (zero-shot). Results are returned as `DetectedObject` instances, each with a label, a confidence score, and a bounding box.

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

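client = SIEClient("http://localhost:8080")

# Placeholder path: load the raw bytes of any JPEG file.
with open("street.jpg", "rb") as f:
    image_bytes = f.read()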

result = client.extract(
    "IDEA-Research/grounding-dino-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    labels=["car", "person", "traffic light"]
)

for obj in result["objects"]:
    print(f"{obj['label']}: score={obj['score']:.2f}, bbox={obj['bbox']}")
# car: score=0.92, bbox=[120, 200, 450, 380]
# person: score=0.88, bbox=[50, 100, 150, 350]
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "IDEA-Research/grounding-dino-base",
  { images: [imageBytes] },  // Uint8Array of JPEG/PNG data
  { labels: ["car", "person", "traffic light"] }
);

for (const obj of result.objects) {
  console.log(`${obj.label}: score=${obj.score.toFixed(2)}, bbox=${JSON.stringify(obj.bbox)}`);
}
// car: score=0.92, bbox=[120,200,450,380]
// person: score=0.88, bbox=[50,100,150,350]

await client.close();
```

### DetectedObject Fields

| Field | Type | Description |
|-------|------|-------------|
| `label` | `str` | Object class |
| `score` | `float` | Confidence score (0-1) |
| `bbox` | `list` | Bounding box `[x1, y1, x2, y2]` |
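
Because each box is given as `[x1, y1, x2, y2]` corners, width and height follow by subtraction. The following is a minimal sketch, assuming detections are plain dicts with the `label`, `score`, and `bbox` keys used in the Python example above. `box_size` and `filter_detections` are hypothetical helpers rather than SDK functions, and the `0.5` default threshold is an arbitrary illustration.

```python
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple


def box_size(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def filter_detections(
    objects: Iterable[Dict[str, Any]],
    min_score: float = 0.5,
    labels: Optional[Iterable[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep detections at or above min_score, optionally restricted to labels."""
    wanted = set(labels) if labels is not None else None
    return [
        obj
        for obj in objects
        if obj["score"] >= min_score and (wanted is None or obj["label"] in wanted)
    ]


# Sample data mirroring the example output above.
detections = [
    {"label": "car", "score": 0.92, "bbox": [120, 200, 450, 380]},
    {"label": "person", "score": 0.88, "bbox": [50, 100, 150, 350]},
]

for obj in filter_detections(detections, min_score=0.9):
    width, height = box_size(obj["bbox"])
    print(f"{obj['label']}: width={width}, height={height}")
# car: width=330, height=180
```

In practice `detections` would be `result["objects"]` from the extract call. A stricter threshold trades recall for precision, so the right value depends on the model and the labels.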

## Visual QA

For Donut models, the question is passed via the `instruction` parameter (free text appended to the task prompt):

#### Python

```python
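from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# receipt_image holds the raw JPEG bytes of the scanned receipt.
result = client.extract(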
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?"
)

for entity in result["entities"]:
    print(entity["text"])
# "$42.50"
```

#### TypeScript

```typescript
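import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "naver-clova-ix/donut-base-finetuned-docvqa",
  { images: [receiptImage] },  // Uint8Array of JPEG/PNG data
  { instruction: "What is the total amount?" }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();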
```

### Donut Models

Donut models parse structured documents without OCR pre-processing:

- **`naver-clova-ix/donut-base-finetuned-cord-v2`** - Receipt parsing with key-value extraction (totals, line items, dates)
- **`naver-clova-ix/donut-base-finetuned-rvlcdip`** - Document classification into document types (letter, invoice, memo, etc.)
- **`naver-clova-ix/donut-base-finetuned-docvqa`** - Document question answering (ask natural language questions about a document image)
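
Since each checkpoint serves a different document task, a small lookup can keep the model choice in one place. This is a sketch only: `DONUT_MODELS`, the short task keys, and `donut_call_args` are hypothetical names. It attaches an `instruction` only for document QA, as shown above. Whether the other two checkpoints accept or need one is not stated here, so the sketch sends none.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical short keys mapped to the Donut checkpoints listed above.
DONUT_MODELS = {
    "receipt": "naver-clova-ix/donut-base-finetuned-cord-v2",
    "classify": "naver-clova-ix/donut-base-finetuned-rvlcdip",
    "docvqa": "naver-clova-ix/donut-base-finetuned-docvqa",
}


def donut_call_args(
    task: str, question: Optional[str] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (model_id, extra_kwargs) for a Donut extract call."""
    if task not in DONUT_MODELS:
        raise ValueError(f"unknown Donut task: {task!r}")
    if task == "docvqa":
        if not question:
            raise ValueError("document QA requires a question")
        # The question travels in the `instruction` parameter.
        return DONUT_MODELS[task], {"instruction": question}
    # Assumption: no instruction is sent for parsing or classification.
    return DONUT_MODELS[task], {}


model, kwargs = donut_call_args("docvqa", "What is the total amount?")
print(model, kwargs)
```

The result can then be unpacked into a call such as `client.extract(model, item, **kwargs)`.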

See the [model catalog (Extract)](/models#task=extract) for the complete list of vision and extraction models.

## What's Next

- [OCR](/docs/extract/ocr/) - convert document images and PDFs to Markdown
- [NER & Entity Extraction](/docs/extract/) - named entity recognition
- [Relations & Classification](/docs/extract/relations/) - relation extraction and text classification
- [Full model catalog](/models#task=extract) - all supported models
