OCR
SIE supports OCR via four dedicated models plus Florence-2’s <OCR> task:
- OCR models (zai-org/GLM-OCR, lightonai/LightOnOCR-2-1B, PaddlePaddle/PaddleOCR-VL-1.5, docling). Convert document images or PDFs to Markdown, preserving tables and headings.
- Florence-2 (microsoft/Florence-2-base). Flat-text OCR via the <OCR> and <OCR_WITH_REGION> task tokens.
For image captioning, object detection, and document QA, see Vision Tasks.
Pick by what you need to extract: structured Markdown for downstream chunking → one of the four dedicated OCR models; flat text or bounding-box OCR over a single image → Florence-2.
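The selection rule above can be written as a small lookup. This is only an illustrative sketch: the `pick_ocr_models` helper and its `need` labels (`markdown`, `flat_text`, `regions`) are hypothetical names, not part of the SIE SDK. Only the model identifiers come from this page.

```python
# Hypothetical helper (not part of the SIE SDK): maps an extraction need
# to the model identifiers listed on this page.
DEDICATED_OCR_MODELS = [
    "zai-org/GLM-OCR",
    "lightonai/LightOnOCR-2-1B",
    "PaddlePaddle/PaddleOCR-VL-1.5",
    "docling",
]
FLORENCE_MODEL = "microsoft/Florence-2-base"


def pick_ocr_models(need: str) -> list[str]:
    """Return candidate model IDs for a given need.

    need: "markdown" (structured output for chunking),
          "flat_text" or "regions" (single-image OCR via Florence-2).
    """
    if need == "markdown":
        return list(DEDICATED_OCR_MODELS)
    if need in ("flat_text", "regions"):
        return [FLORENCE_MODEL]
    raise ValueError(f"unknown need: {need!r}")
```

For structured Markdown the helper returns all four candidates, because the choice among them depends on input type and language coverage, as the table in the next section describes.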
OCR Models
For converting document images or PDFs to Markdown, use one of the four dedicated OCR models. They preserve tables, headings, and reading order; Florence-2’s <OCR> task only returns flat text.
| Model | Input | Best for | Notes |
|---|---|---|---|
| zai-org/GLM-OCR | Image | High-quality multilingual OCR | CogViT + GLM-0.5B; bfloat16 only |
| lightonai/LightOnOCR-2-1B | Image | Larger model, 2.1B params | Pixtral encoder + Qwen3 decoder |
| PaddlePaddle/PaddleOCR-VL-1.5 | Image | 109 languages, multi-mode (table/formula/chart) | 0.9B params; smallest |
| docling | Document (PDF/DOCX/HTML) | Multi-page documents, layout-aware | Composite pipeline; OCR is opt-in |
GLM-OCR
GLM-OCR returns a single Markdown string per page in entities[0].text.
Python:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    with open("page.png", "rb") as f:
        page_bytes = f.read()

    result = client.extract(
        "zai-org/GLM-OCR",
        Item(images=[{"data": page_bytes, "format": "png"}]),
    )
    markdown = result["entities"][0]["text"]
    print(markdown)

TypeScript:

    import { SIEClient } from "@superlinked/sie-sdk";

    const client = new SIEClient("http://localhost:8080");

    const result = await client.extract(
      "zai-org/GLM-OCR",
      { images: [pageBytes] }, // Uint8Array of PNG/JPEG data
    );
    const markdown = result.entities[0].text;
    console.log(markdown);

    await client.close();

LightOnOCR-2-1B
Same call shape as GLM-OCR: one image per item, Markdown returned in entities[0].text.
Python:

    result = client.extract(
        "lightonai/LightOnOCR-2-1B",
        Item(images=[{"data": page_bytes, "format": "png"}]),
    )
    markdown = result["entities"][0]["text"]

TypeScript:

    const result = await client.extract(
      "lightonai/LightOnOCR-2-1B",
      { images: [pageBytes] },
    );
    const markdown = result.entities[0].text;

PaddleOCR-VL-1.5
PaddleOCR-VL supports six task modes via options.task: ocr (default), table, formula, chart, spotting, seal.
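Because the task mode is passed as a plain string, a typo would only surface at request time. A minimal sketch of a client-side check follows; `paddle_options` and `PADDLE_TASKS` are hypothetical names introduced here for illustration, not SDK features, and the list of modes is copied from the sentence above.

```python
# Hypothetical helper (not part of the SIE SDK): builds the options dict
# for PaddleOCR-VL and rejects task names not listed on this page.
PADDLE_TASKS = ("ocr", "table", "formula", "chart", "spotting", "seal")


def paddle_options(task: str = "ocr") -> dict:
    """Return {"task": task} after checking the task name locally."""
    if task not in PADDLE_TASKS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {PADDLE_TASKS}"
        )
    return {"task": task}
```

The returned dict is meant to be passed as the `options` argument of `client.extract`, for example `options=paddle_options("table")`.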
Python:

    # Default OCR mode
    result = client.extract(
        "PaddlePaddle/PaddleOCR-VL-1.5",
        Item(images=[{"data": page_bytes, "format": "png"}]),
    )
    markdown = result["entities"][0]["text"]

    # Table-extraction mode
    result = client.extract(
        "PaddlePaddle/PaddleOCR-VL-1.5",
        Item(images=[{"data": table_image, "format": "png"}]),
        options={"task": "table"},
    )

TypeScript:

    const result = await client.extract(
      "PaddlePaddle/PaddleOCR-VL-1.5",
      { images: [pageBytes] },
      { options: { task: "table" } }, // or "ocr", "formula", "chart", "spotting", "seal"
    );
    const markdown = result.entities[0].text;

Docling (multi-page documents)
Docling parses entire PDF/DOCX/HTML files in one call, preserving layout, tables, and headings. Output goes to data (not entities):
| Field | Type | Description |
|---|---|---|
| text | str | Plain-text rendering |
| markdown | str | Markdown with tables and headings preserved |
| document | dict | Full DoclingDocument JSON for downstream chunkers |
Docling ships two profiles:
| Profile | What it does | When to use |
|---|---|---|
| default | Layout + table-structure parsing; uses embedded text from PDFs | Born-digital PDFs and DOCX/HTML (fastest) |
| ocr | Same as default + runs OCR on rasterized pages | Scanned PDFs or images-only documents |
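When it is not known in advance whether a PDF is born-digital or scanned, one possible approach is to try the fast profile first and fall back to OCR only if almost no text comes back. The sketch below rests on an assumption this page does not state: that a scanned document yields a near-empty `markdown` field under the default profile. `markdown_with_ocr_fallback`, `run_docling`, and the `min_chars` threshold are hypothetical names and values.

```python
from typing import Callable


def markdown_with_ocr_fallback(
    run_docling: Callable[[str], dict],
    min_chars: int = 20,
) -> str:
    """Try the default profile, then the ocr profile if output is too short.

    run_docling: caller-supplied function that takes a profile name
        ("default" or "ocr"), performs the docling extract call, and
        returns the result dict with a data["markdown"] field.
    min_chars: arbitrary threshold (an assumption) below which the
        default output is treated as "no embedded text found".
    """
    markdown = run_docling("default")["data"]["markdown"]
    if len(markdown.strip()) >= min_chars:
        return markdown
    # Assumed heuristic: near-empty output suggests a scanned document.
    return run_docling("ocr")["data"]["markdown"]
```

Passing the extract call in as a function keeps the fallback logic independent of the SDK. In practice `run_docling` would wrap `client.extract("docling", ...)`, adding `options={"profile": "ocr"}` only for the OCR profile.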
Python:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    with open("report.pdf", "rb") as f:
        pdf_bytes = f.read()

    # Default profile: fast, no OCR (born-digital PDFs only)
    result = client.extract(
        "docling",
        Item(document={"data": pdf_bytes, "format": "pdf"}),
    )
    markdown = result["data"]["markdown"]

    # OCR profile: needed for scanned PDFs
    result_ocr = client.extract(
        "docling",
        Item(document={"data": pdf_bytes, "format": "pdf"}),
        options={"profile": "ocr"},
    )

TypeScript:

    import { SIEClient } from "@superlinked/sie-sdk";

    const client = new SIEClient("http://localhost:8080");

    const result = await client.extract(
      "docling",
      { document: { data: pdfBytes, format: "pdf" } },
    );
    const markdown = result.data.markdown;

    // OCR profile: needed for scanned PDFs
    const resultOcr = await client.extract(
      "docling",
      { document: { data: pdfBytes, format: "pdf" } },
      { options: { profile: "ocr" } },
    );

    await client.close();

OCR (Text from Images)
For flat-text OCR without layout, the microsoft/Florence-2-base model exposes an <OCR> task token:
Python:

    result = client.extract(
        "microsoft/Florence-2-base",
        Item(images=[{"data": document_image, "format": "png"}]),
        options={"task": "<OCR>"}
    )

    for entity in result["entities"]:
        print(entity["text"])
    # Extracted text from the document image

TypeScript:

    const result = await client.extract(
      "microsoft/Florence-2-base",
      { images: [documentImage] }, // Uint8Array of PNG data
      { options: { task: "<OCR>" } }
    );

    for (const entity of result.entities) {
      console.log(entity.text);
    }

OCR with Regions
To get text with bounding box positions (the default task):
Python:

    result = client.extract(
        "microsoft/Florence-2-base",
        Item(images=[{"data": document_image, "format": "png"}]),
        options={"task": "<OCR_WITH_REGION>"}
    )

    for entity in result["entities"]:
        print(f"{entity['text']} at {entity['bbox']}")

TypeScript:

    const result = await client.extract(
      "microsoft/Florence-2-base",
      { images: [documentImage] },
      { options: { task: "<OCR_WITH_REGION>" } }
    );

    for (const entity of result.entities) {
      console.log(`${entity.text} at ${JSON.stringify(entity.bbox)}`);
    }

See Vision Tasks for the full list of Florence-2 task tokens.
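When post-processing <OCR_WITH_REGION> output, it can be convenient to reduce the result to plain (text, bbox) pairs before further handling. The following is a minimal sketch, assuming the result is dict-like as in the Python examples on this page. `collect_regions` is a hypothetical helper, and it deliberately treats `bbox` as opaque because this page does not specify the coordinate layout.

```python
def collect_regions(result: dict) -> list[tuple[str, object]]:
    """Return (text, bbox) pairs from an extract result.

    Entities with empty text or without a bbox are skipped. The bbox
    value is passed through unchanged; its coordinate layout is not
    assumed here.
    """
    pairs = []
    for entity in result.get("entities", []):
        text = (entity.get("text") or "").strip()
        bbox = entity.get("bbox")
        if text and bbox is not None:
            pairs.append((text, bbox))
    return pairs
```

Keeping the bounding box untouched means the helper works regardless of whether regions are reported as corner pairs or as polygons.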
What’s Next
- Vision Tasks - image captioning, object detection, and document understanding
- NER & Entity Extraction - named entity recognition
- Relations & Classification - relation extraction and text classification
- Full model catalog - all supported models