
Top 6 Open-Source Alternatives to Hosted OCR APIs (Azure Document Intelligence, AWS Textract)


Azure Document Intelligence and AWS Textract are capable services, but their pricing pages describe the same shape: per-page metering, documents processed on the provider’s infrastructure, and whatever model version the vendor ships. For contracts, invoices, medical records, and anything under a data-residency rule, sending pages to a third party is often the blocker.

The models below are practical self-hosted options for many of the same jobs. Modern OCR is often vision-language work: models read a whole page and emit Markdown or structured JSON, layout and tables included. Here are six listed in the SIE model catalog. Each runs behind the same extract API on SIE; swap models by changing the model ID.

1. Docling

docling is IBM’s document-conversion toolkit. Per its docs, it parses PDFs, Office files, and scans into structured Markdown with layout-aware headings, tables, and reading order. Try it when you want document text an LLM can consume rather than a fixed field schema. Docling handles the whole document pipeline, with optional OCR for scanned pages.

Best for: general document-to-Markdown conversion for RAG. SIE model ID: docling.

2. PaddleOCR-VL

PaddlePaddle/PaddleOCR-VL-1.5 is a vision-language OCR model from the PaddleOCR project. The model card lists multilingual support and layout parsing for dense, mixed-script pages.

Best for: multilingual and dense-layout documents. SIE model ID: PaddlePaddle/PaddleOCR-VL-1.5.

3. MinerU

opendatalab/MinerU2.5-Pro-2604-1.2B targets complex PDFs, including scientific papers with formulas, tables, and figures. The published weights are 1.2B parameters.

Best for: scientific and technical PDFs with equations and tables. SIE model ID: opendatalab/MinerU2.5-Pro-2604-1.2B.

4. GLM-OCR

zai-org/GLM-OCR is a VLM OCR model that maps page images to text. Use it as a baseline in the “image in, Markdown out” path and compare against PaddleOCR-VL and LightOnOCR on your own documents.

Best for: general page recognition to Markdown. SIE model ID: zai-org/GLM-OCR.

5. LightOnOCR

lightonai/LightOnOCR-2-1B is a ~1B-parameter VLM OCR model that emits Markdown from document images. It appears in the SIE document-ocr example.

Best for: high-throughput document-to-Markdown at a small footprint. SIE model ID: lightonai/LightOnOCR-2-1B.

6. Donut

naver-clova-ix/donut-base-finetuned-cord-v2 uses OCR-free document understanding: it reads a receipt image and emits structured JSON directly, for example { "total": { "total_price": "28.52" }, ... }. That matches the field-extraction workflows people often wire to Textract or Document Intelligence.

Best for: structured extraction from forms, receipts, and invoices. SIE model ID: naver-clova-ix/donut-base-finetuned-cord-v2.
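A nested JSON tree like the one above is usually easier to consume once it is flattened into dotted field paths. The following is a minimal sketch using only the Python standard library. The helper names (iter_fields, flatten) and the "menu" entry in the sample are hypothetical; only total.total_price comes from the example above, and the real output shape depends on the model and the document.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_fields(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, value) pairs for every leaf in a nested JSON tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from iter_fields(child, path)
    elif isinstance(node, list):
        # List items get an index suffix, e.g. menu[0].nm
        for index, child in enumerate(node):
            yield from iter_fields(child, f"{prefix}[{index}]")
    else:
        yield prefix, str(node)


def flatten(tree: Dict[str, Any]) -> Dict[str, str]:
    """Collapse a nested tree into a flat {path: value} dictionary."""
    return dict(iter_fields(tree))


# Hypothetical receipt output; only total.total_price is taken from the text.
sample = {
    "menu": [{"nm": "Latte", "price": "4.50"}],
    "total": {"total_price": "28.52"},
}
fields = flatten(sample)
print(fields["total.total_price"])
```

Flat paths make it straightforward to map model output onto database columns or to diff two models' results field by field.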

Also worth knowing

microsoft/Florence-2-large is a versatile vision model with document-VQA and OCR modes, and it is also in the SIE catalog if you want a single model that spans OCR, captioning, and detection. Outside the SIE catalog, Tesseract and EasyOCR remain the classic open OCR engines. They are lightweight and battle-tested for plain text recognition, but they do not do layout-aware Markdown or structured extraction the way the VLM-based models above do, so treat them as a floor rather than a replacement for Textract’s structured output.

Why one engine matters for OCR

Hosted APIs hide this: OCR is almost never a single-model problem. A real pipeline has at least three concerns. Recognition turns an image into text or Markdown. Structured extraction turns an image into JSON fields. And zero-shot entity extraction pulls typed fields out of the recognized text when you want to declare labels at query time instead of fine-tuning a model.

On SIE, recognition, structured extraction, and zero-shot NER share one SDK surface (SIEClient.extract), the same auth, and the same Docker image. Only the model ID changes:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# image_bytes holds the raw bytes of a page image, loaded beforehand.

# Recognition: VLM-OCR, returns Markdown
client.extract("lightonai/LightOnOCR-2-1B", Item(images=[image_bytes]))

# Structured: end-to-end Donut, returns a JSON tree
client.extract("naver-clova-ix/donut-base-finetuned-cord-v2", Item(images=[image_bytes]))

# Zero-shot NER on the recognized text (recognized_markdown), labels declared at query time
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=recognized_markdown),
    labels=["merchant", "total", "date", "line_item"],
)

That is the whole pitch. Instead of committing to one provider’s model, you keep every option open and pick per document, per request, against one endpoint. The document-ocr example is a working browser UI that runs the same document through a recognition model, a structured model, and a zero-shot NER model, swapping one identifier at a time so you can see what changes. Run it locally with docker compose up.
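Picking a model per document can be as simple as a lookup table in front of the extract call. The sketch below is one possible shape, not part of the SIE SDK: the task labels and the pick_model helper are hypothetical, and the model IDs are the ones listed in this post.

```python
# Hypothetical task labels mapped to the model IDs listed in this post.
MODEL_BY_TASK = {
    "markdown": "lightonai/LightOnOCR-2-1B",
    "multilingual": "PaddlePaddle/PaddleOCR-VL-1.5",
    "scientific_pdf": "opendatalab/MinerU2.5-Pro-2604-1.2B",
    "receipt_fields": "naver-clova-ix/donut-base-finetuned-cord-v2",
}

# Fallback for anything the table does not cover (an assumption, tune to taste).
DEFAULT_MODEL = "docling"


def pick_model(task: str) -> str:
    """Return the model ID for a task label, or the default if unknown."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


print(pick_model("receipt_fields"))
print(pick_model("contract"))
```

The returned string would be passed as the first argument to client.extract, as in the snippet above, so routing stays a data change rather than a code change.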

With SIE on your own infrastructure, document bytes stay on your network. You pay for the GPUs you provision rather than a per-page API meter. Models share GPUs through on-demand loading. The same Docker image runs locally and on Kubernetes; see the quickstart and deployment guide for cluster scaling.
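Whether provisioned GPUs beat a per-page meter depends on volume. A rough break-even estimate is the monthly GPU cost divided by the per-page price. The sketch below uses placeholder numbers that are not real vendor or cloud prices, and it ignores engineering time and idle capacity; substitute your own figures.

```python
def break_even_pages(gpu_cost_per_month: float, price_per_page: float) -> float:
    """Pages per month at which self-hosting cost equals the per-page bill."""
    if price_per_page <= 0:
        raise ValueError("price_per_page must be positive")
    return gpu_cost_per_month / price_per_page


# Placeholder inputs for illustration only; NOT actual prices.
pages = break_even_pages(gpu_cost_per_month=1000.0, price_per_page=0.002)
print(f"Break-even at about {pages:,.0f} pages per month")
```

Below the break-even volume the hosted meter is cheaper on raw cost alone; above it, each additional page on your own GPUs has no per-page charge.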


Frequently asked questions

Can these replace AWS Textract or Azure Document Intelligence?

For many document-to-text and document-to-JSON workflows, yes, though you should benchmark on your own documents. Docling, PaddleOCR-VL, and MinerU target layout-aware Markdown output; Donut targets direct field extraction from forms and receipts.

Do my documents stay in my cloud?

When you run SIE on your own infrastructure, processing stays inside your network. Nothing in this flow requires sending pages to AWS or Azure OCR APIs.

How do I choose between the six?

Try two or three on your own documents. Use a recognition model (LightOnOCR, GLM-OCR, PaddleOCR-VL) for image-to-Markdown, MinerU or Docling for complex or technical PDFs, and Donut when you need structured fields rather than prose.
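A side-by-side trial can be scripted with a small harness that scores each model's output against a hand-checked transcription. This is a minimal sketch under stated assumptions: the recognize callable is a stand-in you would wire to your own extract call, fake_recognize and the model names "model-a" and "model-b" are hypothetical, and character-level similarity from difflib is a crude proxy rather than a standard OCR metric.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Sequence, Tuple

# (model_id, image_bytes) -> recognized text; wire this to your own client.
Recognizer = Callable[[str, bytes], str]


def score_models(
    recognize: Recognizer,
    model_ids: Sequence[str],
    samples: Sequence[Tuple[bytes, str]],
) -> Dict[str, float]:
    """Return the mean similarity to the reference text for each model ID."""
    scores: Dict[str, float] = {}
    for model_id in model_ids:
        ratios = [
            SequenceMatcher(None, recognize(model_id, image), reference).ratio()
            for image, reference in samples
        ]
        scores[model_id] = sum(ratios) / len(ratios) if ratios else 0.0
    return scores


def fake_recognize(model_id: str, image: bytes) -> str:
    """Stand-in recognizer: 'model-b' pretends to drop the last character."""
    text = image.decode("utf-8")
    return text if model_id == "model-a" else text[:-1]


# Each sample pairs page bytes with a hand-checked reference transcription.
samples = [(b"Total 28.52", "Total 28.52")]
scores = score_models(fake_recognize, ["model-a", "model-b"], samples)
print(max(scores, key=scores.get))
```

Even a dozen representative pages scored this way tends to be more informative than published benchmarks, because the ranking is measured on your own layouts and languages.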

Can I combine OCR with search and extraction?

Yes. Recognition, embeddings, reranking, and entity extraction all run on the same SIE server through extract, encode, and score, so a document pipeline is one deployment rather than several.
