Swap an OCR model with one identifier change

Try it on Hugging Face Spaces: superlinked/document-ocr (zero install, click any sample)

View on GitHub: examples/document-ocr

What this is

OCR is rarely a single-model problem. A real pipeline has three concerns:

Recognition (image to text). VLM-OCRs like LightOnOCR or PaddleOCR-VL take a whole document and emit Markdown.
Structured extraction (image to JSON). End-to-end document models like Donut on CORD skip the text intermediate entirely and emit nested JSON directly.
Zero-shot NER on the recognized text (text to typed fields). When you want to declare entity labels at query time instead of fine-tuning a new model.

This demo wires all three behind one SIE server. Pick a sample document on the left, swap any of the three models in the dropdowns, and watch SIE hot-swap the underlying architecture with a single identifier change.

# Recognition: VLM-OCR, returns Markdown
client.extract("lightonai/LightOnOCR-2-1B", Item(images=[image_bytes]))

# Structured: end-to-end Donut, returns JSON tree
client.extract("naver-clova-ix/donut-base-finetuned-cord-v2", Item(images=[image_bytes]))

# NER: zero-shot, returns typed entities
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=recognized_markdown),
    labels=["merchant", "total", "date"],
)

Each panel in the UI has a “See the SIE call” disclosure that shows the exact line of code that just ran. Swap a dropdown and the snippet updates with the one parameter that changed.
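The three calls can also be chained into a single function. What follows is a minimal sketch, not the demo's actual code: run_pipeline and to_text are hypothetical names, and the Item dataclass is only a stand-in for the SDK's own Item type. Because the shape of the value returned by client.extract is not described here, the sketch asks the caller for a to_text function that pulls the Markdown out of the recognition result.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class Item:
    """Stand-in for the SDK's Item type (assumption: the real one comes from the SIE SDK)."""
    images: Optional[Sequence[bytes]] = None
    text: Optional[str] = None


def run_pipeline(
    client: Any,
    image_bytes: bytes,
    to_text: Callable[[Any], str],
    labels: Sequence[str],
    recognition_id: str = "lightonai/LightOnOCR-2-1B",
    structured_id: str = "naver-clova-ix/donut-base-finetuned-cord-v2",
    ner_id: str = "urchade/gliner_multi-v2.1",
) -> dict:
    """Run recognition, structured extraction, and NER on one image."""
    image_item = Item(images=[image_bytes])

    # Stage 1: image to Markdown.
    recognized = client.extract(recognition_id, image_item)

    # Stage 2: image to JSON, independent of stage 1.
    structured = client.extract(structured_id, image_item)

    # Stage 3: NER runs on the text recovered from stage 1.
    # to_text is hypothetical glue; the real result shape may differ.
    entities = client.extract(
        ner_id,
        Item(text=to_text(recognized)),
        labels=list(labels),
    )

    return {"recognized": recognized, "structured": structured, "entities": entities}
```

Swapping a model in this sketch means passing a different recognition_id, structured_id, or ner_id; the body of the function stays the same.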

Why SIE specifically for OCR

You could build this with three SaaS APIs (one OCR provider, one document AI provider, one NER provider). It would work. It would also be three auth flows, three rate-limit budgets, three SDKs, and three deployment stories.

SIE collapses that into one process:

One server, three primitives. encode, score, extract. This demo uses extract for all three model classes.
One SDK call. client.extract(model_id, item) works for VLM-OCR, end-to-end document AI, and zero-shot NER (the NER call also passes labels). To swap models, change the model ID alone.
Open source, runs in your VPC. Customer documents never leave the host running the Compose stack. Compliance teams stop blocking you.
Same code from laptop to Kubernetes. SIE ships a Helm chart, KEDA autoscaling, and Terraform modules. The code in this demo runs unchanged against a production cluster; only the URL changes.
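One way to keep "only the URL changes" true in practice is to read the server address from the environment. This is a sketch under stated assumptions: SIE_URL is a hypothetical variable name, resolve_sie_url is a hypothetical helper, and the default address is a placeholder rather than a documented SIE port.

```python
import os
from typing import Mapping, Optional

# Assumption: placeholder address, not a documented SIE default.
DEFAULT_SIE_URL = "http://localhost:8080"


def resolve_sie_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the server URL from SIE_URL (hypothetical name), or the local default."""
    env = os.environ if env is None else env
    url = env.get("SIE_URL", "").strip()
    # Drop a trailing slash so callers can append paths consistently.
    return (url or DEFAULT_SIE_URL).rstrip("/")
```

On a laptop the variable stays unset; in a cluster the deployment sets it to the service address, and the calling code does not change.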

Try it now

The fastest path is the hosted Hugging Face Space. It runs the same code as the local Docker version. The first click on a fresh replica is slow (a 3-5 min cold load while LightOnOCR’s 4 GB of weights download); subsequent clicks take 20-30 s on the free CPU tier.

For local Docker (no API key, runs entirely on your machine):

git clone https://github.com/superlinked/sie
cd sie/examples/document-ocr
npm install
npm start

npm start runs docker compose up -d, which boots ghcr.io/superlinked/sie-server:latest-cpu-transformers5 and preloads three small models. It then starts a Node UI server and opens http://localhost:3032.
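If you script against the local setup, it can help to wait until the UI answers before doing anything else. The following is a small sketch using only the Python standard library; wait_until_ready and http_ok are hypothetical helper names, and the only address taken from this page is http://localhost:3032.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 300
    except OSError:
        # urllib.error.URLError and HTTPError are OSError subclasses.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll url until probe succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_until_ready("http://localhost:3032") returns True once the page responds and False if the timeout passes first.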

What runs in this demo

Recognition (default lightonai/LightOnOCR-2-1B): Pixtral encoder + Qwen3 decoder, 2.1B. Produces Markdown directly. Alternates include PaddleOCR-VL-1.5 and GLM-OCR (both GPU-only; the UI auto-disables them on the CPU image).
Structured extraction (default naver-clova-ix/donut-base-finetuned-cord-v2): fine-tuned for the CORD receipt schema. Alternates include Donut on DocVQA and Donut on RVL-CDIP (16-class document classifier).
Zero-shot NER (default urchade/gliner_multi-v2.1): 280M, multilingual, declare labels at query time. Alternates include GLiNER large, GLiNER PII, and NuMind’s NuNER Zero (different architecture, same SDK call).
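The GPU-only rule described above can be pictured as a flag on each catalog entry. This is an illustrative sketch, not the demo's UI code: ModelOption, CATALOG, and selectable are hypothetical names, and the alternates are listed by the display names used on this page because their full model IDs are not given here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str              # model ID or display name as listed above
    stage: str             # "recognition", "structured", or "ner"
    gpu_only: bool = False


# Partial catalog, limited to entries named in this section.
CATALOG = [
    ModelOption("lightonai/LightOnOCR-2-1B", "recognition"),
    ModelOption("PaddleOCR-VL-1.5", "recognition", gpu_only=True),
    ModelOption("GLM-OCR", "recognition", gpu_only=True),
    ModelOption("naver-clova-ix/donut-base-finetuned-cord-v2", "structured"),
    ModelOption("urchade/gliner_multi-v2.1", "ner"),
]


def selectable(catalog: Sequence[ModelOption], stage: str, has_gpu: bool) -> List[str]:
    """Return the model names a dropdown for this stage should leave enabled."""
    return [
        option.name
        for option in catalog
        if option.stage == stage and (has_gpu or not option.gpu_only)
    ]
```

On a CPU image, selectable(CATALOG, "recognition", has_gpu=False) leaves only the default recognition model enabled.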

What you can do

Click any sample (receipt, invoice, business-card, event-poster, slide, letter) and watch all three model classes run in one pipeline. The footer prints per-stage timings.
Swap the recognition dropdown. The “See the SIE call” disclosure updates with the new model ID; nothing else changes.
Swap the structured dropdown from donut-cord-v2 to donut-rvlcdip. Same architecture, different fine-tune. Output shape changes from a CORD-shaped JSON tree to a 16-class document classification.
Swap NER from gliner_multi-v2.1 to NuNER_Zero. Different model family entirely (NuMind vs urchade), same SDK call, same labels.
Compare two samples back to back: click receipt.png (Donut on CORD dominates because that’s its fine-tuning distribution), then click letter.png (the opposite: Donut produces garbage shaped like CORD; recognition + GLiNER carry the pipeline).
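To reproduce per-stage timings like the ones the footer prints, each stage can be wrapped in a small timer. This is a sketch of one way to do it, not the demo's implementation; timed and format_footer are hypothetical names, and the footer format is invented for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under the stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the time even if the stage raised.
        timings[stage] = time.perf_counter() - start


def format_footer(timings: Dict[str, float]) -> str:
    """Render timings in insertion order, one decimal place each."""
    return " | ".join(f"{stage}: {seconds:.1f} s" for stage, seconds in timings.items())
```

Wrapping each client.extract call in "with timed(stage_name, timings):" fills the dictionary, which makes back-to-back comparisons such as receipt.png versus letter.png easy to log.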

SIE features used

extract for all three model classes

Source

The example lives in examples/document-ocr in the main SIE repo. The hosted Hugging Face Space is built from the same source.