Swap an OCR model with one identifier change
What this is
OCR is rarely a single-model problem. A real pipeline has three concerns:
- Recognition (image to text). VLM-OCRs like LightOnOCR or PaddleOCR-VL take a whole document and emit Markdown.
- Structured extraction (image to JSON). End-to-end document models like Donut on CORD skip the text intermediate entirely and emit nested JSON directly.
- Zero-shot NER on the recognized text (text to typed fields). Use this when you want to declare entity labels at query time instead of fine-tuning a new model.
This demo wires all three behind one SIE server. Pick a sample document on the left, swap any of the three models in the dropdowns, and watch SIE hot-swap the underlying architecture with a single identifier change.
```python
# Recognition: VLM-OCR, returns Markdown
client.extract("lightonai/LightOnOCR-2-1B", Item(images=[image_bytes]))

# Structured: end-to-end Donut, returns JSON tree
client.extract("naver-clova-ix/donut-base-finetuned-cord-v2", Item(images=[image_bytes]))

# NER: zero-shot, returns typed entities
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=recognized_markdown),
    labels=["merchant", "total", "date"],
)
```

Each panel in the UI has a “See the SIE call” disclosure that shows the exact line of code that just ran. Swap a dropdown and the snippet updates with the one parameter that changed.
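The three calls can also be chained, with the recognition output feeding the NER stage. The sketch below is a minimal illustration, not the demo's actual code: `FakeClient` and the local `Item` dataclass are stand-ins so the snippet runs without a server, and `to_text` is a hypothetical helper, because the shape of the recognition result is not described on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Item:
    """Local stand-in for the SDK's Item (assumption: carries images or text)."""
    images: list = field(default_factory=list)
    text: Optional[str] = None


class FakeClient:
    """Stand-in client that records calls instead of contacting a server."""

    def __init__(self) -> None:
        self.calls: list = []

    def extract(self, model_id: str, item: Item, labels: Optional[list] = None) -> Any:
        self.calls.append((model_id, labels))
        # Canned result for illustration; a real server's result shape may differ.
        return {"model": model_id, "text": "TOTAL 12.50"}


def run_pipeline(client: Any, image_bytes: bytes, models: dict, labels: list,
                 to_text: Callable[[Any], str]) -> dict:
    """Run recognition, structured extraction, and NER with the same call shape."""
    recognized = client.extract(models["recognition"], Item(images=[image_bytes]))
    structured = client.extract(models["structured"], Item(images=[image_bytes]))
    # NER consumes the text produced by the recognition stage.
    entities = client.extract(models["ner"], Item(text=to_text(recognized)), labels=labels)
    return {"recognized": recognized, "structured": structured, "entities": entities}


MODELS = {
    "recognition": "lightonai/LightOnOCR-2-1B",
    "structured": "naver-clova-ix/donut-base-finetuned-cord-v2",
    "ner": "urchade/gliner_multi-v2.1",
}

client = FakeClient()
result = run_pipeline(client, b"fake-image-bytes", MODELS,
                      ["merchant", "total", "date"], to_text=lambda r: r["text"])
called_models = [model_id for model_id, _ in client.calls]
```

Swapping a model then means changing one value in `MODELS`; `run_pipeline` itself stays untouched.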
Why SIE specifically for OCR
You could build this with three SaaS APIs (one OCR provider, one document AI provider, one NER provider). It would work. It would also be three auth flows, three rate-limit budgets, three SDKs, and three deployment stories.
SIE collapses that into one process:
- One server, three primitives. `encode`, `score`, `extract`. This demo uses `extract` for all three model classes.
- One SDK call. `client.extract(model_id, item)` works for VLM-OCR, end-to-end document AI, and zero-shot NER. Swap the model ID alone.
- Open source, runs in your VPC. Customer documents never leave the host running the compose. Compliance teams stop blocking you.
- Same code laptop to Kubernetes. SIE ships a Helm chart, KEDA autoscaling, and Terraform modules. The code in this demo runs unchanged against a production cluster; only the URL changes.
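One way to keep the URL as the only difference between laptop and cluster is to read it from the environment. This is a sketch under assumptions: the `SIE_URL` variable name and the `http://localhost:8080` default are hypothetical and not taken from the SIE docs, so adjust both to your setup.

```python
import os
from typing import Mapping, Optional

# Assumed local default; check your compose file for the real port.
DEFAULT_URL = "http://localhost:8080"


def resolve_server_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the URL from SIE_URL (hypothetical name), else the local default."""
    env = os.environ if env is None else env
    url = env.get("SIE_URL", "").strip()
    # Drop trailing slashes so later path joins stay predictable.
    return url.rstrip("/") if url else DEFAULT_URL


local_url = resolve_server_url({})
cluster_url = resolve_server_url({"SIE_URL": "https://sie.internal.example.com/"})
```

The resolved value would then be passed to whatever client constructor you use, leaving the rest of the code identical in both environments.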
Try it now
The fastest path is the hosted Hugging Face Space. It runs the same code as the local Docker version. First click on a fresh replica is slow (3-5 min cold load while LightOnOCR’s 4 GB of weights download); subsequent clicks are 20-30 s on the free CPU tier.
For local Docker (no API key, runs entirely on your machine):
```bash
git clone https://github.com/superlinked/sie
cd sie/examples/document-ocr
npm install
npm start
```

`npm start` runs `docker compose up -d` (boots `ghcr.io/superlinked/sie-server:latest-cpu-transformers5`, preloads three small models), then starts a Node UI server and opens http://localhost:3032.
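Because the first boot can take a while, a script that drives the demo may want to wait until the UI answers before doing anything else. The sketch below polls a URL using only the Python standard library; `wait_until_up` and `http_probe` are illustrative helpers, not part of the demo, and the probe is injectable so the logic can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server is up even if this path returns an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, timeout, ...


def wait_until_up(url: str,
                  probe: Callable[[str], bool] = http_probe,
                  timeout_s: float = 300.0,
                  interval_s: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the probe succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a fake probe that succeeds on the third attempt.
answers = iter([False, False, True])
ready = wait_until_up("http://localhost:3032",
                      probe=lambda _url: next(answers),
                      sleep=lambda _seconds: None)
```

Against the real demo you would call `wait_until_up("http://localhost:3032")` with the default probe.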
What runs in this demo
- Recognition (default `lightonai/LightOnOCR-2-1B`): Pixtral encoder + Qwen3 decoder, 2.1B. Produces Markdown directly. Alternates include PaddleOCR-VL-1.5 and GLM-OCR (both GPU-only; the UI auto-disables them on the CPU image).
- Structured extraction (default `naver-clova-ix/donut-base-finetuned-cord-v2`): fine-tuned for the CORD receipt schema. Alternates include Donut on DocVQA and Donut on RVL-CDIP (16-class document classifier).
- Zero-shot NER (default `urchade/gliner_multi-v2.1`): 280M, multilingual, declare labels at query time. Alternates include GLiNER large, GLiNER PII, and NuMind’s NuNER Zero (different architecture, same SDK call).
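The model lineup and the GPU-only rule can be pictured as a small registry that the dropdowns filter against. This is a hypothetical sketch of that idea, not the demo's implementation: `ModelOption`, `REGISTRY`, and `selectable` are made-up names, and the alternates are listed by the display names used on this page because their full model IDs are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str               # full model ID for defaults, display name for alternates
    gpu_only: bool = False  # True when the model cannot run on the CPU image


REGISTRY = {
    "recognition": [
        ModelOption("lightonai/LightOnOCR-2-1B"),
        ModelOption("PaddleOCR-VL-1.5", gpu_only=True),
        ModelOption("GLM-OCR", gpu_only=True),
    ],
    "structured": [
        ModelOption("naver-clova-ix/donut-base-finetuned-cord-v2"),
        ModelOption("Donut on DocVQA"),
        ModelOption("Donut on RVL-CDIP"),
    ],
    "ner": [
        ModelOption("urchade/gliner_multi-v2.1"),
        ModelOption("GLiNER large"),
        ModelOption("GLiNER PII"),
        ModelOption("NuNER Zero"),
    ],
}


def selectable(stage: str, has_gpu: bool) -> list[str]:
    """Return the options a dropdown should leave enabled for this stage."""
    return [m.name for m in REGISTRY[stage] if has_gpu or not m.gpu_only]


cpu_recognition = selectable("recognition", has_gpu=False)
gpu_recognition = selectable("recognition", has_gpu=True)
```

On a CPU image only the default recognition model remains selectable, which mirrors the auto-disable behavior described in the list.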
What you can do
- Click any sample (`receipt`, `invoice`, `business-card`, `event-poster`, `slide`, `letter`) and watch all three model classes run in one pipeline. The footer prints per-stage timings.
- Swap the recognition dropdown. The “See the SIE call” disclosure updates with the new model ID; nothing else changes.
- Swap the structured dropdown from `donut-cord-v2` to `donut-rvlcdip`. Same architecture, different fine-tune. Output shape changes from a CORD-shaped JSON tree to a 16-class document classification.
- Swap NER from `gliner_multi-v2.1` to `NuNER_Zero`. Different model family entirely (NuMind vs urchade), same SDK call, same labels.
- Compare two samples back to back: click `receipt.png` (Donut on CORD dominates because that’s its fine-tuning distribution), then click `letter.png` (the opposite: Donut produces garbage shaped like CORD; recognition + GLiNER carry the pipeline).
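Per-stage timings like the ones in the footer can be collected by wrapping each stage in a timer. The sketch below shows one way to do that with `time.perf_counter`; `run_timed` and `format_footer` are illustrative names, and the lambdas are placeholders standing in for the real `client.extract` calls.

```python
import time
from typing import Any, Callable


def run_timed(stages: list[tuple[str, Callable[[], Any]]]) -> tuple[dict, dict]:
    """Run stages in order and record wall-clock milliseconds for each."""
    results: dict = {}
    timings_ms: dict = {}
    for name, fn in stages:
        start = time.perf_counter()
        results[name] = fn()
        timings_ms[name] = (time.perf_counter() - start) * 1000.0
    return results, timings_ms


def format_footer(timings_ms: dict) -> str:
    """Render timings as a single footer line."""
    return " | ".join(f"{name}: {ms:.0f} ms" for name, ms in timings_ms.items())


# Placeholder stages; in the demo each would be one client.extract(...) call.
results, timings = run_timed([
    ("recognition", lambda: "# Receipt"),
    ("structured", lambda: {"total": "12.50"}),
    ("ner", lambda: [("total", "12.50")]),
])
footer = format_footer(timings)
```

Timing each stage separately makes it easy to see which model dominates latency after a dropdown swap.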
SIE features used
- `extract` for all three model classes
Source
The example lives in `examples/document-ocr` in the main SIE repo. The hosted Hugging Face Space is built from the same source.