---
title: Swap an OCR model with one identifier change
description: A multi-model OCR pipeline where the recognition VLM, the end-to-end document model, and the zero-shot NER are all the same SIE extract call. Only the model ID changes.
canonical_url: https://superlinked.com/docs/examples/document-ocr
last_updated: 2026-05-14
---

<LinkCard title="Try it on Hugging Face Spaces" description="superlinked/document-ocr · zero install, click any sample" href="https://huggingface.co/spaces/superlinked/document-ocr" />

<LinkCard title="View on GitHub" description="examples/document-ocr" href="https://github.com/superlinked/sie/tree/main/examples/document-ocr" />

## What this is

OCR is rarely a single-model problem. A real pipeline has three concerns:

1. **Recognition** (image to text). VLM-OCRs like LightOnOCR or PaddleOCR-VL take a whole document and emit Markdown.
2. **Structured extraction** (image to JSON). End-to-end document models like Donut on CORD skip the text intermediate entirely and emit nested JSON directly.
3. **Zero-shot NER on the recognized text** (text to typed fields). When you want to declare entity labels at query time instead of fine-tuning a new model.

This demo wires all three behind one SIE server. Pick a sample document on the left, swap any of the three models in the dropdowns, and watch SIE hot-swap the underlying architecture with a single identifier change.

```python
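# Assumes `client` is a connected SIE SDK client, `Item` is the SDK's input type,
# and `image_bytes` holds the raw bytes of the document image.

# Recognition: VLM-OCR, returns Markdown
client.extract("lightonai/LightOnOCR-2-1B", Item(images=[image_bytes]))

# Structured: end-to-end Donut, returns JSON tree
client.extract("naver-clova-ix/donut-base-finetuned-cord-v2", Item(images=[image_bytes]))

# NER: zero-shot, returns typed entities.
# `recognized_markdown` is the text produced by the recognition step.
client.extract(
    "urchade/gliner_multi-v2.1",
    Item(text=recognized_markdown),
    labels=["merchant", "total", "date"],
)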
```

Each panel in the UI has a "See the SIE call" disclosure that shows the exact line of code that just ran. Swap a dropdown and the snippet updates to show the one parameter that changed.
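The three calls chain naturally: recognition feeds NER, while structured extraction works on the image directly. The following is a minimal sketch of that wiring, not the demo's actual implementation. `Item` is a local stand-in for the SDK type, `StubClient` is a hypothetical replacement for a real SIE client so the code runs without a server, and `to_text` is a hypothetical callback, because this page does not specify the shape of the value `extract` returns.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class Item:
    """Local stand-in for the SDK's Item; only the two fields used above."""
    images: Optional[List[bytes]] = None
    text: Optional[str] = None


class StubClient:
    """Hypothetical stand-in for an SIE client: records calls, runs no models."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def extract(self, model_id: str, item: Item,
                labels: Optional[Sequence[str]] = None) -> dict:
        self.calls.append(model_id)
        return {"model": model_id, "text": item.text,
                "labels": None if labels is None else list(labels)}


def run_pipeline(client: Any, image_bytes: bytes,
                 to_text: Callable[[Any], str], labels: Sequence[str],
                 recognition_model: str = "lightonai/LightOnOCR-2-1B",
                 structured_model: str = "naver-clova-ix/donut-base-finetuned-cord-v2",
                 ner_model: str = "urchade/gliner_multi-v2.1") -> dict:
    """Run all three stages; swapping a stage means passing a different ID."""
    image_item = Item(images=[image_bytes])
    recognition = client.extract(recognition_model, image_item)
    structured = client.extract(structured_model, image_item)
    # NER consumes the recognized text, not the image.
    entities = client.extract(ner_model, Item(text=to_text(recognition)),
                              labels=list(labels))
    return {"recognition": recognition, "structured": structured,
            "entities": entities}


client = StubClient()
result = run_pipeline(client, b"fake-image-bytes",
                      to_text=lambda _response: "# Receipt",
                      labels=["merchant", "total", "date"])
```

Against a real server, you would pass the SDK client in place of `StubClient` and supply a `to_text` that matches the actual response shape.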

## Why SIE specifically for OCR

You could build this with three SaaS APIs (one OCR provider, one document AI provider, one NER provider). It would work. It would also be three auth flows, three rate-limit budgets, three SDKs, and three deployment stories.

SIE collapses that into one process:

- **One server, three primitives.** `encode`, `score`, `extract`. This demo uses `extract` for all three model classes.
- **One SDK call.** `client.extract(model_id, item)` works for VLM-OCR, end-to-end document AI, and zero-shot NER. Swap the model ID alone.
- **Open source, runs in your VPC.** Customer documents never leave the host running the compose. Compliance teams stop blocking you.
- **Same code laptop to Kubernetes.** SIE ships a Helm chart, KEDA autoscaling, and Terraform modules. The code in this demo runs unchanged against a production cluster; only the URL changes.
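One way to keep the code identical between a laptop and a cluster is to read the server URL from the environment. The sketch below assumes a hypothetical `SIE_BASE_URL` variable and a placeholder local address; neither name nor port comes from the SIE documentation.

```python
import os
from typing import Mapping


def sie_base_url(default: str, env: Mapping[str, str] = os.environ,
                 env_var: str = "SIE_BASE_URL") -> str:
    """Return the server URL from `env`, falling back to `default`.

    `SIE_BASE_URL` is a hypothetical variable name chosen for this sketch.
    """
    override = env.get(env_var, "").strip()
    # Strip a trailing slash so callers can append paths consistently.
    return (override or default).rstrip("/")


# Placeholder addresses for illustration only.
local_url = sie_base_url("http://localhost:8080", env={})
cluster_url = sie_base_url("http://localhost:8080",
                           env={"SIE_BASE_URL": "https://sie.internal.example/"})
```

The rest of the pipeline code never needs to know which of the two it is talking to.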

## Try it now

The fastest path is the [hosted Hugging Face Space](https://huggingface.co/spaces/superlinked/document-ocr). It runs the same code as the local Docker version. The first click on a fresh replica is slow: expect a 3-5 min cold load while LightOnOCR's 4 GB of weights download. Subsequent clicks take 20-30 s on the free CPU tier.

For local Docker (no API key, runs entirely on your machine):

```bash
git clone https://github.com/superlinked/sie
cd sie/examples/document-ocr
npm install
npm start
```

`npm start` runs `docker compose up -d` (boots `ghcr.io/superlinked/sie-server:latest-cpu-transformers5`, preloads three small models), then starts a Node UI server and opens `http://localhost:3032`.

> **Tip: the first start downloads ~5 GB of model weights.**
>
> LightOnOCR-2-1B is ~4 GB on its own. Weights cache in a named Docker volume (`sie-cache`) so `docker compose down` followed by `docker compose up` skips the download.

## What runs in this demo

- **Recognition** (default `lightonai/LightOnOCR-2-1B`): Pixtral encoder + Qwen3 decoder, about 1B parameters (the `2` in the name is the model version). Produces Markdown directly. Alternates include PaddleOCR-VL-1.5 and GLM-OCR (both GPU-only; the UI auto-disables them on the CPU image).
- **Structured extraction** (default `naver-clova-ix/donut-base-finetuned-cord-v2`): fine-tuned for the CORD receipt schema. Alternates include Donut on DocVQA and Donut on RVL-CDIP (16-class document classifier).
- **Zero-shot NER** (default `urchade/gliner_multi-v2.1`): 280M, multilingual, declare labels at query time. Alternates include GLiNER large, GLiNER PII, and NuMind's NuNER Zero (different architecture, same SDK call).
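The GPU-only rule for recognition alternates can be sketched as a small option list with a filter. This is an illustration of the idea, not the demo UI's code: `ModelOption` and `selectable` are hypothetical names, and the two alternates appear under the display names used on this page, not as full model IDs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ModelOption:
    """One dropdown entry (hypothetical type for this sketch)."""
    name: str
    gpu_only: bool = False


RECOGNITION_OPTIONS = [
    ModelOption("lightonai/LightOnOCR-2-1B"),
    # Display names as written on this page, not full model IDs.
    ModelOption("PaddleOCR-VL-1.5", gpu_only=True),
    ModelOption("GLM-OCR", gpu_only=True),
]


def selectable(options: Sequence[ModelOption], has_gpu: bool) -> List[str]:
    """Return the names a user may pick; GPU-only entries need a GPU."""
    return [o.name for o in options if has_gpu or not o.gpu_only]


cpu_choices = selectable(RECOGNITION_OPTIONS, has_gpu=False)
gpu_choices = selectable(RECOGNITION_OPTIONS, has_gpu=True)
```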

## What you can do

- Click any sample (`receipt`, `invoice`, `business-card`, `event-poster`, `slide`, `letter`) and watch all three model classes run in one pipeline. The footer prints per-stage timings.
- Swap the recognition dropdown. The "See the SIE call" disclosure updates with the new model ID; nothing else changes.
- Swap the structured dropdown from `donut-cord-v2` to `donut-rvlcdip`. Same architecture, different fine-tune. Output shape changes from a CORD-shaped JSON tree to a 16-class document classification.
- Swap NER from `gliner_multi-v2.1` to `NuNER_Zero`. Different model family entirely (NuMind vs urchade), same SDK call, same labels.
- Compare two samples back to back: click `receipt.png` (Donut on CORD does best here because receipts match its fine-tuning distribution), then click `letter.png` (the opposite: Donut emits meaningless CORD-shaped JSON, so recognition + GLiNER carry the pipeline).
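Per-stage timings like those in the footer can be collected by wrapping each stage in a timer. The sketch below assumes each stage is a zero-argument callable; `run_timed`, `format_footer`, and the footer layout are hypothetical and not taken from the demo source.

```python
import time
from typing import Any, Callable, Dict, Tuple


def run_timed(stages: Dict[str, Callable[[], Any]]
              ) -> Tuple[Dict[str, Any], Dict[str, float]]:
    """Run stages in insertion order; return results and elapsed seconds."""
    results: Dict[str, Any] = {}
    timings: Dict[str, float] = {}
    for name, stage in stages.items():
        start = time.perf_counter()
        results[name] = stage()
        timings[name] = time.perf_counter() - start
    return results, timings


def format_footer(timings: Dict[str, float]) -> str:
    """Render timings as one line (layout is an assumption)."""
    return " | ".join(f"{name}: {seconds:.2f} s"
                      for name, seconds in timings.items())


# Stand-in stages; real ones would wrap the three `client.extract` calls.
results, timings = run_timed({
    "recognition": lambda: "# Receipt",
    "structured": lambda: {"menu": []},
    "ner": lambda: [],
})
footer = format_footer(timings)
```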

## SIE features used

- `extract` for all three model classes

## Source

The example lives in [examples/document-ocr](https://github.com/superlinked/sie/tree/main/examples/document-ocr) in the main SIE repo. The hosted Hugging Face Space is built from the same source.
