
What is an OCR Model?

An OCR (Optical Character Recognition) model is a machine learning system that extracts text from images, scanned documents, or PDFs. Modern OCR models use deep learning (typically a CNN encoder to detect text regions and a sequence model to decode characters) and can handle handwriting, complex layouts, tables, and multi-language documents that rule-based OCR systems fail on.

Why does OCR matter for document processing pipelines?

A significant portion of enterprise knowledge exists in non-machine-readable formats: scanned contracts, PDF invoices, printed reports, photographed receipts. Before this content can be indexed for semantic search or RAG, it must be converted to text.

Poor OCR is one of the most common causes of degraded RAG quality. If the text extraction step introduces errors, garbled text, or missed content, every downstream step (chunking, embedding, retrieval) inherits those errors.

How does a modern OCR model work?

Modern deep learning OCR uses a multi-stage pipeline:

1. Layout detection: a CNN detects regions of interest such as text blocks, tables, figures, and headers. This stage handles complex multi-column layouts and mixed-content pages.

2. Text line segmentation: detected regions are segmented into individual text lines or words.

3. Text recognition: each segment is passed through a sequence model (typically a CTC-based or attention-based decoder) that maps pixel sequences to character sequences.

4. Post-processing: a language model may correct recognition errors using contextual probability (e.g. detecting that “rhe” should be “the”).
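As an illustration of the recognition step, here is a minimal sketch of greedy CTC decoding, assuming the recogniser emits one score per class for each horizontal frame of a text line. The function name, the blank index, and the toy alphabet are assumptions made for this example, not part of any specific OCR library.

```python
from itertools import groupby

BLANK = 0  # class index reserved for the CTC blank symbol (assumed for this sketch)


def ctc_greedy_decode(frame_scores, alphabet):
    """Decode per-frame class scores into text with greedy CTC decoding.

    frame_scores: list of per-frame score lists, one score per class.
    Class 0 is the blank; class i (i >= 1) maps to alphabet[i - 1].
    """
    # 1. Pick the highest-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Collapse runs of the same class into a single occurrence.
    collapsed = [index for index, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(alphabet[index - 1] for index in collapsed if index != BLANK)


def one_hot(index, size):
    """Build a toy score vector that puts all its weight on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = "ehlo"  # classes 1..4; class 0 is the blank
# Frame-level path: h h _ e l _ l o o
path = [2, 2, 0, 1, 3, 0, 3, 4, 4]
frames = [one_hot(i, len(alphabet) + 1) for i in path]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

The blank between the two "l" frames is what lets the decoder keep a genuine double letter while still merging the repeated "h" and "o" frames.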

Types of OCR models

Type	Best for	Examples
Traditional (Tesseract)	Clean, simple printed text	Tesseract v4/5
Deep learning OCR	Complex layouts, handwriting	EasyOCR, PaddleOCR
Document understanding	Tables, forms, structured docs	LayoutLM, Donut
Vision-language models	Any document type + Q&A	Florence-2, Qwen-VL
SOTA document extraction	Production-grade accuracy	Surya, Marker

SIE supports Florence-2 and other vision-language models for document understanding, which goes beyond plain text extraction to structured information extraction.

OCR challenges in practice

Complex layouts: multi-column PDFs, tables, sidebars, and footnotes confuse simple OCR. Layout-aware models like LayoutLM detect structure before extracting text.

Low-resolution scans: document images below ~150 DPI degrade OCR accuracy significantly. Pre-processing with upscaling or denoising helps.
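A quick way to flag low-resolution scans before OCR is to estimate the effective DPI from the pixel dimensions and the physical page size. The sketch below uses the ~150 DPI threshold mentioned above. The 300 DPI upscaling target and the helper names are assumptions for illustration.

```python
import math

MIN_DPI = 150     # threshold below which accuracy degrades, per the text above
TARGET_DPI = 300  # assumed upscaling target for this sketch


def effective_dpi(width_px, height_px, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical scan resolutions."""
    return min(width_px / page_width_in, height_px / page_height_in)


def upscale_factor(current_dpi, target_dpi=TARGET_DPI):
    """Return the integer scale factor needed to reach at least target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1, math.ceil(target_dpi / current_dpi))


# A US Letter page (8.5 x 11 inches) scanned at 850 x 1100 pixels
dpi = effective_dpi(850, 1100, 8.5, 11.0)
needs_preprocessing = dpi < MIN_DPI
factor = upscale_factor(dpi)
print(dpi, needs_preprocessing, factor)
```

Upscaling cannot restore detail that was never captured, so rescanning at a higher resolution is preferable when the source document is still available.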

Handwriting: recognition is much harder than for printed text and usually requires models trained specifically on handwriting.

Tables: extracting structured data from tables requires both OCR and table structure detection. Donut and similar models handle this end-to-end.

Mathematical notation: standard OCR engines tend to flatten or garble equations. Models trained to emit LaTeX-style markup, such as Nougat, are better suited to academic papers.

How OCR fits into a RAG pipeline

Raw PDFs / scanned docs
         ↓
    [OCR model] ← SIE hosts Florence-2, Surya
         ↓
  Extracted text + structure
         ↓
   [Chunking strategy]
         ↓
  [Embedding model] ← SIE hosts BGE-M3, E5
         ↓
    Vector database
         ↓
  [Retrieval + reranking]
         ↓
     LLM generation

SIE handles both the OCR and embedding steps in one self-hosted cluster, keeping all document content within your own cloud infrastructure.

Using OCR models via SIE

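from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")


def semantic_chunks(text, max_chars=1000):
    # Placeholder chunker, not part of sie_sdk: packs paragraphs into chunks and
    # starts a new chunk when the next paragraph would push it past max_chars.
    # A single oversized paragraph stays whole. Swap in your own strategy.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


# Extract text from a scanned PDF or image
with open("contract.pdf", "rb") as f:
    result = client.extract(
        "microsoft/Florence-2-base",
        file=f,
        task="document_ocr",
    )

extracted_text = result.text
layout = result.layout  # sections, tables, headings

# Chunk the extracted text, then embed each chunk
chunks = semantic_chunks(extracted_text)
encode_results = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
vectors = [r["dense"] for r in encode_results]  # one dense vector per chunk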

OCR quality evaluation

Key metrics for evaluating OCR quality:

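Character Error Rate (CER): the number of character substitutions, insertions, and deletions needed to turn the OCR output into the reference text, divided by the number of characters in the reference. Best metric for raw accuracy.

Word Error Rate (WER): the same calculation at word level, so a single wrong character counts as one wrong word. More meaningful for downstream NLP tasks.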

Layout accuracy: does the extracted structure (headings, tables, columns) match the original? Crucial for structured document RAG.
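CER and WER can both be computed from the Levenshtein edit distance between a manually verified reference and the OCR output. The following is a minimal dependency-free sketch. The function names are chosen for this example, and it does no text normalisation (case, punctuation, whitespace), which a real evaluation would likely need.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the quick brown fox"
hypothesis = "rhe quick brown fox"  # one misread character
print(f"CER={cer(reference, hypothesis):.3f}  WER={wer(reference, hypothesis):.3f}")
```

A single misread character here yields a CER of 1/19 but a WER of 1/4, which is why WER tends to look much worse than CER on the same output.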

For production pipelines, sample 50-100 representative documents and manually verify OCR output before committing to a model.

Frequently asked questions

What is the difference between OCR and document understanding? OCR extracts text from images. Document understanding goes further: it identifies the structure (which text is a heading, which is a table cell, what the numbers in a table mean). Models like Florence-2 and Donut perform document understanding.

Can standard PDF text extraction replace OCR? For digitally created PDFs (not scans), yes. Libraries such as pdfplumber or pymupdf can extract the embedded text directly without OCR. OCR is only needed for scanned documents or image-based PDFs.

How do I handle PDFs that mix digital and scanned pages? Detect page type first: if a PDF page has no embedded text layer, apply OCR. If it does, use direct text extraction. Many document processing libraries handle this automatically.
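The page-type check can be sketched as a small routing function. This example assumes the embedded text of each page has already been read (for instance with pymupdf's `page.get_text()`) and passed in as a list of strings. The `route_pages` name and the `min_chars` threshold are assumptions for illustration.

```python
def route_pages(page_texts, min_chars=20):
    """Return "text" or "ocr" for each page based on its embedded text layer.

    page_texts: one string per page holding whatever text the PDF embeds.
    min_chars: assumed minimum number of non-whitespace characters for a page
               to count as having a usable text layer.
    """
    routes = []
    for text in page_texts:
        # Ignore whitespace so pages with only stray spaces still go to OCR.
        visible = sum(1 for ch in text if not ch.isspace())
        routes.append("text" if visible >= min_chars else "ocr")
    return routes


pages = [
    "Invoice 2024-001\nTotal due: 1,250.00 EUR",  # digital page
    "",                                           # scanned page, no text layer
    " \n ",                                       # scanned page, whitespace only
]
routes = route_pages(pages)
print(routes)
```

A threshold above zero helps with scanned pages that carry only a stamped page number or a short watermark in their text layer, though the right value depends on the document set.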