Pre-processing

What is a Document Extraction Pipeline?

A document extraction pipeline is an automated system that ingests raw documents (PDFs, Word files, scanned images, HTML pages), extracts structured information from them (text, tables, entities, metadata), and prepares the output for downstream use in search indexes, databases, or RAG systems. It combines OCR, layout detection, parsing, cleaning, and chunking into a repeatable, scalable workflow.

Why do document extraction pipelines matter?

Enterprise knowledge is rarely clean, structured text. It lives in contracts, invoices, reports, presentations, emails, and scanned forms. Building a RAG system or semantic search index over this content requires a pipeline that can reliably handle all these formats and produce consistent, high-quality text output.

Poor extraction is a common cause of poor RAG quality. An LLM can only reason about what it’s given: if extraction produces garbled text or drops sections, the downstream answers will be wrong regardless of model quality.

What are the stages of a document extraction pipeline?

1. Ingestion

Accept documents from various sources and formats:

PDF (digital and scanned)
Word, Excel, PowerPoint
HTML / web pages
Images (JPEG, PNG, TIFF)
Email formats (EML, MSG)

2. Format detection and routing

Determine the document type and route to the appropriate extraction method:

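def route_document(file_path):
    """Pick an extraction method based on the detected MIME type."""
    # detect_mime_type, has_text_layer and the extract_* functions are
    # placeholders for whichever libraries the pipeline uses.
    mime = detect_mime_type(file_path)
    if mime == "application/pdf":
        if has_text_layer(file_path):
            return extract_pdf_text(file_path)      # digital PDF
        return extract_with_ocr(file_path)          # scanned PDF
    if mime in ("image/jpeg", "image/png", "image/tiff"):
        return extract_with_ocr(file_path)          # images always need OCR
    if mime == "application/vnd.openxmlformats...":
        return extract_docx(file_path)
    # Fail loudly instead of silently returning None for unknown formats.
    raise ValueError(f"Unsupported document type: {mime}")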
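
The `detect_mime_type` helper called in the routing function is not defined in this article. A minimal sketch, assuming detection by leading magic bytes with a fallback to the file extension, could look like this. The signature table and the fallback behaviour are illustrative assumptions, not the API of any particular library.

```python
import mimetypes

# Leading "magic" bytes for the binary formats routed above.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def detect_mime_type(file_path):
    """Guess a MIME type from file content, falling back to the extension."""
    with open(file_path, "rb") as fh:
        head = fh.read(16)
    for signature, mime in _SIGNATURES:
        if head.startswith(signature):
            return mime
    # Office formats are ZIP containers, so content alone is ambiguous;
    # use the extension for those and for anything unrecognised.
    guessed, _encoding = mimetypes.guess_type(file_path)
    return guessed or "application/octet-stream"
```

Checking content before the extension guards against mislabelled uploads, such as a scanned PDF saved with a `.docx` suffix.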

3. Text and structure extraction

Extract not just text but structural information:

Headings hierarchy: H1, H2, H3 for navigation and chunking
Tables: preserve row/column structure
Lists: preserve ordered/unordered structure
Figures and captions: extract alt text and captions
Headers and footers: often noise; usually filtered
Page numbers and watermarks: usually filtered
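
Filtering headers, footers, and page numbers can be approximated without a layout model. The sketch below assumes each page has already been split into lines, and treats a first or last line as noise when it recurs on most pages once digits are masked. The function name and the 60% threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter


def _key(line):
    """Mask digits so 'Page 1 of 9' and 'Page 2 of 9' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_fraction=0.6):
    """Drop first/last lines that recur on at least min_fraction of pages.

    pages is a list of pages, each a list of text lines.
    """
    if len(pages) < 2:
        return [list(lines) for lines in pages]  # nothing to compare against
    counts = Counter()
    for lines in pages:
        if lines:
            # A set avoids double counting on single-line pages.
            counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = min_fraction * len(pages)
    noise = {k for k, n in counts.items() if k and n >= threshold}

    cleaned = []
    for lines in pages:
        kept = list(lines)
        if kept and _key(kept[0]) in noise:
            kept = kept[1:]
        if kept and _key(kept[-1]) in noise:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

This only inspects the first and last line of each page, so multi-line headers or watermarks placed mid-page would need a position-aware approach.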

4. Cleaning and normalisation

Post-process raw extracted text:

Remove OCR artefacts (broken words, stray characters)
Normalise whitespace and line breaks
Detect and merge hyphenated words split across lines
Standardise encoding (UTF-8)
Remove boilerplate (legal disclaimers, cookie notices)
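
Several of these cleaning steps reduce to a few regular expressions. The following is a minimal sketch using only the standard library; the function name `clean_text` is hypothetical, and it covers Unicode normalisation, hyphen merging, and whitespace only, not OCR artefact repair or boilerplate removal.

```python
import re
import unicodedata


def clean_text(raw):
    """Apply basic normalisation to raw extracted text."""
    # Compose characters consistently (e.g. 'e' + combining accent -> 'é').
    text = unicodedata.normalize("NFC", raw)
    # Re-join words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Keep paragraph breaks (blank lines) but unwrap single line breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)
```

The hyphen rule is deliberately naive: it would also join a genuine compound such as "well-" / "known" that happens to break at the hyphen, so a dictionary check may be worth adding for high-precision use.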

5. Metadata extraction

Extract structured metadata alongside content:

Document title, author, creation date
Section headings as navigation metadata
Named entities (organisations, people, dates, amounts)
Language detection
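
A metadata record can travel with each document through the later stages. The sketch below is an illustration under simplifying assumptions: the `DocumentMetadata` class and `extract_metadata` function are hypothetical, and two regular expressions (ISO dates, amounts prefixed with USD, EUR, or GBP) stand in for a real named-entity model. Language detection is left out.

```python
import re
from dataclasses import dataclass, field

# Simplified stand-ins for a named-entity recogniser.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT = re.compile(r"(?:USD|EUR|GBP)\s?\d[\d,]*(?:\.\d{2})?")


@dataclass
class DocumentMetadata:
    title: str
    headings: list = field(default_factory=list)
    dates: list = field(default_factory=list)
    amounts: list = field(default_factory=list)


def extract_metadata(title, headings, text):
    """Collect title, headings, and regex-matched dates and amounts."""
    return DocumentMetadata(
        title=title,
        headings=list(headings),
        dates=ISO_DATE.findall(text),
        amounts=AMOUNT.findall(text),
    )
```

Storing these fields next to each chunk makes it possible to filter search results by date or section before any vector comparison runs.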

6. Chunking

Split the cleaned, structured text into appropriately sized segments for embedding (see: chunking strategies).
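
As one illustration of chunking, the sketch below implements a simple recursive splitter: it tries coarse separators first (paragraphs, then lines, sentences, words) and only descends to a finer one when a piece is still too large. It is an assumption-laden sketch rather than a reference implementation; in particular it counts whitespace-separated words as a rough proxy for tokens, where a real pipeline would use the embedding model's tokenizer.

```python
def recursive_chunk(text, max_tokens=512, separators=("\n\n", "\n", ". ", " ")):
    """Split text on progressively finer separators until each piece fits."""

    def n_tokens(s):
        return len(s.split())  # whitespace words as a rough token proxy

    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separators left: hard split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if n_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: retry with a finer separator.
            chunks.extend(recursive_chunk(piece, max_tokens, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Splitting on `". "` drops the full stop at a chunk boundary; that is usually harmless for embedding, but a regex split that keeps the delimiter would preserve it.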

7. Embedding

Encode chunks into vectors via SIE for indexing.

Document extraction pipeline architecture

Raw documents (S3 / GCS / upload)
          ↓
   [Format detection]
          ↓
   [OCR if needed] ← SIE: Florence-2, Surya
          ↓
  [Layout / structure parsing]
          ↓
  [Cleaning + normalisation]
          ↓
  [Entity / metadata extraction]
          ↓
     [Chunking]
          ↓
  [Embedding] ← SIE: BGE-M3, E5
          ↓
   [Vector DB indexing]

SIE handles the compute-intensive steps: OCR and embedding. Both run in your own AWS or GCP account.
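
One way to keep the stages in the diagram independently swappable is to treat each as a plain function and run them in order. The `run_pipeline` helper below is a hypothetical sketch of that idea, not part of SIE or any library named here; it also records which stages ran, which helps when debugging a bad output.

```python
def run_pipeline(document, stages):
    """Pass document through each (name, function) stage in order."""
    trace = []
    for name, stage in stages:
        document = stage(document)
        trace.append(name)
    return document, trace


# Toy stages standing in for cleaning and chunking.
stages = [
    ("clean", str.strip),
    ("chunk", str.split),
]
result, trace = run_pipeline("  scanned text  ", stages)
```

Because every stage has the same shape (one input, one output), an OCR engine or chunker can be replaced without touching the rest of the flow.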

Common tools for document extraction

Tool	Best for
pdfplumber / pymupdf	Digital PDF text extraction
Surya	High-accuracy OCR and layout detection
Marker	PDF → Markdown with structure
Unstructured.io	Multi-format pipeline orchestration
Florence-2 (via SIE)	Vision-language document understanding
Docling	IBM’s multi-format document parser
LlamaIndex / LangChain	Pipeline orchestration with many integrations

How SIE fits into a document extraction pipeline

from sie_sdk import SIEClient
from sie_sdk.types import Item
from unstructured.partition.pdf import partition_pdf

client = SIEClient("http://localhost:8080")

# Step 1: Extract with Unstructured (hi_res is its layout-aware strategy)
elements = partition_pdf("contract.pdf", strategy="hi_res")

# Step 2: Clean and chunk
# recursive_chunk is a placeholder for your chunking function
chunks = []
for element in elements:
    if element.category in ("NarrativeText", "Title", "ListItem"):
        chunks.extend(recursive_chunk(element.text, max_tokens=512))

# Step 3: Embed with SIE, sending all chunks in one call for efficiency
encode_results = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
vectors = [r["dense"] for r in encode_results]

# Step 4: Index in vector DB
# vector_db is a placeholder for your vector database client
for chunk, vector in zip(chunks, vectors):
    vector_db.upsert(text=chunk, vector=vector)

For high-quality OCR as part of the extraction step, SIE’s Florence-2 integration handles complex layouts, tables, and mixed-language documents.

Frequently asked questions

How do you handle tables in document extraction? Tables require specialised handling: they can’t be linearised (read left-to-right, top-to-bottom) without losing structure. Options include converting to Markdown table format, serialising as JSON, or extracting each cell with row/column metadata. For RAG, Markdown table format works well because LLMs can reason over it.
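
The Markdown option can be sketched in a few lines, assuming the table has already been extracted as a list of rows with the header first. The function name `table_to_markdown` is hypothetical.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Empty cells become "", pipes are escaped, newlines are flattened.
        cells = ["" if c is None else str(c) for c in row]
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Merged cells and multi-row headers do not map cleanly onto Markdown; for those, the JSON or per-cell metadata options are likely a better fit.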

What is the difference between extraction and parsing? Extraction pulls raw content out of a file. Parsing interprets that content, identifying structure, extracting entities, and normalising format. Both are stages in a full document extraction pipeline.

How do you keep the extraction pipeline up to date as documents change? Implement a document hash check: re-process only documents whose content has changed since the last extraction. Store extraction results with document metadata so unchanged documents can be served from cache.
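
A minimal sketch of the hash check, using SHA-256 from the standard library and a plain dictionary as the hash store. The function names and the in-memory store are assumptions for illustration; a production pipeline would persist hashes in a database alongside the extraction results.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_reprocessing(doc_id, data, hash_store):
    """True if the document is new or its content changed since the last run."""
    return hash_store.get(doc_id) != content_hash(data)


def mark_processed(doc_id, data, hash_store):
    """Record the hash once extraction has succeeded."""
    hash_store[doc_id] = content_hash(data)
```

Recording the hash only after a successful extraction means a document that failed mid-pipeline is retried on the next run instead of being skipped as unchanged.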