Document Intelligence
April 22, 2026 · 9 min read

PDF and Document Parsing for AI Pipelines: Extracting Clean Text from Messy Real-World Files

Moneeb Abbas

AI Systems Architect

Every RAG system tutorial starts with clean, well-formatted text. Real client documents are PDFs with embedded images, scanned pages from a 1990s fax machine, tables that span multiple pages, headers and footers that repeat on every page, and text in three different columns. Getting clean text out of these files is not a solved problem — and poor parsing quality is the root cause of more RAG failures than bad retrieval strategies.

The Document Parsing Problem Space

Not all documents are the same, and the right parsing approach depends entirely on what you are dealing with:

  • Programmatic PDFs: Created by software (Word, LaTeX, PDF printers). Text is embedded as machine-readable characters. PyMuPDF or pdfplumber extract clean text reliably.
  • Scanned PDFs: Images of physical documents. No embedded text. Require OCR to extract any content at all.
  • Mixed PDFs: Partly programmatic, partly scanned — common in document-heavy industries where old files were digitized. Need to detect and handle both modes.
  • PDFs with complex layouts: Multi-column text, tables, forms, headers and footers. Even programmatic PDFs require layout-aware parsing to extract text in reading order.
  • Office documents (DOCX, XLSX, PPTX): Typically easier than PDFs, but contain their own structure quirks — nested tables, comments, tracked changes.
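
As a rough illustration of telling the first three PDF categories apart, the sketch below labels a file from the amount of text each page yields. The function name `classify_pdf` and the "empty" label are assumptions made for this example; the 50-character cutoff mirrors the heuristic used elsewhere in this post and should be tuned per corpus.

```python
from typing import Sequence


def classify_pdf(page_char_counts: Sequence[int], min_chars: int = 50) -> str:
    """Label a PDF as 'programmatic', 'scanned', 'mixed', or 'empty'.

    page_char_counts holds, for each page, the number of characters the
    text extractor returned after stripping whitespace.
    """
    if not page_char_counts:
        return "empty"

    # Pages with at least min_chars characters are treated as having embedded text
    text_pages = sum(1 for n in page_char_counts if n >= min_chars)

    if text_pages == len(page_char_counts):
        return "programmatic"
    if text_pages == 0:
        return "scanned"
    return "mixed"
```

With PyMuPDF, the input could be built as `[len(page.get_text().strip()) for page in doc]`. A "mixed" result is the signal to decide page by page rather than file by file.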

Parser Selection by Document Type

  • PyMuPDF (fitz): Among the fastest programmatic PDF parsers. Excellent text extraction, good metadata access, and it can open password-protected PDFs when given the password. First choice for programmatic PDFs.
  • pdfplumber: Better table extraction than PyMuPDF, slower. Use it when tables are the primary content.
  • Tesseract OCR: Open-source OCR, reliable for clean scanned documents. Accuracy degrades significantly on low-resolution or heavily skewed scans.
  • AWS Textract / Google Document AI: Cloud OCR with layout understanding — detects tables, forms, key-value pairs. Significantly more accurate than Tesseract on complex layouts, at a cost.
  • Unstructured.io: Open-source library that handles multiple document types (PDF, DOCX, HTML, images) with a unified API and built-in OCR routing. Good default for pipelines that process mixed file types.
  • LlamaParse: LlamaIndex's document parser, optimized for RAG use cases. Uses vision models for layout understanding. Best quality for complex multi-column and table-heavy documents.
```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image


def parse_pdf(path: str) -> str:
    pages = []

    # The context manager closes the file even if a page fails to parse
    with fitz.open(path) as doc:
        for page in doc:
            # Try programmatic text extraction first
            text = page.get_text("text")

            if len(text.strip()) < 50:
                # Page is likely scanned: render it at 300 DPI and fall back to OCR
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(img)

            pages.append(text)

    return "\n\n".join(pages)
```
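
The selection guidance above could be captured in a small routing function. This is only a sketch: `choose_parser`, its parameters, and the returned labels are hypothetical names for this example, and the branching simply restates the recommendations in the list rather than any library's behavior.

```python
from pathlib import Path


def choose_parser(
    path: str,
    pdf_kind: str = "programmatic",  # assumed labels: programmatic, scanned, mixed
    table_heavy: bool = False,
    complex_layout: bool = False,
) -> str:
    """Return a label naming the parser to try first (labels are illustrative)."""
    suffix = Path(path).suffix.lower()

    # Non-PDF inputs (DOCX, HTML, images, ...) go to the unified multi-format library
    if suffix != ".pdf":
        return "unstructured"

    # Multi-column or otherwise complex layouts go to the vision-based parser
    if complex_layout:
        return "llamaparse"

    # Scanned or partly scanned files need OCR; cloud OCR when tables matter
    if pdf_kind in {"scanned", "mixed"}:
        return "textract" if table_heavy else "pymupdf+tesseract"

    # Programmatic PDFs: pdfplumber for tables, PyMuPDF otherwise
    return "pdfplumber" if table_heavy else "pymupdf"
```

Keeping the decision in one function makes it easy to log which parser handled each file and to adjust the rules as new document types show up.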

The Preprocessing Steps That Actually Matter

Raw extracted text is rarely ready for a RAG pipeline. These preprocessing steps have the highest impact on downstream retrieval quality:

  • Header and footer removal: Most PDFs repeat document title, page number, and section headings on every page. These repeat in every chunk and confuse retrieval. Detect repeating lines across pages and strip them.
  • Whitespace normalization: OCR output and PDF text extraction both produce inconsistent whitespace — multiple spaces, inconsistent line breaks, hyphenated words split across lines. Normalize before chunking.
  • Table extraction to structured format: Tables parsed as raw text lose their structure. Extract tables as markdown or CSV and store them separately, or use a layout-aware parser that preserves tabular structure.
  • Ligature and encoding fixes: PDFs frequently encode ligatures (fi, fl, ff) as single special characters, such as U+FB01 for "fi", which can show up as '?' or garbage after extraction. Map the common ligature characters back to their plain-letter equivalents.
  • Metadata extraction: Document title, author, creation date, section headings. Include these as chunk metadata — they significantly improve retrieval when users ask about specific documents or date ranges.
```python
import re

# Unicode ligature code points mapped to their plain-letter equivalents
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def preprocess_extracted_text(text: str) -> str:
    # Fix common PDF ligature encodings
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)

    # Rejoin hyphenated line breaks (common in justified PDF text)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Remove page numbers (lines that are only a number), before collapsing
    # newlines so the blank lines left behind get cleaned up too
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)

    # Normalize whitespace
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs → single space
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines → double newline

    return text.strip()
```
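
Header and footer removal is not covered by the function above, so here is a minimal sketch of one way to do it, assuming the text is still split per page. The name `strip_repeating_lines` and the `min_fraction` and `edge_lines` defaults are assumptions for illustration. Digits are masked during comparison so that "Page 1" and "Page 2" count as the same line.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Mask digits so changing page numbers still compare as equal
    return re.sub(r"\d+", "#", line.strip())


def strip_repeating_lines(
    pages: list[str], min_fraction: float = 0.6, edge_lines: int = 3
) -> list[str]:
    """Drop lines that recur near the top or bottom of most pages."""
    counts: Counter = Counter()
    for page in pages:
        non_empty = [ln for ln in page.splitlines() if ln.strip()]
        # Candidate headers/footers: the first and last few non-empty lines.
        # A set ensures each candidate is counted once per page.
        candidates = {_key(ln) for ln in non_empty[:edge_lines] + non_empty[-edge_lines:]}
        counts.update(candidates)

    # A line must appear on at least two pages and on min_fraction of all pages
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    # Simplification: a matching line is removed wherever it occurs on the page
    return [
        "\n".join(ln for ln in page.splitlines() if _key(ln) not in repeated)
        for page in pages
    ]
```

Running this before joining pages keeps repeated titles and page counters out of every chunk. On short documents the threshold can misfire, so it is worth checking the removed lines on a sample first.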

Detecting and Handling Scanned Pages

Routing every page through OCR is slow and expensive. Detecting which pages need OCR and which do not is a worthwhile optimization for any pipeline processing large document volumes:

  • Text density heuristic: if a page has fewer than 50 characters of extracted programmatic text, treat it as scanned.
  • Image-to-text ratio: if the page contains large image regions relative to its total area, it is likely partially or fully scanned.
  • Pre-OCR image enhancement: for low-quality scans, apply deskewing (correct page rotation), denoising, and contrast enhancement before passing to OCR. These steps can increase Tesseract accuracy by 20–40% on poor scans.
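
The first two heuristics can be combined into a single per-page check. The sketch below is one possible formulation: `needs_ocr` is a hypothetical helper, and the 0.5 image-coverage cutoff is an assumed starting point rather than a measured value.

```python
def needs_ocr(
    text: str,
    image_area: float,
    page_area: float,
    min_chars: int = 50,
    max_image_ratio: float = 0.5,  # assumed cutoff; tune on your own documents
) -> bool:
    """Decide whether a page should be routed to OCR."""
    # Text density heuristic: almost no embedded text means the page is scanned
    if len(text.strip()) < min_chars:
        return True

    # Guard against degenerate pages with no measurable area
    if page_area <= 0:
        return False

    # Image-to-page ratio: large image regions suggest scanned content
    return image_area / page_area > max_image_ratio
```

With PyMuPDF, the inputs might come from `page.get_text()`, the bounding boxes reported by `page.get_image_info()`, and `page.rect`. Overlapping images can inflate the summed area, so treat the ratio as a rough signal.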

Table Extraction for RAG

Tables in PDFs are the hardest element to extract cleanly and the most important to get right for many enterprise use cases — financial statements, contract schedules, data appendices.

  • pdfplumber: best open-source option for table detection and extraction from programmatic PDFs. Returns tables as lists of rows.
  • AWS Textract AnalyzeDocument: handles both scanned and programmatic tables. Returns a structured table representation with cell positions.
  • Convert tables to Markdown for chunking: a table converted to Markdown preserves row/column relationships in plain text that the LLM can reason about. A table extracted as raw text with whitespace alignment loses all structure.
  • Index table rows as individual chunks for large tables: a 200-row financial table should not be a single chunk. Each row or group of rows with its column headers becomes a chunk.
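
The last two points can be sketched together: convert rows to Markdown and split a large table into chunks that each repeat the column headers. The helper names and the 20-row default are assumptions for this example. The input shape, a list of rows where empty cells may be `None`, matches what pdfplumber's `extract_tables()` returns.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def _cell(value: Optional[str]) -> str:
    # Empty cells may arrive as None; pipes and newlines would break the Markdown row
    text = "" if value is None else str(value)
    return text.replace("\n", " ").replace("|", "\\|").strip()


def table_to_markdown(header: Row, rows: Sequence[Row]) -> str:
    """Render a header and body rows as a Markdown table."""
    lines = [
        "| " + " | ".join(_cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(_cell(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def chunk_table(table: Sequence[Row], rows_per_chunk: int = 20) -> list[str]:
    """Split a table (first row is the header) into Markdown chunks.

    Every chunk repeats the header so it stays interpretable on its own.
    """
    if not table:
        return []
    header, body = table[0], table[1:]
    if not body:
        return [table_to_markdown(header, [])]
    return [
        table_to_markdown(header, body[i : i + rows_per_chunk])
        for i in range(0, len(body), rows_per_chunk)
    ]
```

Repeating the header costs a few tokens per chunk but means a retrieved row group still carries its column names when it reaches the LLM.
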
Tip: Invest disproportionately in parsing quality for your specific document types. A 10% improvement in parsing quality can translate into a comparable improvement in retrieval quality, and parsing runs once per document while retrieval runs on every query.
