The Document Parsing Problem Space
Documents differ in how their text is stored, and the right parsing approach depends on which kind you are dealing with:
- Programmatic PDFs: Created by software (Word, LaTeX, PDF printers). Text is embedded as machine-readable characters. PyMuPDF or pdfplumber extract clean text reliably.
- Scanned PDFs: Images of physical documents. No embedded text. Require OCR to extract any content at all.
- Mixed PDFs: Partly programmatic, partly scanned — common in document-heavy industries where old files were digitized. Need to detect and handle both modes.
- PDFs with complex layouts: Multi-column text, tables, forms, headers and footers. Even programmatic PDFs require layout-aware parsing to extract text in reading order.
- Office documents (DOCX, XLSX, PPTX): Typically easier than PDFs, but contain their own structure quirks — nested tables, comments, tracked changes.
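As a rough illustration of the first three categories, a PDF can be labeled from how much embedded text each page carries. This is a minimal sketch, not a definitive detector: `classify_document` and the 50-character `min_chars` threshold are assumptions made for the example, and the per-page counts are assumed to come from something like `len(page.get_text("text").strip())` in PyMuPDF.

```python
from typing import List


def classify_document(page_char_counts: List[int], min_chars: int = 50) -> str:
    """Label a PDF from the number of embedded-text characters on each page.

    `min_chars` is an assumed threshold, not a standard value; tune it on
    your own documents.
    """
    if not page_char_counts:
        return "empty"
    # A page with almost no embedded text is treated as a scanned image.
    scanned = [count < min_chars for count in page_char_counts]
    if all(scanned):
        return "scanned"
    if any(scanned):
        return "mixed"
    return "programmatic"
```

For example, `classify_document([1200, 0, 800])` labels the file as mixed, which signals that the pipeline needs per-page handling rather than one mode for the whole file.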
Parser Selection by Document Type
- PyMuPDF (fitz): Typically the fastest of the Python parsers listed here for programmatic PDFs. Excellent text extraction, good metadata access, and it can open password-protected PDFs when given the password. First choice for programmatic PDFs.
- pdfplumber: Better table extraction than PyMuPDF, slower. Use it when tables are the primary content.
- Tesseract OCR: Open-source OCR, reliable for clean scanned documents. Accuracy degrades significantly on low-resolution or heavily skewed scans.
- AWS Textract / Google Document AI: Cloud OCR with layout understanding — detects tables, forms, key-value pairs. Significantly more accurate than Tesseract on complex layouts, at a cost.
- Unstructured.io: Open-source library that handles multiple document types (PDF, DOCX, HTML, images) with a unified API and built-in OCR routing. Good default for pipelines that process mixed file types.
- LlamaParse: LlamaIndex's document parser, optimized for RAG use cases. Uses vision models for layout understanding. A strong option for complex multi-column and table-heavy documents.
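The selection guidance above can be condensed into a small routing function. The sketch below is illustrative only: `choose_parser` and the returned labels are hypothetical names standing in for the tools in the list, not library APIs, and the flags (`is_scanned`, `table_heavy`, `complex_layout`) are assumed to be supplied by earlier detection steps.

```python
from pathlib import Path

# Hypothetical routing table: the labels are illustrative, not library APIs.
PARSER_BY_EXTENSION = {
    ".docx": "unstructured",
    ".xlsx": "unstructured",
    ".pptx": "unstructured",
    ".html": "unstructured",
}


def choose_parser(path: str, is_scanned: bool = False,
                  table_heavy: bool = False,
                  complex_layout: bool = False) -> str:
    """Pick a parser label from the file type and a few document traits."""
    ext = Path(path).suffix.lower()
    if ext != ".pdf":
        return PARSER_BY_EXTENSION.get(ext, "unsupported")
    if complex_layout:
        # Layout-aware services cover both scanned and programmatic pages.
        return "cloud_ocr_or_llamaparse"
    if is_scanned:
        return "tesseract"
    if table_heavy:
        return "pdfplumber"
    return "pymupdf"
```

Checking `complex_layout` first reflects the list's point that layout-aware services are the more accurate choice on complex pages regardless of whether they are scanned; a cost-sensitive pipeline might order the checks differently.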
The page-level fallback below tries embedded text first and runs OCR only on pages that have almost none:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image


def parse_pdf(path: str) -> str:
    pages = []
    # The context manager closes the file even if extraction fails
    with fitz.open(path) as doc:
        for page in doc:
            # Try programmatic text extraction first
            text = page.get_text("text")
            if len(text.strip()) < 50:
                # Page is likely scanned: render it at 300 DPI and fall back to OCR
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(img)
            pages.append(text)
    return "\n\n".join(pages)
```

The Preprocessing Steps That Actually Matter
Raw extracted text is rarely ready for a RAG pipeline. These preprocessing steps have the highest impact on downstream retrieval quality:
- Header and footer removal: Most PDFs repeat document title, page number, and section headings on every page. These repeat in every chunk and confuse retrieval. Detect repeating lines across pages and strip them.
- Whitespace normalization: OCR output and PDF text extraction both produce inconsistent whitespace — multiple spaces, inconsistent line breaks, hyphenated words split across lines. Normalize before chunking.
- Table extraction to structured format: Tables parsed as raw text lose their structure. Extract tables as markdown or CSV and store them separately, or use a layout-aware parser that preserves tabular structure.
- Ligature and encoding fixes: PDFs frequently encode ligatures such as ﬁ, ﬂ, and ﬀ as single special characters (Unicode U+FB01, U+FB02, U+FB00) that appear as '?' or garbage after extraction and break keyword matching. Map common ligatures back to their ASCII equivalents.
- Metadata extraction: Document title, author, creation date, section headings. Include these as chunk metadata — they significantly improve retrieval when users ask about specific documents or date ranges.
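The header and footer step in the list above can be sketched as a frequency count over the first and last lines of each page. This is one possible heuristic, not a definitive implementation: `strip_repeated_lines`, the `edge_lines` window, and the `min_fraction` threshold are all assumptions chosen for the example. Digits are collapsed before comparison so that "Page 3" and "Page 4" count as the same line.

```python
import math
import re
from collections import Counter
from typing import List


def _normalize(line: str) -> str:
    # Collapse digits so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: List[str], edge_lines: int = 2,
                         min_fraction: float = 0.6) -> List[str]:
    """Drop lines that recur at the top or bottom of most pages."""
    split_pages = [page.splitlines() for page in pages]
    counts = Counter()
    for lines in split_pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        # Count each normalized line at most once per page.
        counts.update({_normalize(l) for l in edge if l.strip()})

    # A line must appear on at least two pages to count as repeated.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and line.strip() and _normalize(line) in repeated:
                continue
            kept.append(line)
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

Restricting the check to page edges keeps the function from deleting a sentence that happens to repeat in body text. This step has to run per page, before pages are joined into one string.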
```python
import re

# Unicode ligature code points mapped to their ASCII expansions
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def preprocess_extracted_text(text: str) -> str:
    # Fix common PDF ligature encodings
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    # Rejoin words hyphenated across line breaks (common in justified PDF text)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Remove page numbers (lines that are only a number) before collapsing
    # blank lines, so the gaps they leave behind are cleaned up too
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Normalize whitespace
    text = re.sub(r"[ \t]+", " ", text)  # multiple spaces → single
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines → double newline
    return text.strip()
```

Detecting and Handling Scanned Pages
Routing every page through OCR is slow and expensive. Detecting which pages need OCR and which do not is a worthwhile optimization for any pipeline processing large document volumes:
- Text density heuristic: if a page has fewer than 50 characters of extracted programmatic text, treat it as scanned.
- Image-to-text ratio: if the page contains large image regions relative to its total area, it is likely partially or fully scanned.
- Pre-OCR image enhancement: for low-quality scans, apply deskewing (correct page rotation), denoising, and contrast enhancement before passing to OCR. These steps can increase Tesseract accuracy by 20–40% on poor scans.
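The first two heuristics above can be combined into a single routing check. The sketch below is a hedged illustration: `needs_ocr`, `rect_area`, and the 0.5 `max_image_ratio` threshold are assumptions for the example, and the image rectangles are assumed to come from whatever the PDF library reports (in PyMuPDF, for instance, the `bbox` entries returned by `page.get_image_info()`, with `page.rect` as the page rectangle).

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def rect_area(rect: Rect) -> float:
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def needs_ocr(text: str, page_rect: Rect, image_rects: List[Rect],
              min_chars: int = 50, max_image_ratio: float = 0.5) -> bool:
    """Decide whether a page should be sent to OCR.

    Both thresholds are assumed starting points, not established values.
    """
    # Text density heuristic: almost no embedded text means a scanned page.
    if len(text.strip()) < min_chars:
        return True
    page_area = rect_area(page_rect)
    if page_area == 0:
        return False
    # Image-to-page ratio: overlapping images are double-counted,
    # so the ratio is capped at 1.0.
    image_area = sum(rect_area(r) for r in image_rects)
    image_ratio = min(1.0, image_area / page_area)
    return image_ratio > max_image_ratio
```

The second check catches pages that carry a short embedded caption or stamp on top of a full-page scan, which the character count alone would misroute.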
Table Extraction for RAG
Tables in PDFs are the hardest element to extract cleanly and the most important to get right for many enterprise use cases — financial statements, contract schedules, data appendices.
- pdfplumber: best open-source option for table detection and extraction from programmatic PDFs. Returns tables as lists of rows.
- AWS Textract AnalyzeDocument: handles both scanned and programmatic tables. Returns a structured table representation with cell positions.
- Convert tables to Markdown for chunking: a table converted to Markdown preserves row/column relationships in plain text that the LLM can reason about. A table extracted as raw text with whitespace alignment loses all structure.
- Index table rows as individual chunks for large tables: a 200-row financial table should not be a single chunk. Each row or group of rows with its column headers becomes a chunk.
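The last two points can be sketched together: convert a table to Markdown, and split large tables into row groups that each repeat the column headers. This is a minimal sketch under stated assumptions: `table_to_markdown`, `chunk_table`, and the `rows_per_chunk` default are hypothetical names and values, the input is assumed to be a list of rows with the header first (the shape pdfplumber returns), and empty cells are assumed to arrive as `None`.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def _cell(value: Optional[str]) -> str:
    # Empty cells may arrive as None; pipes and newlines would break the row.
    text = "" if value is None else str(value)
    return text.replace("|", "\\|").replace("\n", " ").strip()


def table_to_markdown(header: Row, rows: Sequence[Row]) -> str:
    """Render a header and body rows as a Markdown table."""
    lines = [
        "| " + " | ".join(_cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(_cell(c) for c in row) + " |")
    return "\n".join(lines)


def chunk_table(table: Sequence[Row], rows_per_chunk: int = 20) -> List[str]:
    """Treat the first row as the header and repeat it in every chunk."""
    if not table:
        return []
    header, body = table[0], table[1:]
    return [
        table_to_markdown(header, body[i:i + rows_per_chunk])
        for i in range(0, len(body), rows_per_chunk)
    ]
```

With `rows_per_chunk=20`, a 200-row table becomes ten chunks, and each one is interpretable on its own because it carries the column headers. A table that contains only a header row produces no chunks in this sketch.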