Overview
Cross-page header correction analyzes header patterns across an entire document and corrects their hierarchy. OCR engines frequently misidentify header depth: a subsection labeled “2.2” might be emitted one level too high (##) instead of as a nested header (###). This feature resolves those inconsistencies and also merges headers that are split across page breaks, so each one is returned as a single fragment.
Each corrected section_header fragment includes a level attribute that accurately reflects its depth in the document hierarchy (0 for #, 1 for ##, 2 for ###, etc.).
Enabling Header Correction
Set cross_page_header_detection=True in your ParsingOptions.
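As a rough illustration of the option, the sketch below constructs a ParsingOptions object with the flag turned on. The SDK's import path and the call that accepts these options are not shown on this page, so the dataclass here is a stand-in defined only to keep the example runnable. Its default of False is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ParsingOptions:
    """Stand-in for the SDK's ParsingOptions (hypothetical definition).

    Only the option described on this page is modeled.
    """
    cross_page_header_detection: bool = False  # assumed default: disabled


# Opt in to cross-page header correction.
options = ParsingOptions(cross_page_header_detection=True)
print(options.cross_page_header_detection)
```

In real use, import ParsingOptions from the SDK instead of defining it, and pass the instance to whichever parse call your client exposes.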
How It Works
When enabled, the pipeline:
- Parses all pages and collects every section_header fragment across the document
- Analyzes numbering patterns (e.g. 1., 1.1., 1.1.1.) and visual structure to infer correct depth
- Assigns accurate level values to each header, overriding what the OCR engine reported
- Detects headers that span page breaks and merges them into a single fragment
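To make the numbering-pattern step above concrete, here is a minimal sketch of inferring depth from a dotted prefix. It is not the pipeline's actual algorithm, which also uses visual structure. The function name infer_level and the base_level parameter are hypothetical.

```python
import re

# Leading dotted number such as "1.", "1.1." or "2.2", followed by whitespace.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")


def infer_level(header_text, base_level=0):
    """Return a 0-based depth inferred from the numbering prefix.

    Returns None when the header has no numeric prefix. base_level shifts
    the result, e.g. base_level=1 when an unnumbered document title
    already occupies level 0.
    """
    match = NUMBERING.match(header_text)
    if match is None:
        return None
    # "1" has no dots (depth 0), "1.1" has one dot (depth 1), and so on.
    return base_level + match.group(1).count(".")


for text in ["1. Introduction", "1.1. Background", "1.1.1. Details", "Abstract"]:
    print(text, "->", infer_level(text))
```

With base_level=1, a header such as "2.2 Data Collection" maps to level 2 (###), which matches the example in the overview. A purely numeric first word, such as a year, would be misread as a top-level number by this sketch, which is one reason a real implementation needs more signals than the prefix alone.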
Fragment Output
Corrected headers are returned as section_header page fragments. Each fragment includes:
| Field | Description |
|---|---|
| fragment_type | Always "section_header" for header fragments |
| content.level | Integer representing header depth (0 = #, 1 = ##, 2 = ###, etc.) |
| content.content | Clean header text without markdown formatting |
| reading_order | Position of this fragment in reading order relative to other page fragments |
| bbox | Bounding box coordinates (x1, y1, x2, y2) in page pixels |
Example fragment
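The sketch below shows what a corrected header fragment might look like, written as a Python dictionary using the fields from the table above. All values are illustrative rather than real output. The nesting of level and content under a content key is inferred from the dotted field names, and the list form of bbox is an assumption.

```python
import json

# Illustrative values only; not produced by a real parse.
example_fragment = {
    "fragment_type": "section_header",
    "content": {
        "level": 2,                        # 2 corresponds to ###
        "content": "2.2 Data Collection",  # header text without markdown
    },
    "reading_order": 4,
    "bbox": [72, 310, 540, 338],  # assumed layout: [x1, y1, x2, y2] in page pixels
}

print(json.dumps(example_fragment, indent=2))
```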
Accessing Corrected Headers
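One way to pull the corrected headers out of a page is to filter its fragments by fragment_type and sort by reading_order. This is a minimal sketch that assumes the fragments are available as plain dictionaries with the fields listed under Fragment Output. The helper name corrected_headers and the "text" fragment type in the sample data are hypothetical.

```python
def corrected_headers(page_fragments):
    """Return (level, text) pairs for one page's headers, in reading order."""
    headers = [
        fragment for fragment in page_fragments
        if fragment.get("fragment_type") == "section_header"
    ]
    headers.sort(key=lambda fragment: fragment["reading_order"])
    return [
        (fragment["content"]["level"], fragment["content"]["content"])
        for fragment in headers
    ]


# Sample data for illustration; only the fields used here are included.
page_fragments = [
    {"fragment_type": "section_header", "reading_order": 2,
     "content": {"level": 2, "content": "2.1 Sampling"}},
    {"fragment_type": "text", "reading_order": 1,
     "content": {"content": "Body text between the two headers."}},
    {"fragment_type": "section_header", "reading_order": 0,
     "content": {"level": 1, "content": "2. Methods"}},
]

for level, text in corrected_headers(page_fragments):
    print(level, text)
```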
Building a Document Outline
Use the level attribute to render a nested outline of the document.
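A minimal sketch of such an outline, assuming the headers have already been collected in document order as dictionaries with the content.level and content.content fields described above. The function name render_outline and the sample headers are hypothetical.

```python
def render_outline(header_fragments, indent="  "):
    """Render headers as an indented bullet list, one indent step per level."""
    lines = []
    for fragment in header_fragments:
        level = fragment["content"]["level"]
        text = fragment["content"]["content"]
        lines.append(f"{indent * level}- {text}")
    return "\n".join(lines)


# Sample headers for illustration, already in document order.
header_fragments = [
    {"fragment_type": "section_header",
     "content": {"level": 0, "content": "Annual Report"}},
    {"fragment_type": "section_header",
     "content": {"level": 1, "content": "1. Introduction"}},
    {"fragment_type": "section_header",
     "content": {"level": 2, "content": "1.1 Background"}},
    {"fragment_type": "section_header",
     "content": {"level": 1, "content": "2. Methods"}},
]

outline = render_outline(header_fragments)
print(outline)
```

Because indentation is derived directly from level, the outline is only as good as the corrected levels. Rendering from uncorrected OCR levels would flatten or misplace subsections.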
Common Use Cases
- RAG pipelines — accurate header boundaries improve chunking quality and context preservation for retrieval
- Document outlines — build navigable tables of contents programmatically from any document
- Knowledge graphs — construct accurate document trees with correct parent-child header relationships
- Research paper processing — parse structured academic documents with multi-level section hierarchies