Overview

Cross-page header correction analyzes header patterns across an entire document and corrects their hierarchy. OCR engines frequently misidentify header depth: a subsection labeled “2.2” might be emitted one level too shallow (##) instead of nested under its parent section (###). This feature resolves those inconsistencies and merges headers that are split across page breaks into a single fragment. Each corrected section_header fragment includes a level attribute that reflects its depth in the document hierarchy (0 for #, 1 for ##, 2 for ###, etc.).

Enabling Header Correction

Set cross_page_header_detection=True in your ParsingOptions:
from tensorlake.documentai import DocumentAI, ParsingOptions

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_CLOUD_API_KEY")

file_id = doc_ai.upload(path="document.pdf")

parsing_options = ParsingOptions(
    cross_page_header_detection=True,
)

parse_id = doc_ai.read(
    file_id=file_id,
    parsing_options=parsing_options,
)

result = doc_ai.wait_for_completion(parse_id)

How It Works

When enabled, the pipeline:
  1. Parses all pages and collects every section_header fragment across the document
  2. Analyzes numbering patterns (e.g. 1., 1.1., 1.1.1.) and visual structure to infer correct depth
  3. Assigns accurate level values to each header, overriding what the OCR engine reported
  4. Detects headers that span page breaks and merges them into a single fragment
For example, a document with an incorrectly leveled subsection:
# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
## 2.2. Statistical Analysis  ← Wrong level (should be ###)
## 3. Results
becomes:
# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
### 2.2. Statistical Analysis  ← Corrected
## 3. Results
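
To make the numbering step concrete, here is a minimal sketch of how depth could be read off a numbering prefix. It is an illustration only, not Tensorlake's implementation: the helper name infer_level_from_numbering and its regex are hypothetical, and it ignores the visual cues the pipeline also uses.

```python
import re

# Hypothetical helper, not part of the Tensorlake SDK. It looks only at the
# numbering prefix ("1.", "2.2.", "1.1.1.") and ignores visual structure.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")


def infer_level_from_numbering(header_text):
    """Return 1 for "1.", 2 for "1.1.", and so on; None if there is no prefix."""
    match = _NUMBERING.match(header_text)
    if match is None:
        return None
    # "2.2" has two numeric components, so it maps to level 2 (###).
    return len(match.group(1).split("."))


headers = [
    "Effectiveness of ω-3 Polyunsaturated Fatty Acids...",
    "1. Introduction",
    "2.1. Subjects",
    "2.2. Statistical Analysis",
]
for text in headers:
    print(infer_level_from_numbering(text), text)
```

A prefix-only rule like this misreads headers that merely start with a number (for example a year), which is one reason a real pipeline would combine it with other signals.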

Fragment Output

Corrected headers are returned as section_header page fragments. Each fragment includes:
| Field | Description |
| --- | --- |
| fragment_type | Always "section_header" for header fragments |
| content.level | Integer representing header depth (0 = #, 1 = ##, 2 = ###, etc.) |
| content.content | Clean header text without markdown formatting |
| reading_order | Position of this fragment in reading order relative to other page fragments |
| bbox | Bounding box coordinates (x1, y1, x2, y2) in page pixels |

Example fragment

{
  "fragment_type": "section_header",
  "content": {
    "level": 2,
    "content": "2.2. Statistical Analysis"
  },
  "reading_order": 5,
  "bbox": {
    "x1": 72,
    "y1": 310,
    "x2": 540,
    "y2": 328
  }
}
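
As a small illustration of the level mapping, the sketch below turns a fragment like the one above back into a markdown heading and computes the bounding box size. It works on the plain JSON shape shown here; the helper names to_markdown_heading and bbox_size are hypothetical and not part of the SDK.

```python
import json


def to_markdown_heading(fragment):
    """Render a section_header fragment as a markdown heading line."""
    level = fragment["content"]["level"]
    text = fragment["content"]["content"]
    # Level 0 is "#", level 1 is "##", and so on: one more "#" than the level.
    return "#" * (level + 1) + " " + text


def bbox_size(fragment):
    """Return (width, height) of the bounding box in page pixels."""
    box = fragment["bbox"]
    return box["x2"] - box["x1"], box["y2"] - box["y1"]


fragment = json.loads("""
{
  "fragment_type": "section_header",
  "content": {"level": 2, "content": "2.2. Statistical Analysis"},
  "reading_order": 5,
  "bbox": {"x1": 72, "y1": 310, "x2": 540, "y2": 328}
}
""")

print(to_markdown_heading(fragment))  # ### 2.2. Statistical Analysis
print(bbox_size(fragment))            # (468, 18)
```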

Accessing Corrected Headers

for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            print(f"Level {fragment.content.level}: {fragment.content.content}")
Example output:
Level 0: Effectiveness of ω-3 Polyunsaturated Fatty Acids...
Level 1: 1. Introduction
Level 1: 2. Materials and Methods
Level 2: 2.1. Subjects
Level 2: 2.2. Statistical Analysis
Level 1: 3. Results

Building a Document Outline

Use the level attribute to render a nested outline of the document:
for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            # Two spaces of indentation per level, then a bullet
            indent = "  " * fragment.content.level
            print(f"{indent}• {fragment.content.content}")
Example output:
• Effectiveness of ω-3 Polyunsaturated Fatty Acids...
  • 1. Introduction
  • 2. Materials and Methods
    • 2.1. Subjects
    • 2.2. Statistical Analysis
  • 3. Results
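
When a nested structure is more useful than indented text, the same level values can be folded into a tree. The sketch below assumes the headers have already been collected as (level, text) pairs, for example from the loop above; build_outline_tree is a hypothetical helper, not an SDK function.

```python
def build_outline_tree(headers):
    """Nest (level, text) pairs into a list of dicts with "children" lists."""
    # Sentinel root sits above level 0, so it is never popped from the stack.
    root = {"level": -1, "text": None, "children": []}
    stack = [root]
    for level, text in headers:
        node = {"level": level, "text": text, "children": []}
        # Close every open header at the same depth or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


headers = [
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]
tree = build_outline_tree(headers)
methods = tree[0]["children"][1]
print([child["text"] for child in methods["children"]])
```

The stack holds the chain of currently open ancestors, so each header attaches to the nearest preceding header with a smaller level.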

Common Use Cases

  • RAG pipelines — accurate header boundaries improve chunking quality and context preservation for retrieval
  • Document outlines — build navigable tables of contents programmatically from any document
  • Knowledge graphs — construct accurate document trees with correct parent-child header relationships
  • Research paper processing — parse structured academic documents with multi-level section hierarchies
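
For the RAG use case, one possible approach is to tag each text chunk with the path of headers above it. The sketch below is an assumption-laden illustration: it takes simplified (kind, level, text) tuples rather than SDK objects, and the "header" and "text" kinds and the attach_header_context helper are hypothetical.

```python
def attach_header_context(items):
    """Pair each text item with the chain of headers it sits under."""
    open_headers = []  # (level, title) for each currently open ancestor
    chunks = []
    for kind, level, text in items:
        if kind == "header":
            # A new header closes any open header at the same depth or deeper.
            while open_headers and open_headers[-1][0] >= level:
                open_headers.pop()
            open_headers.append((level, text))
        else:
            path = " > ".join(title for _, title in open_headers)
            chunks.append({"path": path, "text": text})
    return chunks


items = [
    ("header", 1, "2. Materials and Methods"),
    ("header", 2, "2.1. Subjects"),
    ("text", None, "Participants were recruited..."),
    ("header", 2, "2.2. Statistical Analysis"),
    ("text", None, "Data were analyzed..."),
]
for chunk in attach_header_context(items):
    print(chunk["path"], "|", chunk["text"])
```

With corrected levels, the chunk under "2.2. Statistical Analysis" carries "2. Materials and Methods" as its parent instead of being treated as a sibling section.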