Tensorlake Documentation

Overview

Cross-page header correction analyzes header patterns across an entire document and corrects their hierarchy. OCR engines frequently misidentify header depth: a subsection labeled “2.2” might be emitted at the same level as its parent section (##) instead of nested beneath it (###). This feature resolves those inconsistencies and merges headers that are split across a page break into a single fragment. Each corrected section_header fragment includes a level attribute that reflects its depth in the document hierarchy (0 for #, 1 for ##, 2 for ###, etc.).

Enabling Header Correction

Set cross_page_header_detection=True in your ParsingOptions:

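from tensorlake.documentai import DocumentAI, ParsingOptions

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_CLOUD_API_KEY")

# Upload the document and keep the returned file ID.
file_id = doc_ai.upload(path="document.pdf")

# Enable cross-page header correction for this parse job.
parsing_options = ParsingOptions(
    cross_page_header_detection=True,
)

# Start the parse job with the options above.
parse_id = doc_ai.read(
    file_id=file_id,
    parsing_options=parsing_options,
)

# Wait for the job to finish and fetch the result.
result = doc_ai.wait_for_completion(parse_id)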

How It Works

When enabled, the pipeline:

1. Parses all pages and collects every section_header fragment across the document
2. Analyzes numbering patterns (e.g. 1., 1.1., 1.1.1.) and visual structure to infer correct depth
3. Assigns accurate level values to each header, overriding what the OCR engine reported
4. Detects headers that span page breaks and merges them into a single fragment

For example, a document with an incorrectly leveled subsection:

# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
## 2.2. Statistical Analysis  ← Wrong level (should be ###)
## 3. Results

becomes:

# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
### 2.2. Statistical Analysis  ← Corrected
## 3. Results
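
The numbering-based part of this correction can be illustrated with a short sketch. This is not Tensorlake's implementation, which also considers visual structure; `infer_level_from_numbering` is a hypothetical helper that only counts the components of a leading section number, and it assumes an unnumbered header is the document title (level 0).

```python
import re

# Hypothetical pattern: a leading section number such as "2", "2.2." or "1.1.1."
# followed by whitespace. Titles that merely start with a number (e.g. a year)
# would be misread, so treat this as an illustration only.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def infer_level_from_numbering(header_text, default=0):
    """Return a 0-based level guessed from the numbering prefix.

    "1. Introduction" has one numeric component and maps to level 1,
    "2.2. Statistical Analysis" has two and maps to level 2.
    Headers without a numbering prefix fall back to `default`.
    """
    match = _NUMBERING.match(header_text)
    if match is None:
        return default
    return len(match.group(1).split("."))


headers = [
    "Effectiveness of ω-3 Polyunsaturated Fatty Acids...",
    "1. Introduction",
    "2. Materials and Methods",
    "2.1. Subjects",
    "2.2. Statistical Analysis",
    "3. Results",
]
levels = [infer_level_from_numbering(h) for h in headers]
for level, text in zip(levels, headers):
    print(level, text)
```

Under these assumptions "2.2. Statistical Analysis" receives the same level as "2.1. Subjects", which matches the corrected output above.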

Fragment Output

Corrected headers are returned as section_header page fragments. Each fragment includes:

| Field | Description |
| --- | --- |
| `fragment_type` | Always `"section_header"` for header fragments |
| `content.level` | Integer representing header depth (0 = `#`, 1 = `##`, 2 = `###`, etc.) |
| `content.content` | Clean header text without markdown formatting |
| `reading_order` | Position of this fragment in reading order relative to other page fragments |
| `bbox` | Bounding box coordinates `(x1, y1, x2, y2)` in page pixels |

Example fragment

{
  "fragment_type": "section_header",
  "content": {
    "level": 2,
    "content": "2.2. Statistical Analysis"
  },
  "reading_order": 5,
  "bbox": {
    "x1": 72,
    "y1": 310,
    "x2": 540,
    "y2": 328
  }
}
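
Because `content.content` carries no markdown formatting, a consumer that needs markdown has to rebuild the `#` prefix from `content.level`. The sketch below does this for the example fragment, treated as a plain JSON dictionary; `to_markdown_heading` is a hypothetical helper, not part of the Tensorlake SDK.

```python
import json

# The example fragment from above, as raw JSON.
fragment_json = """
{
  "fragment_type": "section_header",
  "content": {"level": 2, "content": "2.2. Statistical Analysis"},
  "reading_order": 5,
  "bbox": {"x1": 72, "y1": 310, "x2": 540, "y2": 328}
}
"""


def to_markdown_heading(fragment):
    """Rebuild a markdown heading: level 0 -> "#", level 1 -> "##", and so on."""
    level = fragment["content"]["level"]
    text = fragment["content"]["content"]
    return "#" * (level + 1) + " " + text


fragment = json.loads(fragment_json)
heading = to_markdown_heading(fragment)
print(heading)  # ### 2.2. Statistical Analysis
```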

Accessing Corrected Headers

for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            print(f"Level {fragment.content.level}: {fragment.content.content}")

Example output:

Level 0: Effectiveness of ω-3 Polyunsaturated Fatty Acids...
Level 1: 1. Introduction
Level 1: 2. Materials and Methods
Level 2: 2.1. Subjects
Level 2: 2.2. Statistical Analysis
Level 1: 3. Results

Building a Document Outline

Use the level attribute to render a nested outline of the document:

for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            indent = "  " * fragment.content.level
            print(f"{indent}• {fragment.content.content}")

Example output:

• Effectiveness of ω-3 Polyunsaturated Fatty Acids...
  • 1. Introduction
  • 2. Materials and Methods
    • 2.1. Subjects
    • 2.2. Statistical Analysis
  • 3. Results
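
When a nested structure is more useful than indented text, the same level values can drive a small tree builder. The following is a minimal sketch that works on plain `(level, title)` pairs in reading order; `OutlineNode` and `build_outline` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OutlineNode:
    title: str
    level: int
    children: list = field(default_factory=list)


def build_outline(headers):
    """Build a tree from (level, title) pairs given in reading order."""
    root = OutlineNode(title="<root>", level=-1)
    stack = [root]  # path from the root to the most recent header
    for level, title in headers:
        node = OutlineNode(title=title, level=level)
        # Pop until the top of the stack is strictly shallower than this header.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


headers = [
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]
outline = build_outline(headers)
```

With a real parse result, the pairs could be collected as `(fragment.content.level, fragment.content.content)` inside the loop shown above.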

Common Use Cases

RAG pipelines — accurate header boundaries improve chunking quality and context preservation for retrieval
Document outlines — build navigable tables of contents programmatically from any document
Knowledge graphs — construct accurate document trees with correct parent-child header relationships
Research paper processing — parse structured academic documents with multi-level section hierarchies
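
As an illustration of the RAG use case, the sketch below groups body text under the most recent header and records the full header path for each chunk. It operates on plain dictionaries rather than SDK objects. The `"text"` fragment type, the placeholder body strings, and the `chunk_by_header` helper are assumptions made for this example, so adapt the field access to the fragment types your documents actually produce.

```python
def chunk_by_header(fragments):
    """Group non-header fragments under the header path that precedes them."""
    path = []      # (level, title) pairs from the outermost header inward
    chunks = []
    current = None
    for frag in fragments:
        text = frag["content"]["content"]
        if frag["fragment_type"] == "section_header":
            level = frag["content"]["level"]
            # Drop headers at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, text))
            current = {"path": [title for _, title in path], "text": []}
            chunks.append(current)
        else:
            if current is None:
                # Body text that appears before any header gets an empty path.
                current = {"path": [], "text": []}
                chunks.append(current)
            current["text"].append(text)
    return chunks


def header(level, title):
    return {"fragment_type": "section_header", "content": {"level": level, "content": title}}


def body(text):
    # "text" is an assumed fragment type used only for this illustration.
    return {"fragment_type": "text", "content": {"content": text}}


fragments = [
    header(0, "Report"),
    header(1, "2. Materials and Methods"),
    header(2, "2.1. Subjects"),
    body("Placeholder body text for 2.1"),
    header(2, "2.2. Statistical Analysis"),
    body("Placeholder body text for 2.2"),
    header(1, "3. Results"),
    body("Placeholder body text for 3"),
]
chunks = chunk_by_header(fragments)
for chunk in chunks:
    print(" > ".join(chunk["path"]), "|", " ".join(chunk["text"]))
```

Storing the header path alongside each chunk lets a retriever show where a passage sits in the document, which depends on the corrected levels being accurate.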