
# Cross-page Header Correction

> Automatically detect and correct document header hierarchy across pages, even when OCR misidentifies header levels.

## Overview

Cross-page header correction analyzes header patterns across an entire document and corrects their hierarchy. OCR engines frequently misidentify header depth: a subsection labeled "2.2" might be emitted one level too high (`##`) instead of nested under its parent section (`###`). This feature resolves those inconsistencies and merges headers that are split across a page break, so they are returned as a single fragment rather than as separate pieces.

Each corrected `section_header` fragment includes a `level` attribute that accurately reflects its depth in the document hierarchy (0 for `#`, 1 for `##`, 2 for `###`, etc.).

## Enabling Header Correction

Set `cross_page_header_detection=True` in your `ParsingOptions`:

<CodeGroup>
  ```python Python SDK theme={null}
  from tensorlake.documentai import DocumentAI, ParsingOptions

  doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_CLOUD_API_KEY")

  file_id = doc_ai.upload(path="document.pdf")

  parsing_options = ParsingOptions(
      cross_page_header_detection=True,
  )

  parse_id = doc_ai.read(
      file_id=file_id,
      parsing_options=parsing_options,
  )

  result = doc_ai.wait_for_completion(parse_id)
  ```

  ```bash curl theme={null}
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/parse \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "file_id": "file_XXX",
      "parsing_options": {
        "cross_page_header_detection": true
      }
    }'
  ```
</CodeGroup>

## How It Works

When enabled, the pipeline:

1. Parses all pages and collects every `section_header` fragment across the document
2. Analyzes numbering patterns (e.g. `1.`, `1.1.`, `1.1.1.`) and visual structure to infer correct depth
3. Assigns accurate `level` values to each header, overriding what the OCR engine reported
4. Detects headers that span page breaks and merges them into a single fragment

For example, a document with an incorrectly leveled subsection:

```markdown theme={null}
# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
## 2.2. Statistical Analysis  ← Wrong level (should be ###)
## 3. Results
```

becomes:

```markdown theme={null}
# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
### 2.2. Statistical Analysis  ← Corrected
## 3. Results
```
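The numbering-pattern analysis in step 2 can be illustrated with a minimal sketch. This is not Tensorlake's implementation, which also weighs visual structure. It only shows how a depth can be read off a numeric prefix. The function name `level_from_numbering` and the `base_level` parameter are hypothetical.

```python
import re

# Matches a leading section number such as "2", "2.1", or "2.1.3",
# with an optional trailing dot, followed by whitespace.
_NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")


def level_from_numbering(text, base_level=1):
    """Infer a header level from its numeric prefix, or return None if unnumbered.

    base_level is the level assigned to single-component numbers like "1."
    (assumed to be 1 here, since level 0 is used for the document title).
    """
    match = _NUMBERING.match(text)
    if match is None:
        return None
    depth = len(match.group(1).split("."))
    return base_level + depth - 1


headers = [
    "Effectiveness of ω-3 Polyunsaturated Fatty Acids...",
    "1. Introduction",
    "2. Materials and Methods",
    "2.1. Subjects",
    "2.2. Statistical Analysis",
    "3. Results",
]
for header in headers:
    print(level_from_numbering(header), header)
```

Unnumbered headers return `None`, so a real pipeline would need another signal (font size, position, neighboring headers) to place them.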

## Fragment Output

Corrected headers are returned as `section_header` page fragments. Each fragment includes:

| Field             | Description                                                                 |
| ----------------- | --------------------------------------------------------------------------- |
| `fragment_type`   | Always `"section_header"` for header fragments                              |
| `content.level`   | Integer representing header depth (0 = `#`, 1 = `##`, 2 = `###`, etc.)      |
| `content.content` | Clean header text without markdown formatting                               |
| `reading_order`   | Position of this fragment in reading order relative to other page fragments |
| `bbox`            | Bounding box object with `x1`, `y1`, `x2`, `y2` coordinates in page pixels  |

### Example fragment

```json theme={null}
{
  "fragment_type": "section_header",
  "content": {
    "level": 2,
    "content": "2.2. Statistical Analysis"
  },
  "reading_order": 5,
  "bbox": {
    "x1": 72,
    "y1": 310,
    "x2": 540,
    "y2": 328
  }
}
```
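Because `level` is zero-based, turning a fragment back into a Markdown heading means adding one to get the number of `#` characters. The sketch below works on the raw JSON shape shown above. The helper name `fragment_to_markdown` is hypothetical, and the cap at six levels reflects Markdown's heading limit rather than anything the API enforces.

```python
import json


def fragment_to_markdown(fragment):
    """Render a section_header fragment (as a dict) as a Markdown heading line."""
    level = fragment["content"]["level"]
    text = fragment["content"]["content"]
    # level 0 -> "#", level 1 -> "##", ...; Markdown headings stop at six "#".
    return "#" * min(level + 1, 6) + " " + text


raw = """
{
  "fragment_type": "section_header",
  "content": {"level": 2, "content": "2.2. Statistical Analysis"},
  "reading_order": 5,
  "bbox": {"x1": 72, "y1": 310, "x2": 540, "y2": 328}
}
"""
fragment = json.loads(raw)
print(fragment_to_markdown(fragment))  # prints: ### 2.2. Statistical Analysis
```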

## Accessing Corrected Headers

```python theme={null}
# `result` is the value returned by doc_ai.wait_for_completion(parse_id) above.
for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            print(f"Level {fragment.content.level}: {fragment.content.content}")
```

Example output:

```
Level 0: Effectiveness of ω-3 Polyunsaturated Fatty Acids...
Level 1: 1. Introduction
Level 1: 2. Materials and Methods
Level 2: 2.1. Subjects
Level 2: 2.2. Statistical Analysis
Level 1: 3. Results
```

## Building a Document Outline

Use the `level` attribute to render a nested outline of the document:

```python theme={null}
for page in result.outputs.document.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "section_header":
            indent = "  " * fragment.content.level
            print(f"{indent}• {fragment.content.content}")
```

Example output:

```
• Effectiveness of ω-3 Polyunsaturated Fatty Acids...
  • 1. Introduction
  • 2. Materials and Methods
    • 2.1. Subjects
    • 2.2. Statistical Analysis
  • 3. Results
```
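If you need the outline as a data structure rather than printed text, the same `level` values can drive a stack-based tree builder. The following is a minimal sketch that takes plain `(level, text)` pairs, which you could collect from the loop above. The function name `build_outline` and the node shape are illustrative assumptions, not part of the SDK.

```python
def build_outline(headers):
    """Build a nested outline from (level, text) pairs given in reading order.

    Returns a list of root nodes; each node is a dict with
    "level", "text", and "children" keys.
    """
    roots = []
    stack = []  # path from a root node down to the most recently added header
    for level, text in headers:
        node = {"level": level, "text": text, "children": []}
        # Drop headers at the same or a deeper level; they cannot be this node's parent.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


headers = [
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]
outline = build_outline(headers)
methods = outline[0]["children"][1]
print([child["text"] for child in methods["children"]])
```

A header that skips a level (for example, level 2 directly under level 0) is attached to the nearest shallower header, so the tree stays valid even when the hierarchy has gaps.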

## Common Use Cases

* **RAG pipelines** — accurate header boundaries improve chunking quality and context preservation for retrieval
* **Document outlines** — build navigable tables of contents programmatically from any document
* **Knowledge graphs** — construct accurate document trees with correct parent-child header relationships
* **Research paper processing** — parse structured academic documents with multi-level section hierarchies
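For the RAG use case, one common pattern is to tag each body chunk with the chain of headers it sits under. The sketch below assumes a simplified, hypothetical fragment shape (dicts with `type`, `text`, and, for headers, `level`) rather than the SDK's actual objects, and the body strings are placeholders.

```python
def attach_header_paths(fragments):
    """Pair each non-header fragment with the list of headers above it.

    fragments: dicts in reading order with "type" and "text" keys;
    header fragments also carry an integer "level".
    """
    path = []  # (level, text) pairs for the currently open headers
    chunks = []
    for fragment in fragments:
        if fragment["type"] == "section_header":
            # Close any open headers at the same or a deeper level.
            while path and path[-1][0] >= fragment["level"]:
                path.pop()
            path.append((fragment["level"], fragment["text"]))
        else:
            chunks.append({
                "header_path": [text for _, text in path],
                "text": fragment["text"],
            })
    return chunks


fragments = [
    {"type": "section_header", "level": 1, "text": "2. Materials and Methods"},
    {"type": "section_header", "level": 2, "text": "2.1. Subjects"},
    {"type": "text", "text": "placeholder body A"},
    {"type": "section_header", "level": 2, "text": "2.2. Statistical Analysis"},
    {"type": "text", "text": "placeholder body B"},
]
for chunk in attach_header_paths(fragments):
    print(" > ".join(chunk["header_path"]), "|", chunk["text"])
```

With correct levels, the second chunk is filed under "2. Materials and Methods > 2.2. Statistical Analysis". If "2.2" had been left at the wrong level, it would have replaced its parent in the path instead.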

## Related

* [Parsing Overview](/document-ingestion/parsing/read)
* [Parse Output](/document-ingestion/parsing/parse-output)
* [Sample Notebook: Header Detection](https://tlake.link/notebooks/header-correction)
