Overview

PDFs are designed for printing, not data extraction. When a logical table spans multiple pages or is split across columns on a single page, most parsers output disconnected fragments — breaking the semantic integrity of the data and making it difficult for downstream LLMs and RAG pipelines to reason over. Tensorlake’s Agentic Table Merging reconstructs these fragments into a single coherent table by reasoning over content and context, not just geometry. Enable it with table_merging=True in your ParsingOptions.

Enabling Table Merging

Set table_merging=True in your ParsingOptions:
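from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import ParsingOptions

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_CLOUD_API_KEY")

# Upload the document and keep the returned file ID
file_id = doc_ai.upload(path="document.pdf")

# Turn on Agentic Table Merging for this parse
parsing_options = ParsingOptions(
    table_merging=True,
)

# Start the parse job; the call returns a parse ID
parse_id = doc_ai.read(
    file_id=file_id,
    parsing_options=parsing_options,
)

# Wait for the parse job to finish and fetch the result
result = doc_ai.wait_for_completion(parse_id)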

How It Works

Rather than relying on geometric position alone, an agent analyzes the content and context around each table fragment to decide whether it is a continuation of the previous one. For each candidate pair, the agent examines:
  • The end of the previous table fragment
  • The text in the gap between them (e.g. "Page 14 of 92", "(continued)", boilerplate disclaimers)
  • The start of the next table fragment
  • Whether column structures are compatible (same number of columns, matching or repeated headers)
This allows the agent to ignore irrelevant footer noise while correctly identifying continuation cues. Two merge scenarios are handled:
  • Cross-page merges — tables that continue across one or more page breaks, often with repeated or noisy headers and footers
  • Same-page merges — tables split into multiple columns on a single page (e.g. an alphabetical list split left/right) that logically belong together
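The structural signals in the list above (compatible column counts, repeated headers, ignorable gap text) can be approximated with plain rules. The following is a minimal sketch of that idea, not Tensorlake's implementation: the agent reasons over content and context, whereas this code only checks structure. The `Fragment` class, the noise patterns, and all function names are hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Fragment:
    """Hypothetical stand-in for one parsed table fragment."""
    page: int
    rows: list[list[str]]  # the first row is treated as the header

# Assumed examples of gap text that is page furniture, not content.
NOISE_PATTERNS = [
    re.compile(r"^page \d+ of \d+$", re.IGNORECASE),
    re.compile(r"^\(?continued\)?$", re.IGNORECASE),
]

def is_noise(line: str) -> bool:
    """True for blank lines and lines matching a known noise pattern."""
    text = line.strip()
    return not text or any(p.match(text) for p in NOISE_PATTERNS)

def same_column_count(prev: Fragment, nxt: Fragment) -> bool:
    """Compare the last row of the previous fragment with the first row of the next."""
    if not prev.rows or not nxt.rows:
        return False
    return len(prev.rows[-1]) == len(nxt.rows[0])

def header_repeated(prev: Fragment, nxt: Fragment) -> bool:
    """True when the next fragment starts with the same header row."""
    def norm(row: list[str]) -> list[str]:
        return [cell.strip().lower() for cell in row]
    return norm(prev.rows[0]) == norm(nxt.rows[0])

def is_merge_candidate(prev: Fragment, gap_lines: list[str], nxt: Fragment) -> bool:
    """Columns must line up and the gap must contain only noise."""
    return same_column_count(prev, nxt) and all(is_noise(line) for line in gap_lines)

def merge(prev: Fragment, nxt: Fragment) -> list[list[str]]:
    """Concatenate rows, dropping a repeated header from the next fragment."""
    body = nxt.rows[1:] if header_repeated(prev, nxt) else nxt.rows
    return prev.rows + body
```

A rule-based check like this breaks down as soon as the gap contains text it has never seen, which is the case the content-aware agent is meant to handle.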

Output

When table merging is enabled, the parse result includes a merged_tables array. Each entry in the array represents a reconstructed table:
Field               Description
merged_table_id     Unique identifier for the merged table (e.g. cross_page_merge_1_3)
merged_table_html   Full HTML representation of the unified table
start_page          Page number where the first fragment was found
end_page            Page number where the last fragment was found
pages_merged        Number of pages spanned by the merged table
summary             Human-readable summary of the merged table’s content
merge_actions       Details on the pages involved and target column count
merged_at           ISO 8601 timestamp of when the merge was performed
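For downstream code it can help to load each entry into a typed structure. The sketch below assumes an entry is available as a plain Python dict with the fields listed above; the `MergedTable` class and `parse_merged_table` function are hypothetical names, not part of the Tensorlake SDK. It treats `target_columns` and `merged_at` as optional, since an entry may carry a null column count or omit the timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

@dataclass
class MergedTable:
    """Hypothetical typed view of one merged_tables entry."""
    merged_table_id: str
    merged_table_html: str
    start_page: int
    end_page: int
    pages_merged: int
    summary: str
    pages: list[int]
    target_columns: Optional[int]
    merged_at: Optional[datetime]

def parse_merged_table(entry: dict[str, Any]) -> MergedTable:
    """Build a MergedTable from a dict shaped like the documented entry."""
    actions = entry.get("merge_actions") or {}
    raw_timestamp = entry.get("merged_at")
    return MergedTable(
        merged_table_id=entry["merged_table_id"],
        merged_table_html=entry["merged_table_html"],
        start_page=entry["start_page"],
        end_page=entry["end_page"],
        pages_merged=entry["pages_merged"],
        summary=entry["summary"],
        pages=list(actions.get("pages", [])),
        target_columns=actions.get("target_columns"),
        # ISO 8601 string with a UTC offset, parsed into an aware datetime
        merged_at=datetime.fromisoformat(raw_timestamp) if raw_timestamp else None,
    )
```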

Example: cross-page merge

A financial table spanning three pages is merged into a single entry:
{
  "merged_table_id": "cross_page_merge_1_3",
  "merged_table_html": "<table>...</table>",
  "start_page": 1,
  "end_page": 3,
  "pages_merged": 3,
  "summary": "Financial results for the quarter and nine months ended September 30, 2025...",
  "merge_actions": {
    "pages": [1, 2, 3],
    "target_columns": 10
  },
  "merged_at": "2026-01-10T03:12:10.785866+00:00"
}

Example: same-page column merge

A holdings table split into two columns on one page is unified into a single continuous structure:
{
  "merged_table_id": "same_page_merge_2_3",
  "merged_table_html": "<table>...</table>",
  "start_page": 2,
  "end_page": 2,
  "pages_merged": 1,
  "summary": "Both tables share the same column structure (Security, Shares, Value) and represent a continuous alphabetical list of stock holdings...",
  "merge_actions": {
    "pages": [2],
    "target_columns": null
  }
}
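Because merged_table_html holds the unified table as HTML, a common next step is turning it into rows of cell text. The sketch below uses only the Python standard library and assumes a simple table built from `tr`, `th`, and `td` elements; it ignores `colspan`, `rowspan`, and nested tables. `TableRows` and `html_to_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collects the text of each th/td cell, grouped by tr."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_to_rows(html: str) -> list[list[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For tables with merged cells or nested markup, a full HTML library is likely a better fit than this hand-rolled parser.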

Common Use Cases

  • Financial documents — reconstruct multi-page income statements, balance sheets, and loan tables for accurate numeric reasoning
  • Research papers — unify results tables that span pages so LLMs can compare rows and compute aggregates
  • Portfolio and fund reports — merge holdings tables split across columns for reliable sector aggregation and exposure calculations
  • RAG pipelines — produce coherent table chunks that improve retrieval quality and reduce hallucinations on questions that depend on full table context
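For the RAG use case, one merged table can be turned into a single retrievable chunk that carries its summary and page range. This is a rough sketch under stated assumptions: the entry is a dict shaped like the Output examples, the rows have already been extracted from merged_table_html, and `build_table_chunk` and the chunk layout are hypothetical choices rather than a Tensorlake format.

```python
def build_table_chunk(entry: dict, rows: list[list[str]]) -> dict:
    """Combine summary, page range, and rows into one text chunk with metadata."""
    start, end = entry["start_page"], entry["end_page"]
    pages = f"page {start}" if start == end else f"pages {start}-{end}"

    lines = [f"Table {entry['merged_table_id']} ({pages})"]
    if entry.get("summary"):
        lines.append(entry["summary"])
    # One line per row, cells separated by a pipe
    lines.extend(" | ".join(row) for row in rows)

    return {
        "text": "\n".join(lines),
        "metadata": {
            "merged_table_id": entry["merged_table_id"],
            "start_page": start,
            "end_page": end,
        },
    }
```

Keeping the whole table in one chunk trades chunk size for completeness; very large tables may still need to be split by row groups with the header repeated.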