Page Classification enables you to automatically categorize pages within documents based on their content and purpose. You can then extract structured data from only the relevant pages, which improves efficiency and accuracy in document processing workflows.

Overview

Page Classification works by analyzing each page of a document and assigning it to one or more predefined categories that you specify. This is particularly useful for:

  • Multi-section documents: Contracts, reports, or forms with different types of content on different pages
  • Selective data extraction: Only extracting structured data from specific page types (e.g., signature pages, form pages)
  • Document routing: Processing different page types with different workflows
  • Content organization: Understanding the structure and layout of complex documents

How Page Classification works

When you initiate a parse job, you can provide a list of page classification configurations as part of your request. Each page classification configuration consists of:

  • Name: A unique identifier for the page class
  • Description: A detailed description that guides the AI model in identifying pages that belong to this category

During parsing, Tensorlake analyzes each page and determines which of your defined classifications apply. If you have specified page classifications, the parse results will include, in addition to the rest of the output:

  • A list of page classifications with the specific page numbers that match each category
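
Because each configuration is identified by its name and guided by its description, it can help to check both before submitting a job. The following is a minimal sketch, not part of the Tensorlake SDK: `PageClassSpec` and `validate_page_classes` are hypothetical names introduced here as a stand-in for a name/description pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageClassSpec:
    """Hypothetical stand-in for one page classification (name + description)."""
    name: str
    description: str


def validate_page_classes(specs):
    """Return a list of problems: duplicate names or empty descriptions."""
    problems = []
    seen = set()
    for spec in specs:
        if spec.name in seen:
            problems.append(f"duplicate name: {spec.name}")
        seen.add(spec.name)
        if not spec.description.strip():
            problems.append(f"empty description: {spec.name}")
    return problems


# Illustrative input with one duplicate name and one empty description.
specs = [
    PageClassSpec("signature_page", "Pages containing signatures or signature blocks"),
    PageClassSpec("signature_page", "Pages with handwritten signatures"),
    PageClassSpec("terms_and_conditions", ""),
]
problems = validate_page_classes(specs)
print(problems)
```

An empty list from `validate_page_classes` means every name is unique and every description has content; it says nothing about how well the descriptions will guide classification.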

Basic Page Classification example

import time
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import ParseStatus
from tensorlake.documentai.models.options import (
    PageClassConfig
)

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

# Define page classifications
page_classifications = [
    PageClassConfig(
        name="signature_page",
        description="Pages containing signatures, signature lines, or signature blocks"
    ),
    PageClassConfig(
        name="terms_and_conditions",
        description="If the page has Terms and Conditions as a section header, classify as terms_and_conditions"
    )
]

parse_id = doc_ai.parse(
    file="https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Fake_Terms_Conditions.pdf",
    page_classifications=page_classifications
)
print(f"Parse job submitted with ID: {parse_id}")

# Get the result
result = doc_ai.get_parsed_result(parse_id=parse_id)

while result.status in [ParseStatus.PENDING, ParseStatus.PROCESSING]:
    print("waiting 5s...")
    time.sleep(5)
    result = doc_ai.get_parsed_result(parse_id=parse_id)
    print(f"Parse status: {result.status.name}")

if result.status == ParseStatus.FAILURE:
    print(f"Parse job {parse_id} failed. Exiting")
    raise SystemExit(1)

# Print each classification with the pages that matched it
for page_classification in result.page_classes:
    print(f"Classification: {page_classification.page_class}")
    print(f"Pages: {page_classification.page_numbers}")
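
The polling loop in the example waits indefinitely while the job is pending. One possible variation is to bound the wait with a timeout. The sketch below is generic and makes no network calls: `wait_until_done`, its parameters, and the fake status strings in the demo are assumptions introduced for illustration, not SDK features.

```python
import time


def wait_until_done(fetch, is_pending, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Call fetch() until is_pending(result) is False, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    result = fetch()
    while is_pending(result):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still pending after {timeout_s} seconds")
        sleep(interval_s)
        result = fetch()
    return result


# Demo with a fake job that finishes on the third poll.
statuses = iter(["pending", "processing", "success"])
final = wait_until_done(
    fetch=lambda: next(statuses),
    is_pending=lambda status: status in ("pending", "processing"),
    interval_s=0,
)
print(final)
```

With the client from the example above, `fetch` could be `lambda: doc_ai.get_parsed_result(parse_id=parse_id)` and `is_pending` could test `result.status in [ParseStatus.PENDING, ParseStatus.PROCESSING]`.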

Combining with Structured Extraction

You can combine page classification with structured data extraction to only extract data from specific page types. When you specify page classifications in your structured extraction options, Tensorlake will only extract structured data from pages that match those classifications.

Check out this Colab Notebook to see an example of combining Page Classification with Structured Extraction.
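
Conceptually, restricting extraction to certain page classes amounts to taking the union of the page numbers assigned to those classes. The sketch below illustrates that selection rule on the client side, assuming the result shape shown under Understanding the Results. `pages_matching` is a hypothetical helper and the sample page numbers are made up; Tensorlake performs the actual filtering as part of the parse job.

```python
def pages_matching(page_classes, wanted):
    """Return sorted, de-duplicated page numbers that belong to any class in `wanted`."""
    pages = set()
    for name in wanted:
        entry = page_classes.get(name)
        if entry is not None:  # classes with no matching pages may be absent
            pages.update(entry["page_numbers"])
    return sorted(pages)


# Illustrative classification results (page numbers are 1-indexed).
page_classes = {
    "terms_and_conditions": {"page_class": "terms_and_conditions", "page_numbers": [1, 3]},
    "signature_page": {"page_class": "signature_page", "page_numbers": [3, 4]},
}
selected = pages_matching(page_classes, ["signature_page", "terms_and_conditions"])
print(selected)
```

A page that matches more than one requested class, such as page 3 here, appears only once in the selection.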

Understanding the Results

When you use page classification, the parse results include a page_classes field with one entry per classification that matched:

{
  "parse_id": "parse_xxxxx",
  "status": "success",
  "page_classes": {
    "terms_and_conditions": {
      "page_class": "terms_and_conditions",
      "page_numbers": [
        1
      ]
    },
    "signature_page": {
      "page_class": "signature_page",
      "page_numbers": [
        2
      ]
    }
  },
  // ... other parse results
}

Each classification result includes:

  • page_class: The classification name you provided
  • page_numbers: An array of page numbers (1-indexed) that match this classification
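
The results are grouped by classification, but downstream code often needs the opposite view: which classes apply to a given page. A minimal sketch of that inversion follows, assuming the JSON shape shown above has already been loaded into a Python dict. `classes_by_page` is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict


def classes_by_page(page_classes):
    """Invert {class: {page_numbers: [...]}} into {page_number: [class, ...]}."""
    by_page = defaultdict(list)
    for entry in page_classes.values():
        for page in entry["page_numbers"]:
            by_page[page].append(entry["page_class"])
    # Sort pages and class names so the output is deterministic.
    return {page: sorted(names) for page, names in sorted(by_page.items())}


# The page_classes object from the sample response above.
page_classes = {
    "terms_and_conditions": {"page_class": "terms_and_conditions", "page_numbers": [1]},
    "signature_page": {"page_class": "signature_page", "page_numbers": [2]},
}
index = classes_by_page(page_classes)
print(index)  # {1: ['terms_and_conditions'], 2: ['signature_page']}
```

Pages that matched no classification do not appear in the index, so a lookup such as `index.get(5, [])` is a safe way to query an arbitrary page.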

Best Practices

Writing Effective Descriptions

The quality of your page classifications depends heavily on the descriptions you provide. Here are some tips:

Be Specific and Descriptive

# Good
PageClassConfig(
    name="financial_summary",
    description="Pages containing financial summaries, balance sheets, income statements, or tables with monetary values and financial metrics"
)

# Less effective  
PageClassConfig(
    name="financial_summary",
    description="Financial pages"
)

Include Visual and Content Cues

PageClassConfig(
    name="signature_page",
    description="Pages with signature lines, signature blocks, 'Sign here' text, or actual handwritten signatures. May include date fields next to signatures."
)

Mention Common Patterns

PageClassConfig(
    name="form_page",
    description="Pages with form fields, checkboxes, fill-in-the-blank sections, or structured input areas for data entry"
)

Common Use Cases

Insurance Claims Processing

page_classifications = [
    PageClassConfig(
        name="claim_form",
        description="Insurance claim forms with policy numbers, incident details, and claimant information"
    ),
    PageClassConfig(
        name="supporting_documents", 
        description="Supporting documentation like police reports, medical records, or receipts"
    ),
    PageClassConfig(
        name="photos_evidence",
        description="Pages containing photographs, images, or visual evidence of damages"
    )
]

Contract Processing

page_classifications = [
    PageClassConfig(
        name="contract_terms",
        description="Main contract pages with terms, conditions, and legal clauses"
    ),
    PageClassConfig(
        name="signature_pages",
        description="Pages requiring signatures from parties, with signature lines and date fields"
    ),
    PageClassConfig(
        name="exhibits_attachments",
        description="Exhibits, attachments, or addendums referenced in the main contract"
    )
]

Financial Document Analysis

page_classifications = [
    PageClassConfig(
        name="executive_summary",
        description="Executive summary or overview pages with key financial highlights"
    ),
    PageClassConfig(
        name="financial_statements",
        description="Balance sheets, income statements, cash flow statements with numerical financial data"
    ),
    PageClassConfig(
        name="notes_disclosures",
        description="Footnotes, accounting policies, or disclosure pages explaining financial data"
    )
]

Page classification works with all supported document types including PDFs, Word documents, images, and more. The AI model analyzes both textual content and visual layout to make classification decisions.