Classify pages using semantic descriptions in natural language
Page Classification enables you to automatically categorize pages within documents based on their content.
This lets you label pages and filter them for downstream use cases like structured extraction and OCR.
Page Classifications work by analyzing each page of a document and assigning it to one or more predefined categories that you
specify. When you initiate a parse job, you can provide a list of page classification configurations as part of your request. Each page
classification configuration consists of:
Name: A unique identifier for the page class
Description: A detailed description that guides the AI model in identifying pages that belong to this category
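As a minimal sketch of the two fields above, the snippet below builds a small list of page classes and checks that every name is unique and every description is non-empty. The `PageClassConfig` dataclass here is a hypothetical stand-in so the example runs on its own; the class shipped with the SDK may have more fields or a different import path. `validate_page_classes` is likewise an illustrative helper, not part of any library.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the SDK's PageClassConfig (assumption: the real
# class may carry more fields or live under a different import path).
@dataclass(frozen=True)
class PageClassConfig:
    name: str
    description: str


def validate_page_classes(configs):
    """Raise ValueError if a name is repeated or a description is empty."""
    seen = set()
    for cfg in configs:
        if cfg.name in seen:
            raise ValueError(f"duplicate page class name: {cfg.name!r}")
        if not cfg.description.strip():
            raise ValueError(f"page class {cfg.name!r} needs a description")
        seen.add(cfg.name)
    return list(configs)


page_classifications = validate_page_classes([
    PageClassConfig(
        name="signature_page",
        description="Pages with signature lines, signature blocks, or handwritten signatures",
    ),
    PageClassConfig(
        name="form_page",
        description="Pages with form fields, checkboxes, or fill-in-the-blank sections",
    ),
])
```

Running a check like this before submitting a parse request catches duplicate names early, since the name is what identifies each class in the results.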
When a parse request includes page classes, Tensorlake analyzes every page of the document and assigns it to one or more of those classes.
You can combine page classification with structured data extraction to only extract data from specific page types.
This speeds up structured extraction on long documents and often improves extraction accuracy. Check out this Colab Notebook to
see an example of combining Page Classification with Structured Extraction.
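The filtering step can be sketched in a few lines. The example below assumes classification results arrive as `(page_number, [class names])` pairs, which is an illustrative shape and not the documented response format; `pages_by_class` and `select_pages` are hypothetical helpers. It shows how you might pick out only the pages worth sending to structured extraction.

```python
from collections import defaultdict

# Assumed result shape: one (page_number, [class names]) pair per page.
# The real parse result may be structured differently.
classified_pages = [
    (1, ["contract_terms"]),
    (2, ["contract_terms"]),
    (3, ["signature_pages"]),
    (4, ["exhibits_attachments", "signature_pages"]),
]


def pages_by_class(results):
    """Group page numbers under every class each page was assigned to."""
    index = defaultdict(list)
    for page_number, class_names in results:
        for name in class_names:
            index[name].append(page_number)
    return dict(index)


def select_pages(results, wanted):
    """Return page numbers that carry at least one of the wanted classes."""
    wanted = set(wanted)
    return [page for page, names in results if wanted & set(names)]


index = pages_by_class(classified_pages)
signature_pages = select_pages(classified_pages, ["signature_pages"])
```

Because a page can belong to more than one class, page 4 appears under both `exhibits_attachments` and `signature_pages` in `index`.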
The quality of your page classifications depends heavily on the descriptions you provide. Here are some tips:
Be specific and descriptive
# Good
PageClassConfig(
    name="financial_summary",
    description="Pages containing financial summaries, balance sheets, income statements, or tables with monetary values and financial metrics"
)

# Less effective
PageClassConfig(
    name="financial_summary",
    description="Financial pages"
)
Include visual and content cues
PageClassConfig(
    name="signature_page",
    description="Pages with signature lines, signature blocks, 'Sign here' text, or actual handwritten signatures. May include date fields next to signatures."
)
Mention common patterns
PageClassConfig(
    name="form_page",
    description="Pages with form fields, checkboxes, fill-in-the-blank sections, or structured input areas for data entry"
)
# Page classes for a contract
page_classifications = [
    PageClassConfig(
        name="contract_terms",
        description="Main contract pages with terms, conditions, and legal clauses"
    ),
    PageClassConfig(
        name="signature_pages",
        description="Pages requiring signatures from parties, with signature lines and date fields"
    ),
    PageClassConfig(
        name="exhibits_attachments",
        description="Exhibits, attachments, or addendums referenced in the main contract"
    )
]
# Page classes for a financial report
page_classifications = [
    PageClassConfig(
        name="executive_summary",
        description="Executive summary or overview pages with key financial highlights"
    ),
    PageClassConfig(
        name="financial_statements",
        description="Balance sheets, income statements, cash flow statements with numerical financial data"
    ),
    PageClassConfig(
        name="notes_disclosures",
        description="Footnotes, accounting policies, or disclosure pages explaining financial data"
    )
]
Page classification works with all supported document types including PDFs, Word documents, images, and more. The AI model analyzes both textual content and visual layout to make classification decisions.