Reading Documents
Understand how to convert PDF Documents to Markdown for use in AI Agents
The Document Parsing API parses a document and returns:
- A Markdown version of the document, optionally split into chunks.
- Tables encoded as Markdown or HTML.
- Document layout details such as bounding boxes and page numbers.
Core Workflow of Document Parsing
- Upload the file to Tensorlake or use a publicly accessible URL.
- Choose the configuration for the parsing job. The main options are:
  - Pages to Parse - Select the range of pages to parse. Default: All pages.
  - Table Output Mode - Choose between Markdown or HTML. We find HTML to be the more robust format for encoding tables as text, because it preserves the table structure better when tables contain merged cells or complex headers. Default: HTML.
  - Table Parsing Strategy - Choose between TSR or VLM. The TSR (Table Structure Recognition) strategy is more robust for tables with merged cells or complex headers. The VLM strategy is a last-resort approach that uses Vision Language Models to parse highly irregular tables. It often works surprisingly well on tables without much structure, but it can fail on long, dense tables. Default: TSR.
  - Chunking Strategy - Choose between None, Page, Section, or Fragment. By default no chunking is applied, and the whole document is returned as a single Markdown document. Default: None.
  - Remove Strikethrough Lines - Remove lines that are struck through. OCR models often drop these lines on their own, but not consistently. Setting this to true removes them consistently. Default: False.
  - Deliver Webhook - Deliver a webhook when the parsing job finishes. A webhook must be configured for this to work. Learn more about Webhooks. Default: False.
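The options and defaults listed above can be pictured as a small configuration object. The sketch below is illustrative only: the class name ParseConfig, its field names, and the lowercase string values are hypothetical and are not the SDK's ParsingOptions class. Only the defaults are taken from the list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


# Hypothetical stand-in for the parsing configuration. Field names and string
# values are assumptions for illustration, not the SDK's real ParsingOptions.
@dataclass
class ParseConfig:
    pages: Optional[str] = None            # None means "all pages"
    table_output_mode: str = "html"        # "markdown" or "html"
    table_parsing_strategy: str = "tsr"    # "tsr" or "vlm"
    chunking_strategy: str = "none"        # "none", "page", "section", "fragment"
    remove_strikethrough_lines: bool = False
    deliver_webhook: bool = False

    def validate(self) -> None:
        """Reject values that are not among the choices described above."""
        allowed = {
            "table_output_mode": {"markdown", "html"},
            "table_parsing_strategy": {"tsr", "vlm"},
            "chunking_strategy": {"none", "page", "section", "fragment"},
        }
        for name, choices in allowed.items():
            value = getattr(self, name)
            if value not in choices:
                raise ValueError(
                    f"{name} must be one of {sorted(choices)}, got {value!r}"
                )


# Override one option and keep the documented defaults for the rest.
config = ParseConfig(chunking_strategy="page")
config.validate()
print(asdict(config))
```

Validating the choices locally, as in this sketch, surfaces a typo in an option before any job is submitted.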
Guide to Parsing
The HTTP API for parsing is documented in the HTTP API reference.
If you are using the Python SDK, all the configuration options described above are expressed through the ParsingOptions class.
Submitting a parse request creates a new document parsing job in the pending state. The job transitions to the processing state, and then to the successful state once the document has been parsed.
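Because a job moves through several states, a client usually polls until the job is no longer pending or processing. The following is a minimal sketch of such a loop, not the SDK's own mechanism: the function wait_for_job and its parameters are hypothetical, and the status lookup is passed in as a callable so the example runs without network access. Only the state names pending, processing, and successful come from the text above.

```python
import time
from typing import Callable

# States named in the text above. Any other value is treated as terminal.
IN_PROGRESS_STATES = {"pending", "processing"}


def wait_for_job(fetch_status: Callable[[], str],
                 poll_interval: float = 2.0,
                 timeout: float = 300.0) -> str:
    """Poll until the job leaves the in-progress states or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status not in IN_PROGRESS_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(poll_interval)


# Usage with a fake status source standing in for a real job lookup.
fake_statuses = iter(["pending", "processing", "successful"])
final = wait_for_job(lambda: next(fake_statuses), poll_interval=0.0)
print(final)  # successful
```

In real use, fetch_status would wrap whatever call returns the job's current state. If a webhook is configured, polling may not be needed at all.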
Parsed Document
The parsed document can be retrieved using the /jobs/{job_id} endpoint, or using the get_job API of the Python SDK.
The response is a JSON object if you are using the HTTP API, and a Job object if you are using the Python SDK.
The outputs attribute of the response contains the following fields, which together describe the parsed document:
chunks - The chunks of the document.
num_pages - The number of pages in the document.
document - The layout of the document, including the bounding boxes and types of the detected objects, such as tables, text, and figures.
errors - The errors encountered while parsing the document.
The Outputs class has been documented in the Python SDK and in the HTTP API.
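To make the fields above concrete, here is a hedged sketch of reading them from a JSON response. The sample payload is made up, and the helper summarize_outputs is hypothetical. Only the field names chunks, num_pages, document, and errors come from the list above. Treating each chunk as a plain Markdown string is an assumption, so consult the Outputs reference for the actual chunk shape.

```python
import json


def summarize_outputs(job: dict) -> dict:
    """Collect the headline facts from the `outputs` attribute of a job response."""
    outputs = job.get("outputs") or {}
    chunks = outputs.get("chunks") or []
    return {
        "num_pages": outputs.get("num_pages"),
        "num_chunks": len(chunks),
        # Assumption: each chunk is a plain Markdown string.
        "markdown": "\n\n".join(chunks),
        "errors": outputs.get("errors") or [],
    }


# Made-up payload for illustration; real responses carry more fields.
response_text = json.dumps({
    "outputs": {
        "chunks": ["# Invoice", "Total due: 42 USD"],
        "num_pages": 1,
        "document": {"pages": []},
        "errors": [],
    }
})

summary = summarize_outputs(json.loads(response_text))
print(summary["num_pages"], summary["num_chunks"])
print(summary["markdown"])
```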
Markdown Chunks
The Markdown content of the document is available in the chunks attribute of the JSON response. The number of chunks depends on the chunking strategy you chose.
Chunking Strategy
- None - The whole document is returned as a single chunk. This allows you to use your own chunking logic.
- Page - Each page is returned as a separate chunk. You should receive as many chunks as the number of pages in the document.
- Section - The document is split into chunks based on the section headers detected in the document.
- Fragment - Every object, such as a table, figure, or paragraph, is returned as a separate chunk. You will most likely have to merge these chunks based on your use case.
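With the Fragment strategy, one simple way to merge chunks is to pack consecutive fragments greedily up to a size budget. The sketch below shows that idea under stated assumptions: merge_fragments and max_chars are hypothetical names, each fragment is assumed to be a plain Markdown string, and greedy packing is only one of many possible merging policies.

```python
from typing import List


def merge_fragments(fragments: List[str], max_chars: int = 1000) -> List[str]:
    """Greedily pack consecutive fragments into chunks of at most max_chars.

    A single fragment longer than max_chars is kept whole rather than split.
    """
    merged: List[str] = []
    current = ""
    for fragment in fragments:
        if not fragment:
            continue  # skip empty fragments
        candidate = fragment if not current else current + "\n\n" + fragment
        if current and len(candidate) > max_chars:
            # Adding this fragment would exceed the budget: close the chunk.
            merged.append(current)
            current = fragment
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


fragments = ["## Terms", "Payment is due in 30 days.", "| a | b |\n|---|---|\n| 1 | 2 |"]
for chunk in merge_fragments(fragments, max_chars=40):
    print(repr(chunk))
```

Because fragments are only joined in their original order, each merged chunk stays a contiguous slice of the document.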
Layout and Bounding Boxes
The Document Layout has the following attributes:
pages - The list of pages, encoded as a JSON object.
Page
Each page has the following attributes:
page_number - The number of this page within the document.
dimensions - The width and height of the page in pixels.
page_fragments - The list of objects on the page.
Page Fragments
Each page fragment has the following attributes:
fragment_type - The type of the object. Currently we detect the following types:
- Section Header
- Text
- Table
- Form
- Formula
- Figure
- Signature
reading_order - The position of the fragment in the page's reading order, that is, the order in which a human would read the fragments.
bounding_box - The bounding box of the page fragment, in the format [x1, y1, x2, y2].
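The layout attributes above lend themselves to two common operations: ordering fragments for reading, and scaling bounding boxes relative to the page. The sketch below illustrates both on a made-up page. The helper names are hypothetical, and two details are assumptions: that dimensions is an object with width and height keys, and that fragment_type values are encoded as lowercase strings like those shown.

```python
from typing import Dict, List


def fragments_in_reading_order(page: Dict) -> List[Dict]:
    """Return the page's fragments sorted by their reading_order value."""
    return sorted(page["page_fragments"], key=lambda f: f["reading_order"])


def relative_box(fragment: Dict, page: Dict) -> List[float]:
    """Scale [x1, y1, x2, y2] from pixels to fractions of the page size."""
    x1, y1, x2, y2 = fragment["bounding_box"]
    width = page["dimensions"]["width"]    # assumed key name
    height = page["dimensions"]["height"]  # assumed key name
    return [x1 / width, y1 / height, x2 / width, y2 / height]


# Made-up page for illustration only.
page = {
    "page_number": 1,
    "dimensions": {"width": 1000, "height": 2000},
    "page_fragments": [
        {"fragment_type": "table", "reading_order": 1,
         "bounding_box": [100, 200, 500, 1000]},
        {"fragment_type": "section_header", "reading_order": 0,
         "bounding_box": [100, 50, 900, 150]},
    ],
}

for fragment in fragments_in_reading_order(page):
    print(fragment["fragment_type"], relative_box(fragment, page))
```

Relative coordinates are convenient when drawing boxes over a page image rendered at a different resolution than the one reported in dimensions.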
Advanced Capabilities
This page covered the basic reading capabilities of the Document Parsing API. In addition, you can use Document Ingestion for:
- Structured Data Extraction
- Summarizing Tables, Figures, and Charts
- Signature Detection
- Page Classification