Document Parsing
API for parsing documents, chunking and layout analysis, and table extraction
The Document Parsing API parses a document and returns:
- A Markdown version of the document, optionally chunked into sections or fragments.
- A JSON version of the document, which includes more details about the layout:
  - Bounding boxes for each text element, table, figure, and page.
  - A dictionary of pages indexed by page number.
  - The layout type of individual text elements, tables, figures, and other page elements.
- Tables encoded as LaTeX, CSV, or Markdown.
- Summarized tables in addition to the raw data.
- Figures processed by extracting any text or summarizing non-textual visual content.
Supported file types: PDF, JPEG, and PNG.
Quick Start
Upload the Document
Upload the document to the API.
Parse the Document
Parse the document using the parse_async endpoint.
Get the Result
Get the results using the Tensorlake SDK.
Structured Extraction
Document AI can also extract structured data from documents. You can optionally pass a schema to the parse endpoint, and it will extract information matching that schema from every page of the document.
For example, you can extract loan information from a mortgage document.
Specify the schema in the extraction_options parameter of the parse endpoint.
You can retrieve the result in the same way as before.
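As a rough illustration of the loan example, the sketch below builds a request body that carries a schema inside extraction_options. Writing the schema in JSON Schema style is an assumption, and the property names (borrower_name, loan_amount, interest_rate), the "file" and "schema" keys, and the build_parse_body helper are all hypothetical rather than taken from the API.

```python
import json

# Hypothetical schema for loan details, written in JSON Schema style.
# The property names are illustrative, not fields defined by the API.
LOAN_SCHEMA = {
    "type": "object",
    "properties": {
        "borrower_name": {"type": "string"},
        "loan_amount": {"type": "number"},
        "interest_rate": {"type": "number"},
    },
    "required": ["borrower_name", "loan_amount"],
}


def build_parse_body(file_ref: str, schema: dict) -> str:
    """Serialize a parse request body with the schema under extraction_options.

    Assumption: the body is JSON, the document is referenced by a "file" key,
    and the schema sits under a "schema" key. Check the API reference for the
    real field names.
    """
    body = {"file": file_ref, "extraction_options": {"schema": schema}}
    return json.dumps(body)


body = build_parse_body("uploaded-file-id", LOAN_SCHEMA)
```

Keeping the schema as a plain dictionary makes it easy to version alongside the code that consumes the extracted fields.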
Python SDK
The Python SDK lets you call the Document Parsing API from Python applications and workflows.
Submitting a file for document extraction
Once submitted, a job is immediately placed in the pending state. Every project has a limit on how many jobs can be in the processing state at any given time.
If there is enough capacity, the job transitions to the processing state and then to the successful state.
Retrieving outputs
You can retrieve the parsing result using the job ID returned from the parse endpoint.
Outputs are available only once the job transitions to the successful state.
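Because outputs only appear once a job is successful, a caller usually polls the job state before fetching results. The following is a minimal sketch of that loop, not the SDK's own mechanism: fetch_state is a hypothetical callable that returns the current state string, and the interval and timeout values are arbitrary.

```python
import time
from typing import Callable

# Job states named in the text above.
PENDING, PROCESSING, SUCCESSFUL = "pending", "processing", "successful"


def wait_for_success(
    fetch_state: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the job reports the successful state.

    fetch_state is a hypothetical callable that returns the job's current
    state string; wire it to whatever SDK or HTTP call you use.
    """
    waited = 0.0  # counts time spent sleeping, not time spent in fetch_state
    while True:
        state = fetch_state()
        if state == SUCCESSFUL:
            return
        if state not in (PENDING, PROCESSING):
            # Any state not described above (for example a failure) stops the wait.
            raise RuntimeError(f"unexpected job state: {state!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"job still {state!r} after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s
```

Passing sleep as a parameter keeps the loop testable without real delays.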
Submitting a directory for document extraction
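One way to approach a directory is to walk it, keep only the supported file types, and submit each file in turn. The sketch below assumes a hypothetical submit_file callable that uploads and parses a single file and returns its job ID; treating .jpg as a JPEG extension is also an assumption.

```python
from pathlib import Path
from typing import Callable, Dict

# Extensions for the supported file types (PDF, JPEG, PNG).
# Including ".jpg" alongside ".jpeg" is an assumption.
SUPPORTED_SUFFIXES = {".pdf", ".jpeg", ".jpg", ".png"}


def submit_directory(directory: str, submit_file: Callable[[Path], str]) -> Dict[str, str]:
    """Submit every supported file in a directory and map file name to job ID.

    submit_file is a hypothetical callable that uploads and parses one file
    and returns its job ID. Subdirectories are not traversed.
    """
    jobs = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            jobs[path.name] = submit_file(path)
    return jobs
```

Since each project limits how many jobs can be processing at once, jobs submitted this way may sit in the pending state for a while.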
Listing and deleting outputs
An output is a parsed document generated by the system. You can access the output through the job abstraction.
Listing Jobs
You can list all the jobs in a project using the following API call:
Deleting generated outputs
To remove the generated data, delete the job that produced it.
Parse API Reference
URL: https://api.tensorlake.ai/documents/v1/parse
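For readers calling the endpoint without the SDK, the sketch below constructs (but does not send) a POST request to this URL using only the Python standard library. Only the URL and the option names come from this reference; bearer-token authentication, the JSON body, and the "file" key are assumptions to verify against the official API documentation.

```python
import json
import urllib.request

PARSE_URL = "https://api.tensorlake.ai/documents/v1/parse"


def build_parse_request(api_key: str, file_ref: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the parse endpoint.

    Assumptions: bearer-token auth, a JSON body, and a "file" key that
    references the uploaded document.
    """
    body = {"file": file_ref, **options}
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_parse_request("YOUR_API_KEY", "uploaded-file-id", outputMode="json")
# Sending it would be: urllib.request.urlopen(request)
```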
Output Modes
Attribute: outputMode
The Document Parsing API supports two output modes:
- markdown: Parses the document into Markdown, ideal for indexing and other text-based post-processing.
- json: Parses the document into JSON, which provides more details about the document’s layout and its page elements.
Default: markdown.
Chunking Strategy
Attribute: chunkStrategy
Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.
The supported strategies are:
- none: The entire parsed output is returned as one entity.
- page: Output is separated by page, using a dictionary indexed by page number.
- section: The parsing model tries to detect logical sections and returns outputs separated by section.
- fragment: The parsing model uses detected fragments (such as TextBox, Table, or Figure) to separate the parsed output.
Default: none.
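With the page strategy, output arrives as a dictionary keyed by page number. If that dictionary has passed through JSON, its keys are strings, and sorting them as text would place "10" before "2". The sketch below assumes a hypothetical payload that maps page-number strings to page content and iterates it in numeric order; the real response shape may differ.

```python
from typing import Dict, Iterator, Tuple


def pages_in_order(pages: Dict[str, str]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, content) pairs sorted numerically by page number."""
    for key in sorted(pages, key=int):
        yield int(key), pages[key]


# Hypothetical page-chunked output; the real payload shape may differ.
sample = {"10": "Appendix", "2": "Terms", "1": "Cover"}
ordered = list(pages_in_order(sample))
```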
Table Parsing Mode
Attribute: tableParsingMode
- vlm: Uses a vision-language model (VLM) to parse the contents of the table. This method is suitable for visually complex tables but is prone to hallucinations.
- tsr: Uses traditional OCR models for table detection and parsing. It is more accurate than VLM parsing and works well for grid-structured tables.
Default: tsr.
Form Detection Mode
Attribute: formDetectionMode
- vlm: Uses a VLM to identify questions and answers in a form. It does not provide bounding boxes for the questions and answers and is prone to hallucinations.
- objectDetection: Uses a layout detector to identify questions and answers. It does not work well with very complex forms.
Default: objectDetection.
Table Output Format
Attribute: tableOutputFormat
- markdown: Tables generated by the system follow Markdown syntax.
- html: Tables are generated in HTML using the <table>, <td>, <th>, and <tr> tags.
Default: markdown.
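When tables come back as HTML, downstream code often wants plain rows. The following sketch turns a table built from the tags listed above into a list of rows using Python's standard html.parser module. The sample markup is illustrative rather than real API output, and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table as a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Illustrative input, not real API output.
sample = "<table><tr><th>Term</th><th>Rate</th></tr><tr><td>30 yr</td><td>6.5%</td></tr></table>"
rows = html_table_to_rows(sample)
```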
Table Summarization Prompt
Attribute: tableSummarizationPrompt
A prompt to help the model summarize the table. Table summarization is useful for encoding information about tables for RAG applications.
Default: Summarize the table to explain the contents for retrieval
Figure Summarization Prompt
Attribute: figureSummarizationPrompt
A prompt to help the model extract information from figures (e.g., images, diagrams, etc.). Figure summarization assists with retrieval.
Default: Summarize the figures to explain the contents for retrieval
Webhook Delivery
Attribute: deliverWebhook
You can configure the API to deliver a webhook when the parsing job finishes. Learn more about Webhooks.
Default: false.
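The attributes in this reference can be gathered into a single options dictionary and checked before a request is made. The sketch below is one possible client-side helper, not part of the SDK: build_options is a hypothetical name, the defaults and allowed values are copied from the sections above, and the chunkStrategy default is assumed to be the none strategy.

```python
from typing import Any, Dict

# Defaults as listed in this reference. The chunkStrategy default is assumed
# to mean the "none" strategy.
DEFAULTS: Dict[str, Any] = {
    "outputMode": "markdown",
    "chunkStrategy": "none",
    "tableParsingMode": "tsr",
    "formDetectionMode": "objectDetection",
    "tableOutputFormat": "markdown",
    "tableSummarizationPrompt": "Summarize the table to explain the contents for retrieval",
    "figureSummarizationPrompt": "Summarize the figures to explain the contents for retrieval",
    "deliverWebhook": False,
}

# Allowed values for the enumerated attributes.
ALLOWED = {
    "outputMode": {"markdown", "json"},
    "chunkStrategy": {"none", "page", "section", "fragment"},
    "tableParsingMode": {"vlm", "tsr"},
    "formDetectionMode": {"vlm", "objectDetection"},
    "tableOutputFormat": {"markdown", "html"},
}


def build_options(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults, rejecting unknown names or values."""
    options = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown option: {name}")
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}, got {value!r}")
        options[name] = value
    return options


options = build_options(outputMode="json", chunkStrategy="page")
```

Validating locally catches a misspelled attribute or value before a job is submitted.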