Parsing
API for parsing documents, chunking and layout analysis, and table extraction
The Document Parsing API parses a document and returns:
- A Markdown version of the document, optionally chunked into sections or fragments
- Structured data guided by a provided schema
- A JSON version of the document, which includes more details about the layout:
  - Bounding boxes for each text element, table, figure, and page
  - A dictionary of pages indexed by page number
  - The layout type of individual text elements, tables, figures, and other page elements
  - Tables encoded as Markdown
  - Summaries of figures and tables
Simple Example
The following steps show how to parse a document using the Document Parsing API.
Upload the Document
Upload the document to the API.
Specify Parsing Options and Parse
Specify the parsing options for this file. See Parse Options below for a breakdown of each option.
Get the Result
Get the results using the Tensorlake SDK. Learn more about the Job API to manage jobs in the Tensorlake Cloud.
Interacting with the Parsing API through the Python SDK
The following code snippets show how to parse documents using the Python SDK.
Basic file parsing
A job is immediately placed in the pending state. Every project has a limit on how many jobs can be in the processing state at any given time.
If there is enough capacity, the job transitions to the processing state and then to the successful state.
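The lifecycle described above can be modelled with a small in-memory simulation. This is only an illustration of the stated behaviour, not Tensorlake's scheduler: `ProjectQueue`, `max_processing`, `submit`, and `finish` are hypothetical names, and first-in-first-out ordering is an assumption the text does not confirm.

```python
from collections import deque


class ProjectQueue:
    """Toy model of a project's job queue (hypothetical, for illustration)."""

    def __init__(self, max_processing):
        self.max_processing = max_processing  # assumed per-project limit
        self.states = {}                      # job_id -> state name
        self._waiting = deque()               # pending jobs, oldest first

    def submit(self, job_id):
        # Every new job starts in the pending state.
        self.states[job_id] = "pending"
        self._waiting.append(job_id)
        self._promote()

    def finish(self, job_id):
        if self.states.get(job_id) != "processing":
            raise ValueError(f"job {job_id!r} is not processing")
        self.states[job_id] = "successful"
        self._promote()  # a processing slot was freed

    def _promote(self):
        # Move pending jobs to processing while capacity remains (FIFO is assumed).
        active = sum(1 for s in self.states.values() if s == "processing")
        while self._waiting and active < self.max_processing:
            self.states[self._waiting.popleft()] = "processing"
            active += 1
```

With a limit of one, a second submitted job stays pending until the first one finishes.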
Retrieving outputs
You can retrieve the parsing result using the job_id returned from the parse endpoint.
Outputs are available only once the job transitions to the successful state.
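Because outputs only exist once a job is successful, a client typically polls the job until it reaches that state. The sketch below shows one way to do that. `get_job` is a hypothetical callable standing in for whichever SDK or HTTP call returns the job, and the `"status"` key is an assumed field name.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    get_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> Dict[str, Any]:
    """Poll until the job is successful, then return the job record."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        status = job.get("status")  # assumed field name
        if status == "successful":
            return job  # outputs are only available in this state
        if status not in ("pending", "processing"):
            # Any state not named in the text above is treated as terminal.
            raise RuntimeError(f"job {job_id} ended in state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        time.sleep(poll_interval)
```

Passing the lookup in as a callable keeps the loop independent of any particular client library and makes it easy to test with a fake.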
Submitting a directory for document extraction
You can also submit a directory of files for document extraction using the Tensorlake Data Loader API. Because the loader returns an array of documents, you can iterate through them, upload each one to Tensorlake through the Files API, and store each returned file_id in an array. Then parse each file_id in that array with doc_ai.parse, saving each job_id into a second array. Finally, iterate through the job_id array to get the result of each job.
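The three passes described above can be sketched as follows. The `upload`, `parse`, and `get_result` arguments are hypothetical callables standing in for the Files API upload, the doc_ai.parse call, and the job-result lookup. Their real signatures are not shown in this section, so they are injected rather than assumed.

```python
from typing import Any, Callable, Iterable, List


def parse_documents(
    documents: Iterable[Any],
    upload: Callable[[Any], str],
    parse: Callable[[str], str],
    get_result: Callable[[str], Any],
) -> List[Any]:
    """Upload, parse, and collect results for a batch of documents."""
    # 1. Upload every document and remember its file_id.
    file_ids = [upload(doc) for doc in documents]
    # 2. Start one parse job per file_id and remember its job_id.
    job_ids = [parse(file_id) for file_id in file_ids]
    # 3. Fetch the result of each job, in submission order.
    return [get_result(job_id) for job_id in job_ids]
```

In practice `get_result` would need to wait until each job reaches the successful state before returning its output.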
Structured Extraction
Document Ingestion can also extract structured data from documents. You can optionally pass a schema to the parse endpoint, which then extracts information matching that schema from every page of the document.
For example, we can parse loan information from any mortgage document.
Specify the schema in the extraction_options parameter of the parse endpoint.
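As an illustration of what such a schema might look like, the sketch below defines a small JSON Schema for loan data and a helper that reports which required fields are missing from an extracted record. Every field name (`borrower_name`, `loan_amount`, `interest_rate`) is hypothetical, and how the schema is nested inside extraction_options is not shown here because this section does not specify it.

```python
import json

# Hypothetical loan schema; every field name here is illustrative.
loan_schema = {
    "title": "Loan",
    "type": "object",
    "properties": {
        "borrower_name": {"type": "string"},
        "loan_amount": {"type": "number"},
        "interest_rate": {"type": "number"},
    },
    "required": ["borrower_name", "loan_amount"],
}


def missing_required(schema: dict, record: dict) -> list:
    """Return the required schema fields that are absent from `record`."""
    return [name for name in schema.get("required", []) if name not in record]


# The schema is plain JSON, so it can be serialized and sent with the request.
schema_json = json.dumps(loan_schema)
```

A check like `missing_required` is a cheap way to flag pages where the extraction returned an incomplete record.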
Signature Detection
Tensorlake supports automated Signature Detection as part of the document parsing pipeline. This feature allows you to determine whether a signature is present on any page of a document, enabling validation checks, routing logic, and automation workflows.
When enabled, Signature Detection analyzes the visual content of each page and returns, for every page, a boolean flag indicating whether a signature is likely present. Signature Detection works on its own or in conjunction with structured extraction, table parsing, and form processing.
To enable Signature Detection, we recommend skipping OCR, since OCR skips signatures as non-text content. Then set detect_signature=True.
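Once per-page flags are available, the routing logic mentioned above can be very small. The sketch below assumes you have already reduced the response to a mapping from page number to boolean. The shape of the actual per-page metadata is not specified in this section, and the function names and branch labels are hypothetical.

```python
from typing import Dict, List


def pages_with_signature(page_flags: Dict[int, bool]) -> List[int]:
    """Return the page numbers whose signature flag is true, in order."""
    return sorted(page for page, signed in page_flags.items() if signed)


def route_document(page_flags: Dict[int, bool]) -> str:
    """Pick a workflow branch based on whether any page is signed."""
    return "signed" if pages_with_signature(page_flags) else "needs_signature"
```

Because the flag indicates that a signature is likely present, a workflow with strict requirements may still want a manual review step.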
Parse Options
The following options are available for the Document Parsing API. You can specify these options when you call the parse endpoint.
URL: https://api.tensorlake.ai/documents/v1/parse
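For readers calling the endpoint without the SDK, the sketch below builds (but does not send) a request for the URL above using only the standard library. Several details are assumptions rather than documented facts: the POST method, bearer-token authentication, a JSON body, and the `"file"` field that identifies the uploaded document. Check the API reference for the actual request shape.

```python
import json
import urllib.request

PARSE_URL = "https://api.tensorlake.ai/documents/v1/parse"


def build_parse_request(api_key: str, file_ref: str, options: dict) -> urllib.request.Request:
    """Build a request for the parse endpoint without sending it."""
    # "file" is an assumed field name; the options use the attribute names
    # documented in the sections below (outputMode, chunkStrategy, ...).
    body = {"file": file_ref, **options}
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",  # assumed HTTP method
    )
```

Building the request separately from sending it makes the payload easy to inspect and log before any network call is made.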
Output Modes
Attribute: outputMode
The Document Parsing API supports two output modes:
- markdown: Parses the document into Markdown, ideal for indexing and other text-based post-processing.
- json: Parses the document into JSON, which provides more details about the document’s layout and its page elements.
Default: markdown.
Chunking Strategy
Attribute: chunkStrategy
Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.
The supported strategies are:
- none: The entire parsed output is returned as one entity.
- page: Output data is separated by each individual page using a dictionary indexed by page number.
- section: The parsing model tries to detect logical sections and returns outputs separated by section.
- fragment: The parsing model uses detected fragments (such as TextBox, Table, or Figure) to separate the parsed output.
Default: none.
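One practical detail when consuming page-chunked output: JSON object keys are always strings, so a dictionary indexed by page number sorts as "1", "10", "2" unless the keys are converted to integers first. The sketch below shows the conversion. The payload is illustrative only, and the real response shape may differ.

```python
import json
from typing import Dict, List, Tuple


def pages_in_order(pages: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (page_number, content) pairs sorted numerically by page."""
    return sorted((int(number), content) for number, content in pages.items())


# Illustrative payload only; the real response shape may differ.
raw = '{"10": "last page", "2": "second page", "1": "first page"}'
ordered = pages_in_order(json.loads(raw))
```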
Table Parsing Mode
Attribute: tableParsingMode
- vlm: Uses a VLM to parse the contents of the table. This method is suitable for visually complex tables but is prone to hallucinations.
- tsr: Uses traditional OCR models for table detection and parsing. It is more accurate than the vlm mode and works well for grid-structured tables.
Default: tsr.
Form Detection Mode
Attribute: formDetectionMode
- vlm: Uses a VLM to identify questions and answers in a form. It does not provide bounding boxes for the questions and answers and is prone to hallucinations.
- objectDetection: Uses a layout detector to identify questions and answers. It does not work well with very complex forms.
Default: objectDetection.
Table Output Format
Attribute: tableOutputFormat
- markdown: Tables generated by the system follow Markdown syntax.
- html: Tables are generated in HTML using the <table>, <td>, <th>, and <tr> tags.
Default: markdown.
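If you choose the html format, the table markup can be turned into rows of cell text with the standard library alone. The sketch below is a minimal parser for simple, non-nested tables built from the four tags listed above. It does not handle merged cells, and `TableRows` and `html_table_to_rows` are illustrative names.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in a simple, non-nested table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

The resulting list of lists can be passed to `csv.writer` or a dataframe constructor for further processing.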
Table Summarization Prompt
Attribute: tableSummarizationPrompt
A prompt to help the model summarize the table. Table summarization is useful for encoding information about tables for RAG applications.
Default: Summarize the table to explain the contents for retrieval
Figure Summarization Prompt
Attribute: figureSummarizationPrompt
A prompt to help the model extract information from figures (e.g., images, diagrams, etc.). Figure summarization assists with retrieval.
Default: Summarize the figures to explain the contents for retrieval
JSON Schema
Attribute: jsonSchema
The JSON schema guides the structured extraction of data from the document. The schema is defined as a JSON object.
Default: null.
Structured Extraction Prompt
Attribute: structuredExtractionPrompt
A prompt to help the model extract structured data from the document. The structured extraction prompt guides the model to extract specific information from the document.
Default: null
Model provider
Attribute: modelProvider
A string that specifies which model provider is used to parse the document.
Default: tensorlake.
Webhook Delivery
Attribute: deliverWebhook
You can configure the API to deliver a webhook when the parsing job finishes. A webhook endpoint must already be configured for this to work.
Learn more about Webhooks.
Default: false.
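The options and defaults listed in this section can be gathered into one small container that rejects unsupported values before a request is made. This is a sketch, not part of the Tensorlake SDK: `ParseOptions`, `CHOICES`, and `to_payload` are hypothetical names, and the chunking default is taken to be the lowercase value none. The attribute names and default values are copied from the sections above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Allowed values, as listed in the sections above.
CHOICES = {
    "outputMode": {"markdown", "json"},
    "chunkStrategy": {"none", "page", "section", "fragment"},
    "tableParsingMode": {"vlm", "tsr"},
    "formDetectionMode": {"vlm", "objectDetection"},
    "tableOutputFormat": {"markdown", "html"},
}


@dataclass
class ParseOptions:
    """Hypothetical container whose defaults mirror the documented ones."""

    outputMode: str = "markdown"
    chunkStrategy: str = "none"
    tableParsingMode: str = "tsr"
    formDetectionMode: str = "objectDetection"
    tableOutputFormat: str = "markdown"
    tableSummarizationPrompt: str = "Summarize the table to explain the contents for retrieval"
    figureSummarizationPrompt: str = "Summarize the figures to explain the contents for retrieval"
    jsonSchema: Optional[Dict[str, Any]] = None
    structuredExtractionPrompt: Optional[str] = None
    modelProvider: str = "tensorlake"
    deliverWebhook: bool = False

    def __post_init__(self) -> None:
        # Reject values that the documentation does not list.
        for name, allowed in CHOICES.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r}; expected one of {sorted(allowed)}")

    def to_payload(self) -> Dict[str, Any]:
        """Return only the options that differ from their defaults."""
        defaults = ParseOptions()
        return {k: v for k, v in vars(self).items() if v != getattr(defaults, k)}
```

Sending only the non-default options keeps the request body short and lets the service apply its own defaults for everything else.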