Document Parsing
API for parsing documents, chunking and layout analysis, and table extraction
The Document Parsing API parses a document and returns:
- A Markdown version of the document, optionally chunked into sections or fragments.
- Structured data guided by a provided schema.
- A JSON version of the document, which includes more details about the layout:
  - Bounding boxes for each text element, table, figure, and page.
  - A dictionary of pages indexed by page number.
  - The layout type of individual text elements, tables, figures, and other page elements.
  - Tables encoded as Markdown.
  - Summaries of figures and tables.
Supported file types: PDF, JPEG, PNG.
Quick Start
Upload the Document
Upload the document to the API.
Parse the Document
Parse the document using the parse_async endpoint.
Get the Result
Get the results using the Tensorlake SDK.
Python SDK
The Python SDK lets you call the Document Parsing API from Python applications and workflows.
Submitting a file for document extraction
A job is placed in the pending state as soon as it is submitted. Every project has a limit on how many jobs can be in the processing state at any given time.
If there is enough capacity, the job transitions to the processing state and then to the successful state.
Retrieving outputs
You can retrieve the result of the parsing using the job ID returned from the parse endpoint.
Outputs are available only once the job transitions to the successful state.
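The wait for the successful state can be sketched as a small polling loop. This is an illustration under assumptions, not the SDK's own code: `wait_for_job` and the `fetch_status` callable are hypothetical names, and the `failed` state is assumed (this section names only pending, processing, and successful). In practice, `fetch_status` would wrap whatever SDK or HTTP call returns the job's current state.

```python
import time
from typing import Callable

# States named in this section, plus an ASSUMED terminal "failed" state.
PENDING, PROCESSING, SUCCESSFUL, FAILED = "pending", "processing", "successful", "failed"


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], str],  # hypothetical: returns the job's current state
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is successful; raise on failure or timeout."""
    waited = 0.0
    while True:
        status = fetch_status(job_id)
        if status == SUCCESSFUL:
            return status  # outputs can be retrieved from this point on
        if status == FAILED:
            raise RuntimeError(f"job {job_id} failed")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {status} after {waited:.0f}s")
        sleep(interval_s)  # job is still pending or processing
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays.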
Submitting a directory for document extraction
Listing and deleting outputs
An output is a parsed document generated by the system. You can access the output through the job abstraction.
Listing Jobs
You can list all the jobs in a project using the following API call:
Deleting generated outputs
The generated data can be removed by deleting the job returned by the system.
Structured Extraction
The Document Parsing API can also extract structured data from documents. You can optionally pass a schema to the parse endpoint, and it extracts information according to that schema from every page of the document.
For example, you can extract loan information from a mortgage document.
Specify the schema in the extraction_options parameter in the parse endpoint.
You can retrieve the result in the same way as before.
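As an illustration of the loan example, a schema might look like the sketch below. Every field name in it is hypothetical and chosen only for demonstration. The request body uses the `jsonSchema` and `structuredExtractionPrompt` attribute names from the Parse API Reference; the `file` key and the `build_extraction_request` helper are assumptions, not confirmed parts of the API.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical loan fields, written as a standard JSON Schema object.
loan_schema: Dict[str, Any] = {
    "title": "LoanInformation",
    "type": "object",
    "properties": {
        "borrower_name": {"type": "string"},
        "loan_amount": {"type": "number"},
        "interest_rate": {"type": "number"},
        "term_months": {"type": "integer"},
    },
    "required": ["borrower_name", "loan_amount"],
}


def build_extraction_request(
    file_id: str, schema: Dict[str, Any], prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Assemble a request body; the "file" key is an assumed name."""
    body: Dict[str, Any] = {"file": file_id, "jsonSchema": schema}
    if prompt is not None:
        body["structuredExtractionPrompt"] = prompt
    return body


body = build_extraction_request("file_123", loan_schema)
print(json.dumps(body, indent=2))
```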
Parse API Reference
URL: https://api.tensorlake.ai/documents/v1/parse
Output Modes
Attribute: outputMode
The Document Parsing API supports two output modes:
- markdown: Parses the document into Markdown, ideal for indexing and other text-based post-processing.
- json: Parses the document into JSON, which provides more details about the document’s layout and its page elements.
Default: markdown.
Chunking Strategy
Attribute: chunkStrategy
Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.
The supported strategies are:
- none: The entire parsed output is returned as one entity.
- page: Output data is separated by each individual page using a dictionary indexed by page number.
- section: The parsing model tries to detect logical sections and returns outputs separated by section.
- fragment: The parsing model uses detected fragments (such as TextBox, Table, or Figure) to separate the parsed output.
Default: none.
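To make the difference between strategies concrete, the following local sketch groups a list of already-detected fragments the way the none, page, and fragment strategies are described above. It is not the service's implementation: the `Fragment` tuple and the `chunk` function are hypothetical, and the section strategy is omitted because it depends on the parsing model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

# Hypothetical fragment shape: (page_number, fragment_type, text).
Fragment = Tuple[int, str, str]


def chunk(
    fragments: List[Fragment], strategy: str = "none"
) -> Union[str, Dict[int, str], List[str]]:
    """Group parsed fragments according to a chunking strategy."""
    if strategy == "none":
        # One entity containing the whole document.
        return "\n\n".join(text for _, _, text in fragments)
    if strategy == "page":
        # Dictionary indexed by page number.
        pages: Dict[int, List[str]] = defaultdict(list)
        for page, _, text in fragments:
            pages[page].append(text)
        return {page: "\n\n".join(texts) for page, texts in pages.items()}
    if strategy == "fragment":
        # One chunk per detected fragment (TextBox, Table, Figure, ...).
        return [text for _, _, text in fragments]
    raise ValueError(f"unsupported strategy: {strategy!r}")
```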
Table Parsing Mode
Attribute: tableParsingMode
- vlm: Uses a VLM to parse the contents of the table. This method is suitable for visually complex tables but is prone to hallucinations.
- tsr: Uses traditional OCR models for table detection and parsing. It is more accurate than VLM and works well for grid-structured tables.
Default: tsr.
Form Detection Mode
Attribute: formDetectionMode
- vlm: Uses a VLM to identify questions and answers in a form. It does not provide bounding boxes for the questions and answers and is prone to hallucinations.
- objectDetection: Uses a layout detector to identify questions and answers. It does not work well with very complex forms.
Default: objectDetection.
Table Output Format
Attribute: tableOutputFormat
- markdown: Tables generated by the system follow Markdown syntax.
- html: Tables are generated in HTML using the <table>, <td>, <th>, and <tr> tags.
Default: markdown.
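The sketch below renders one small table in both formats so the two output styles can be compared side by side. It is a local illustration only: `to_markdown` and `to_html` are hypothetical helpers, and the exact markup the service emits (whitespace, attributes, header handling) may differ.

```python
from html import escape
from typing import List


def to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Render a table using Markdown pipe syntax."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header: List[str], rows: List[List[str]]) -> str:
    """Render the same table with <table>, <tr>, <th>, and <td> tags."""
    head = "<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{head}{body}</table>"


print(to_markdown(["Item", "Qty"], [["Pen", "2"]]))
print(to_html(["Item", "Qty"], [["Pen", "2"]]))
```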
Table Summarization Prompt
Attribute: tableSummarizationPrompt
A prompt to help the model summarize the table. Table summarization is useful for encoding information about tables for RAG applications.
Default: Summarize the table to explain the contents for retrieval
Figure Summarization Prompt
Attribute: figureSummarizationPrompt
A prompt to help the model extract information from figures (e.g., images, diagrams, etc.). Figure summarization assists with retrieval.
Default: Summarize the figures to explain the contents for retrieval
JSON Schema
Attribute: jsonSchema
The JSON schema guides the structured extraction of data from the document. The schema is defined as a JSON object.
Default: null.
Structured Extraction Prompt
Attribute: structuredExtractionPrompt
A prompt that guides the model toward the specific information to extract as structured data from the document.
Default: null
Model Provider
Attribute: modelProvider
A string that specifies the model provider used to parse the document.
Default: tensorlake.
Webhook Delivery
Attribute: deliverWebhook
You can configure the API to deliver a webhook when the parsing job finishes. A webhook needs to be configured for this to work.
Learn more about Webhooks.
Default: false.
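The attributes and defaults documented in this reference can be collected into a small options builder that rejects unsupported values before a request is made. This is a sketch under assumptions: the attribute names, allowed values, defaults, and URL come from this reference, but the `file` key, the bearer-token header, the JSON request body, and the helper names are assumed rather than confirmed. The code only constructs the request object and does not send it.

```python
import json
import urllib.request
from typing import Any, Dict

PARSE_URL = "https://api.tensorlake.ai/documents/v1/parse"

# Defaults as documented in this reference.
DEFAULTS: Dict[str, Any] = {
    "outputMode": "markdown",
    "chunkStrategy": "none",
    "tableParsingMode": "tsr",
    "formDetectionMode": "objectDetection",
    "tableOutputFormat": "markdown",
    "tableSummarizationPrompt": "Summarize the table to explain the contents for retrieval",
    "figureSummarizationPrompt": "Summarize the figures to explain the contents for retrieval",
    "jsonSchema": None,
    "structuredExtractionPrompt": None,
    "modelProvider": "tensorlake",
    "deliverWebhook": False,
}

# Enumerated attributes and their documented values.
ALLOWED = {
    "outputMode": {"markdown", "json"},
    "chunkStrategy": {"none", "page", "section", "fragment"},
    "tableParsingMode": {"vlm", "tsr"},
    "formDetectionMode": {"vlm", "objectDetection"},
    "tableOutputFormat": {"markdown", "html"},
}


def build_parse_options(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and validate enumerated values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown attributes: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if options[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return options


def build_request(file_id: str, api_key: str, **overrides: Any) -> urllib.request.Request:
    """Build (but do not send) a POST request; body and header shape are ASSUMED."""
    body = {"file": file_id, **build_parse_options(**overrides)}
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
```

Validating locally catches a misspelled mode such as `chunkStrategy="pages"` before any job is submitted.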