The Document Parsing API parses a document and returns:

  • A Markdown version of the document, optionally chunked into sections or fragments
  • Structured data guided by a provided schema
  • A JSON version of the document, which includes more details about the layout:
    • Bounding boxes for each text element, table, figure, and page
    • A dictionary of pages indexed by page number
    • The layout type of individual text elements, tables, figures, and other page elements
  • Tables encoded as Markdown
  • Summary of figures and tables
All the code examples on this page use the official Tensorlake Python SDK. For other languages, please consult our API Reference.

Simple Example

The following code snippet shows how to parse a document using the Document Parsing API.

1. Upload the Document

Upload the document to the API.

upload_and_parse_file.py
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.parse import (
    ParsingOptions,
    ExtractionOptions,
    TableOutputMode,
    TableParsingStrategy,
)

doc_ai = DocumentAI(api_key="xxxx")

file_id = doc_ai.upload(path="/path/to/file.pdf")

2. Specify Parsing Options and Parse

Specify the parsing options for this file. Each option is described under Parse Options below.

options = ParsingOptions(
    table_output_mode=TableOutputMode.MARKDOWN,
    table_parsing_strategy=TableParsingStrategy.VLM,
    extraction_options=ExtractionOptions(
        provider="tensorlake",
    ),
    page_range='1-2'
)

job_id = doc_ai.parse(file_id, options)

3. Get the Result

Get the results using the Tensorlake SDK. Learn more about the Job API to manage jobs in the Tensorlake Cloud.

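import time

# Poll until the job leaves the pending/processing states.
result = doc_ai.get_job(job_id)
while result.status in ["pending", "processing"]:
    print("waiting 5s...")
    time.sleep(5)
    result = doc_ai.get_job(job_id)
    print(f"job status: {result.status}")

if result.status == "successful":
    print(result)
else:
    # Any other final status means the job did not succeed.
    print(f"job ended with status: {result.status}")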
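
The polling loop above can be wrapped in a reusable helper with a timeout, so that a stuck job does not block forever. This is a minimal sketch: wait_for_job, poll_seconds, and timeout_seconds are hypothetical names, not part of the SDK, and the only status values assumed are the ones named on this page.

```python
import time
from typing import Any, Callable

# States named on this page; any other status is treated as final.
IN_PROGRESS = {"pending", "processing"}


def wait_for_job(
    get_job: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> Any:
    """Poll get_job(job_id) until the job leaves the in-progress states.

    Hypothetical helper, not part of the SDK. Raises TimeoutError if the
    job is still in progress once timeout_seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = get_job(job_id)
        if result.status not in IN_PROGRESS:
            return result  # "successful" or any other final status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} is still {result.status}")
        time.sleep(poll_seconds)
```

With the client from the steps above, this would be called as wait_for_job(doc_ai.get_job, job_id). The caller still needs to check result.status, since the helper returns on any final state, not only on success.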

Interacting with the Parsing API through the Python SDK

The following code snippets show how to parse documents using the Python SDK.

Basic file parsing

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key="xxxx")
file_id = doc_ai.upload("/path/to/file")
job_id = doc_ai.parse(file_id)

A job will immediately be placed in the pending state. Every project has a limit on how many jobs can be in the processing state at any given time.

If there is enough capacity, the job will transition to the processing state and then to the successful state.

You can also provide a publicly accessible URL to the parse operation.

Retrieving outputs

You can retrieve the result of the parsing using the job_id returned from the parse endpoint.

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key="xxxx")
result = doc_ai.get_job(job_id)

To learn more about the Jobs API, see the Jobs page.

Outputs are available only once the job transitions to the successful state.

Submitting a directory for document extraction

You can also submit a directory of files for document extraction using the Tensorlake Data Loader API. The loader returns an array of documents, so you can iterate over them, upload each one through the Files API, and collect the returned file_id values in an array. Then call doc_ai.parse on each file_id and collect the returned job_id values in a second array. Finally, iterate over the job_id array to get the result of each job.

from tensorlake.data_loaders import LocalDirectoryLoader
from tensorlake.documentai import DocumentAI

loader = LocalDirectoryLoader("/path/to/files", file_extensions=[".pdf"])
loaded_files = loader.load()

doc_ai = DocumentAI(api_key="xxxx")

file_ids = []
for file in loaded_files:
    file_id = doc_ai.upload(file)
    file_ids.append(file_id)

jobs = []
for file_id in file_ids:
    job_id = doc_ai.parse(file_id)
    jobs.append(job_id)
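
The last step, iterating over the job_id array to get each result, could be sketched as below. collect_results is a hypothetical helper, not part of the SDK; it only assumes a client object with the get_job method used elsewhere on this page and the "pending" and "processing" status values.

```python
import time
from typing import Any, Dict, Iterable


def collect_results(
    doc_ai: Any,
    job_ids: Iterable[str],
    poll_seconds: float = 5.0,
    max_rounds: int = 120,
) -> Dict[str, Any]:
    """Return {job_id: result} for every job that reached a final state.

    Hypothetical helper. Jobs still in progress after max_rounds polling
    rounds are left out of the returned dictionary.
    """
    remaining = list(job_ids)
    finished: Dict[str, Any] = {}
    for _ in range(max_rounds):
        still_running = []
        for job_id in remaining:
            result = doc_ai.get_job(job_id)
            if result.status in ("pending", "processing"):
                still_running.append(job_id)
            else:
                finished[job_id] = result
        remaining = still_running
        if not remaining:
            break
        time.sleep(poll_seconds)
    return finished
```

With the arrays built above, results = collect_results(doc_ai, jobs) would poll all jobs in rounds rather than waiting on them one at a time, which keeps a single slow document from delaying the rest.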

Structured Extraction

Document Ingestion can also extract structured data from documents. You can optionally pass a schema to the parse endpoint, and it will extract information according to that schema from every page of the document.

For example, you can extract loan information from a mortgage document.

from pydantic import BaseModel, Field

class LoanSchema(BaseModel):
    account_number: str = Field(description="Account number of the customer")
    customer_name: str = Field(description="Name of the customer")
    amount_due: str = Field(description="Total amount due in the current statement")
    due_date: str = Field(description="Due date")

Specify the schema in the extraction_options parameter in the parse endpoint.

options = ParsingOptions(
    extraction_options=ExtractionOptions(
        schema=LoanSchema
    )
)

job_id = doc_ai.parse(file_id, options)

Learn how to retrieve the structured data from the job result with the Jobs API.

Signature Detection

Tensorlake supports automated Signature Detection as part of the document parsing pipeline. This feature allows you to determine whether a signature is present on any page of a document, enabling validation checks, routing logic, and automation workflows.

Use Cases for Signature Detection

Explore use cases for signature detection.

When enabled, Signature Detection analyzes the visual content of each page and returns, for every page, a boolean flag indicating whether a signature is likely present. Signature Detection can be used on its own or together with structured extraction, table parsing, and form processing.

To enable Signature Detection, set detect_signature=True. We also recommend skipping OCR (skip_ocr=True), because OCR treats signatures as non-text and skips them:

from tensorlake.documentai.parse import ExtractionOptions, ParsingOptions

options = ParsingOptions(
    extraction_options=ExtractionOptions(
        skip_ocr=True,
    ),
    detect_signature=True,
)
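
Once the per-page flags are available, a validation check like the one described above could look like the following sketch. This page does not show how the flags appear in the job result, so the example takes them as a plain mapping from page number to boolean; pages_missing_signature and that mapping shape are assumptions made for illustration only.

```python
from typing import Iterable, List, Mapping


def pages_missing_signature(
    signature_flags: Mapping[int, bool],
    required_pages: Iterable[int],
) -> List[int]:
    """Return the required pages whose flag is absent or False, ascending.

    Hypothetical helper: signature_flags maps page number -> flag and is
    an assumed shape, not the documented job result format.
    """
    return sorted(
        page for page in set(required_pages)
        if not signature_flags.get(page, False)
    )


# Made-up flags for a three-page document where pages 2 and 3 must be signed.
flags = {1: False, 2: True, 3: False}
print(pages_missing_signature(flags, required_pages=[2, 3]))  # [3]
```

A non-empty return value can then drive routing logic, for example sending the document back for signing.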

Parse Options

The following options are available for the Document Parsing API. You can specify these options when you call the parse endpoint.

from tensorlake.documentai import DocumentAI
from tensorlake.documentai.parse import (
    ParsingOptions,
    TableOutputMode,
    TableParsingStrategy,
    ChunkingStrategy,
    FormDetectionMode,
    ExtractionOptions,
)

URL: https://api.tensorlake.ai/documents/v1/parse

Output Modes

Attribute: outputMode

The Document Parsing API supports two output modes:

  • markdown - Parses the document into Markdown, ideal for indexing and other text-based post-processing.
  • json - Parses the document into JSON, which provides more details about the document’s layout and its page elements.

Default: markdown.

Chunking Strategy

Attribute: chunkStrategy

Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.

The supported strategies are:

  • none - The entire parsed output is returned as one entity.
  • page - Output data is separated by each individual page using a dictionary indexed by page number.
  • section - The parsing model tries to detect logical sections and returns outputs separated by section.
  • fragment - The parsing model uses detected fragments (such as TextBox, Table, or Figure) to separate the parsed output.

Default: none.
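
To make the difference between the four strategies concrete, here is a toy illustration of how the same parsed content would be grouped under each one. It is not the service's implementation: Fragment, chunk, and the sample data are hypothetical, and the real output shapes may differ in detail.

```python
from itertools import groupby
from typing import Dict, List, NamedTuple, Union


class Fragment(NamedTuple):
    """One detected page element (hypothetical shape, for illustration)."""
    page: int
    section: str
    kind: str  # e.g. "TextBox", "Table", "Figure"
    text: str


def chunk(
    fragments: List[Fragment], strategy: str
) -> Union[str, Dict[int, str], List[str]]:
    """Group fragments (given in document order) by the named strategy."""
    if strategy == "none":
        # One entity containing the whole output.
        return "\n".join(f.text for f in fragments)
    if strategy == "page":
        # Dictionary indexed by page number.
        pages: Dict[int, List[str]] = {}
        for f in fragments:
            pages.setdefault(f.page, []).append(f.text)
        return {page: "\n".join(texts) for page, texts in pages.items()}
    if strategy == "section":
        # One chunk per run of consecutive fragments in the same section.
        return [
            "\n".join(f.text for f in group)
            for _, group in groupby(fragments, key=lambda f: f.section)
        ]
    if strategy == "fragment":
        # One chunk per detected element.
        return [f.text for f in fragments]
    raise ValueError(f"unknown chunking strategy: {strategy!r}")


doc = [
    Fragment(1, "Intro", "TextBox", "Welcome."),
    Fragment(1, "Terms", "TextBox", "Rates apply."),
    Fragment(2, "Terms", "Table", "| a | b |"),
]
print(sorted(chunk(doc, "page")))    # [1, 2]
print(len(chunk(doc, "section")))    # 2
print(len(chunk(doc, "fragment")))   # 3
```

Note that the "Terms" section spans both pages, so page chunking splits it while section chunking keeps it together.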

Table Parsing Mode

Attribute: tableParsingMode

  • vlm - Uses a VLM to parse the contents of the table. This method is suitable for visually complex tables but is prone to hallucinations.
  • tsr - Uses traditional OCR models for table detection and parsing. It is more accurate than VLM and works well for grid-structured tables.

Default: tsr.

Form Detection Mode

Attribute: formDetectionMode

  • vlm - Uses a VLM to identify questions and answers in a form. It does not provide bounding boxes for the questions and answers and is prone to hallucinations.
  • objectDetection - Uses a layout detector to identify questions and answers. It does not work well with very complex forms.

Default: objectDetection.

Table Output Format

Attribute: tableOutputFormat

  • markdown - Tables generated by the system follow Markdown syntax.
  • html - Tables are generated in HTML using the <table>, <td>, <th>, and <tr> tags.

Default: markdown.

Table Summarization Prompt

Attribute: tableSummarizationPrompt

A prompt to help the model summarize the table. Table summarization is useful for encoding information about tables for RAG applications.

Default: Summarize the table to explain the contents for retrieval

Figure Summarization Prompt

Attribute: figureSummarizationPrompt

A prompt to help the model extract information from figures (e.g., images, diagrams, etc.). Figure summarization assists with retrieval.

Default: Summarize the figures to explain the contents for retrieval

JSON Schema

Attribute: jsonSchema

The JSON schema guides the structured extraction of data from the document. The schema is defined as a JSON object.

Default: null.
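
As an illustration of a schema defined as a JSON object, the sketch below writes a subset of the loan fields from the Structured Extraction section in standard JSON Schema form. The variable name loan_json_schema and the choice to mark every field as required are assumptions; how the object is attached to a request is not shown here.

```python
import json

# A subset of the loan fields from the Structured Extraction section,
# expressed as a JSON Schema object (illustrative, not an official sample).
loan_json_schema = {
    "type": "object",
    "properties": {
        "account_number": {
            "type": "string",
            "description": "Account number of the customer",
        },
        "customer_name": {
            "type": "string",
            "description": "Name of the customer",
        },
        "amount_due": {
            "type": "string",
            "description": "Total amount due in the current statement",
        },
    },
    # Assumption: all three fields are expected in every document.
    "required": ["account_number", "customer_name", "amount_due"],
}

# Serialize to the JSON text that would be sent as the jsonSchema value.
print(json.dumps(loan_json_schema, indent=2))
```

The field descriptions matter here, since they are the only guidance the extraction model gets about what each field means.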

Structured Extraction Prompt

Attribute: structuredExtractionPrompt

A prompt that guides the model toward the specific information to extract as structured data from the document.

Default: null

Model provider

Attribute: modelProvider

The model provider to use for parsing the document, given as a string.

Default: tensorlake.

Webhook Delivery

Attribute: deliverWebhook

You can configure the API to deliver a webhook when the parsing job finishes. A webhook needs to be configured for this to work.

Learn more about Webhooks.

Default: false.
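
The attributes and defaults listed above can be gathered into one options object and checked before a request is sent. The sketch below is a hypothetical client-side helper, not part of the SDK or the API: it assumes the attributes form a flat JSON object, treats the chunkStrategy default as the none strategy, and leaves out how the file itself is referenced, since that is not covered on this page.

```python
from typing import Any, Dict

# Allowed values for the enumerated attributes, as listed on this page.
ALLOWED: Dict[str, set] = {
    "outputMode": {"markdown", "json"},
    "chunkStrategy": {"none", "page", "section", "fragment"},
    "tableParsingMode": {"vlm", "tsr"},
    "formDetectionMode": {"vlm", "objectDetection"},
    "tableOutputFormat": {"markdown", "html"},
}

# Defaults as documented above (null -> None, false -> False).
DEFAULTS: Dict[str, Any] = {
    "outputMode": "markdown",
    "chunkStrategy": "none",
    "tableParsingMode": "tsr",
    "formDetectionMode": "objectDetection",
    "tableOutputFormat": "markdown",
    "tableSummarizationPrompt": "Summarize the table to explain the contents for retrieval",
    "figureSummarizationPrompt": "Summarize the figures to explain the contents for retrieval",
    "jsonSchema": None,
    "structuredExtractionPrompt": None,
    "modelProvider": "tensorlake",
    "deliverWebhook": False,
}


def build_parse_options(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and validate them.

    Hypothetical helper: rejects attribute names not listed on this page
    and values outside the documented choices.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if options[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return options


options = build_parse_options(outputMode="json", chunkStrategy="page")
print(options["outputMode"], options["tableParsingMode"])  # json tsr
```

Validating locally catches misspelled attribute names and values early, before a job is submitted and queued.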