The Document Parsing API parses a document and returns:

  • A Markdown version of the document, optionally chunked into sections or fragments.
  • A JSON version of the document, which includes more details about the layout:
    • Bounding boxes for each text element, table, figure, and page.
    • A dictionary of pages indexed by page number.
    • The layout type of individual text elements, tables, figures, and other page elements.
  • Tables encoded as LaTeX, CSV, or Markdown.
  • Summarized tables in addition to the raw data.
  • Figures processed by extracting any text or summarizing non-textual visual content.

Supported file types: PDF, JPEG, and PNG.

All the code examples on this page use the official Tensorlake Python SDK. For other languages, please consult our API Reference.

Quick Start

1. Upload the Document

Upload the document to the API.

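# Assumption: OutputFormat is exported from the same module as the other option types.
from tensorlake.documentai import DocumentAI, OutputFormat, ParsingOptions, TableOutputMode, TableParsingStrategy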

doc_ai = DocumentAI(api_key="xxxx")

file_id = doc_ai.upload(path="/path/to/finance_stock_report.pdf")

2. Parse the Document

Submit a parse job with doc_ai.parse. The call returns a job ID rather than the parsed output.

options = ParsingOptions(
    format=OutputFormat.MARKDOWN,
    table_output_mode=TableOutputMode.MARKDOWN,
    table_parsing_strategy=TableParsingStrategy.VLM,
    page_range='1-2'
)

job_id = doc_ai.parse(file_id, options)

3. Get the Result

Poll the job with the Tensorlake SDK until it leaves the pending and processing states, then read the result.

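import time

result = doc_ai.get_job(job_id)
# Poll until the job leaves the in-progress states.
while result.status in ["pending", "processing"]:
    print("waiting 5s...")
    time.sleep(5)
    result = doc_ai.get_job(job_id)
    print(f"job status: {result.status}")

# Any other status is terminal; only a successful job has outputs.
if result.status == "successful":
    print(result)
else:
    print(f"job ended with status: {result.status}")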
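
The polling loop above has no upper bound on how long it waits. A minimal sketch of a bounded variant follows. The helper name wait_for_job and its interval and timeout parameters are hypothetical and not part of the Tensorlake SDK; the helper accepts any zero-argument callable that returns a status string, and it treats every status other than "pending" and "processing" as terminal.

```python
import time
from typing import Callable

# In-progress status names as used on this page; anything else is treated as terminal.
IN_PROGRESS = {"pending", "processing"}


def wait_for_job(
    get_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Poll get_status until the job leaves the in-progress states.

    Hypothetical helper, not part of the Tensorlake SDK.
    Raises TimeoutError if the job is still in progress after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(interval)
```

With the SDK objects from the steps above, this could be called as wait_for_job(lambda: doc_ai.get_job(job_id).status).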

Structured Extraction

Document AI can also extract structured data from documents. You can optionally pass a schema to the parse endpoint, and the API extracts information matching that schema from every page of the document.

For example, we can extract loan information from a mortgage document.

from pydantic import BaseModel, Field

class LoanSchema(BaseModel):
    account_number: str = Field(description="Account number of the customer")
    customer_name: str = Field(description="Name of the customer")
    amount_due: str = Field(description="Total amount due in the current statement")
    due_date: str = Field(description="Due Date")

Specify the schema in the extraction_options parameter in the parse endpoint.

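# Assumption: ExtractionOptions is importable from the same module as ParsingOptions.
from tensorlake.documentai import ExtractionOptions, ParsingOptions

options = ParsingOptions(
    extraction_options=ExtractionOptions(
        schema=LoanSchema
    )
)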

job_id = doc_ai.parse(file_id, options)

You can retrieve the result in the same way as before.

Python SDK

The Python SDK lets you call the Document Parsing API from Python applications and workflows.

Every operation in the SDK also has an asynchronous implementation.

Submitting a file for document extraction

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key="xxxx")
file_id = doc_ai.upload("/path/to/file")
job_id = doc_ai.parse(file_id)

A job will immediately be placed in the pending state. Every project has a limit on how many jobs can be in the processing state at any given time.

If there is enough capacity, the job will transition to the processing state and then to the successful state.

You can also provide a publicly accessible URL to the parse operation.

Retrieving outputs

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key="xxxx")
result = doc_ai.get_job(job_id)

You can retrieve the result of the parsing using the job ID returned from the parse endpoint.

Outputs are available only once the job transitions to the successful state.

Submitting a directory for document extraction

from tensorlake.data_loaders import LocalDirectoryLoader
from tensorlake.documentai import DocumentAI

loader = LocalDirectoryLoader("/path/to/files", file_extensions=[".pdf"])
loaded_files = loader.load()

doc_ai = DocumentAI(api_key="xxxx")

file_ids = []
for file in loaded_files:
    file_id = doc_ai.upload(file)
    file_ids.append(file_id)

jobs = []
for file_id in file_ids:
    job_id = doc_ai.parse(file_id)
    jobs.append(job_id)
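
After submitting a directory, the jobs list holds one job ID per file, and each job finishes independently. The sketch below shows one way to collect a final status for every job in the batch. The helper wait_for_all, its parameters, and the "timed_out" label are hypothetical names for illustration and are not part of the Tensorlake SDK; get_status is any callable that maps a job ID to a status string.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_all(
    job_ids: Iterable[str],
    get_status: Callable[[str], str],
    interval: float = 5.0,
    max_rounds: int = 120,
) -> Dict[str, str]:
    """Poll every job until it leaves the in-progress states or max_rounds is reached.

    Hypothetical helper, not part of the Tensorlake SDK.
    Jobs still in progress after max_rounds are reported as "timed_out",
    a label invented for this sketch.
    """
    remaining = list(job_ids)
    final: Dict[str, str] = {}
    for _ in range(max_rounds):
        still_running = []
        for job_id in remaining:
            status = get_status(job_id)
            if status in ("pending", "processing"):
                still_running.append(job_id)
            else:
                final[job_id] = status
        remaining = still_running
        if not remaining:
            break
        time.sleep(interval)
    for job_id in remaining:
        final[job_id] = "timed_out"
    return final
```

With the SDK objects above, this could be called as wait_for_all(jobs, lambda job_id: doc_ai.get_job(job_id).status).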

Listing and deleting outputs

An output is a parsed document generated by the system. You can access the output through the job abstraction.

Listing Jobs

You can list all the jobs in a project using the following API call:

from tensorlake.documentai import DocumentAI
doc_ai = DocumentAI(api_key="xxxx")

jobs_page = doc_ai.jobs()

Deleting generated outputs

The generated data can be removed by deleting the job returned by the system.

from tensorlake.documentai import DocumentAI
doc_ai = DocumentAI(api_key="xxxx")
doc_ai.delete_job(job_id="job-XXX")

Parse API Reference

URL: https://api.tensorlake.ai/documents/v1/parse

Output Modes

Attribute: outputMode

The Document Parsing API supports two output modes:

  • markdown - Parses the document into Markdown, ideal for indexing and other text-based post-processing.
  • json - Parses the document into JSON, which provides more details about the document’s layout and its page elements.

Default: markdown.

Chunking Strategy

Attribute: chunkStrategy

Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.

The supported strategies are:

  • none - The entire parsed output is returned as one entity.
  • page - Output data is separated by each individual page using a dictionary indexed by page number.
  • section - The parsing model tries to detect logical sections and returns outputs separated by section.
  • fragment - The parsing model uses detected fragments (such as TextBox, Table, or Figure) to separate the parsed output.

Default: none.
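
The shapes described for the none, page, and fragment strategies can be pictured with a small local sketch. This is not the API's response format: the Fragment record and the chunk function are hypothetical, and the section strategy is left out because it depends on the parsing model's section detection.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Union


class Fragment(NamedTuple):
    """Hypothetical record for one detected page element."""
    page: int
    kind: str  # e.g. "TextBox", "Table", or "Figure"
    text: str


def chunk(
    fragments: List[Fragment], strategy: str = "none"
) -> Union[str, Dict[int, str], List[str]]:
    """Illustrate the output shape of each strategy on already-parsed fragments."""
    if strategy == "none":
        # One entity holding the entire parsed output.
        return "\n\n".join(f.text for f in fragments)
    if strategy == "page":
        # A dictionary indexed by page number.
        pages: Dict[int, List[str]] = defaultdict(list)
        for f in fragments:
            pages[f.page].append(f.text)
        return {page: "\n\n".join(texts) for page, texts in pages.items()}
    if strategy == "fragment":
        # One chunk per detected fragment.
        return [f.text for f in fragments]
    raise ValueError(f"unsupported strategy in this sketch: {strategy!r}")
```

The same three fragments therefore come back as a single string, a per-page dictionary, or a list with three entries, depending on the strategy.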

Table Parsing Mode

Attribute: tableParsingMode

  • vlm - Uses a VLM to parse the contents of the table. This method is suitable for visually complex tables but is prone to hallucinations.
  • tsr - Uses traditional OCR models for table detection and parsing. It is more accurate than VLM and works well for grid-structured tables.

Default: tsr.

Form Detection Mode

Attribute: formDetectionMode

  • vlm - Uses a VLM to identify questions and answers in a form. It does not provide bounding boxes for the questions and answers and is prone to hallucinations.
  • objectDetection - Uses a layout detector to identify questions and answers. It does not work well with very complex forms.

Default: objectDetection.

Table Output Format

Attribute: tableOutputFormat

  • markdown - Tables generated by the system follow Markdown syntax.
  • html - Tables are generated in HTML using the <table>, <td>, <th>, and <tr> tags.

Default: markdown.
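
When the html format is selected, downstream code often needs the table as rows of cell text. The following is a minimal sketch using only the Python standard library. It assumes a flat table built from the <table>, <tr>, <th>, and <td> tags listed above, with no nested tables or row and column spans; the names TableRows and html_table_to_rows are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                # Tolerate a cell that appears without an enclosing <tr>.
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The resulting rows can be passed to csv.writer or loaded into a dataframe for further processing.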

Table Summarization Prompt

Attribute: tableSummarizationPrompt

A prompt to help the model summarize the table. Table summarization is useful for encoding information about tables for RAG applications.

Default: Summarize the table to explain the contents for retrieval

Figure Summarization Prompt

Attribute: figureSummarizationPrompt

A prompt to help the model extract information from figures (e.g., images, diagrams, etc.). Figure summarization assists with retrieval.

Default: Summarize the figures to explain the contents for retrieval

Webhook Delivery

Attribute: deliverWebhook

You can configure the API to deliver a webhook when the parsing job finishes. Learn more about Webhooks.

Default: false.
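
The attributes and defaults documented in this reference can be gathered into one options dictionary and checked before a request is made. The sketch below does only that. The attribute names, allowed values, and defaults are taken from this page; the helper build_parse_options is hypothetical, and how the options are sent to the endpoint (request body layout, HTTP method, authentication) is not covered here and should be taken from the API Reference.

```python
from typing import Any, Dict

# Defaults as documented in this reference.
DEFAULTS: Dict[str, Any] = {
    "outputMode": "markdown",
    "chunkStrategy": "none",
    "tableParsingMode": "tsr",
    "formDetectionMode": "objectDetection",
    "tableOutputFormat": "markdown",
    "tableSummarizationPrompt": "Summarize the table to explain the contents for retrieval",
    "figureSummarizationPrompt": "Summarize the figures to explain the contents for retrieval",
    "deliverWebhook": False,
}

# Allowed values for the enumerated attributes.
ALLOWED = {
    "outputMode": {"markdown", "json"},
    "chunkStrategy": {"none", "page", "section", "fragment"},
    "tableParsingMode": {"vlm", "tsr"},
    "formDetectionMode": {"vlm", "objectDetection"},
    "tableOutputFormat": {"markdown", "html"},
}


def build_parse_options(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and validate enumerated values.

    Hypothetical helper for illustration; it builds a dictionary and sends nothing.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown attributes: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if options[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {options[name]!r}")
    return options
```

For example, build_parse_options(outputMode="json", chunkStrategy="page") keeps every other attribute at its default, and the result can be serialized with json.dumps.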