The Document Parsing API parses a document and returns:

  • Markdown form of the document, optionally chunked into sections or fragments.
  • JSON form of the document, with more detail about the document's layout:
    • Bounding boxes for each text element, table, figure, and page.
    • A dictionary of the document indexed by page number.
    • The layout type of each text, table, figure, and other element on the pages.
  • Tables encoded as LaTeX, CSV, or Markdown.
  • Table summaries in addition to the raw data.
  • Figures, handled by extracting any text or summarizing non-textual visual content.
  • Supported file types - PDF, JPEG, PNG
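
Because only PDF, JPEG, and PNG files are supported, it can help to check a file's type before uploading it. The following is a minimal sketch of such a check; the helper name content_type_for and the extension table are illustrative assumptions, not part of the API. The MIME types are the standard ones for these formats.

```python
from pathlib import Path

# Extensions for the supported file types (PDF, JPEG, PNG), mapped to their
# standard MIME types. This table is an assumption of the sketch, not an API value.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
}


def content_type_for(path):
    """Return the MIME type for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```

The returned value can be used as the content type in the upload request, in place of a hard-coded 'application/pdf'.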

Quick Start

1. Upload the Document

Upload the document to the API.

import requests

headers = {'Authorization': 'Bearer tl_apiKey_XXX'}
path = '/path/to/finance_stock_report.pdf'
url = "https://api.tensorlake.ai/documents/v1/files"
# Open the file in a with block so the handle is closed after the upload.
with open(path, 'rb') as f:
    files = {'file': ('finance_stock_report.pdf', f, 'application/pdf')}
    response = requests.post(url, headers=headers, files=files)
print(response.json())

This returns a file ID in the filename field (a tensorlake:// URL), which is used to parse the document.

{
    "filename":"tensorlake://ff5e96bf-XXX",
    "message":"file uploaded successfully"
}

2. Parse the Document

Parse the document using the parse_async endpoint.

import requests

url = "https://api.tensorlake.ai/documents/v1/parse_async"
payload = {
    "chunkStrategy": "page",
    "file": "tensorlake://ff5e96bf-XXX",  # file ID returned by the upload step
    "outputMode": "markdown",
    "parseMode": "fast",
    "pages": "1-2",
    "deliverWebhook": True,
}
headers = {
    "Authorization": "Bearer tl_apiKey_XXX",
    "Content-Type": "application/json"
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())

This returns a job ID, which is used to get the result of the parsing.

{
    "jobId":"job-XXX",
    "status":"PROCESSING"
}

3. Get the Result

Get the result from the API.

import requests

# Replace job-XXX with the jobId returned by the parse_async call.
url = "https://api.tensorlake.ai/documents/v1/jobs/job-XXX"
headers = {"Authorization": "Bearer tl_apiKey_XXX"}
response = requests.get(url, headers=headers)
print(response.json())

This returns the result of the parsing.

{
    "jobId":"job-XXX",
    "chunks": ["chunk1","chunk2","chunk3"],
    "status":"SUCCESSFUL"
}

File Upload API

You can upload a file before parsing it. The upload returns a tensorlake:// URL, which you pass to the parse endpoint to parse the document.

Instead of uploading a file, you can also provide a pre-signed S3 URL or a publicly accessible URL to the parse endpoint.

Listing Uploaded Files

You can list all the documents in a project using the following API call:

curl -X GET https://api.tensorlake.ai/documents/v1/files \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "cursor": "xxx", 
    "limit": 10
}'
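
The cursor and limit fields suggest cursor-based pagination. The response shape of the listing call is not shown here, so the sketch below leaves the HTTP call and the response parsing to a caller-supplied function. The names iter_all and fetch_page are hypothetical, and the (items, next_cursor) return convention is an assumption of this sketch rather than a documented contract.

```python
def iter_all(fetch_page, limit=10):
    """Yield every item by following cursors until none is returned.

    fetch_page(cursor, limit) is caller-supplied and must return a pair
    (items, next_cursor), with next_cursor falsy on the last page.
    The first call receives cursor=None.
    """
    cursor = None
    while True:
        items, cursor = fetch_page(cursor, limit)
        yield from items
        if not cursor:
            break
```

Keeping the request inside fetch_page means the same loop could be reused for any listing endpoint that accepts a cursor and a limit.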

Parse API Reference

URL: https://api.tensorlake.ai/documents/v1/parse_async

Output Modes

Attribute: outputMode

The Document Parsing API supports two output modes:

  • markdown - Parses the document into Markdown, ideal for indexing and other text-based post-processing.
  • json - Parses the document into JSON, which adds detail about the document's layout and its page elements.

Chunking Strategy

Attribute: chunkStrategy

Documents are chunked automatically when they are parsed. Each strategy logically divides the parsed output for further processing.

The supported strategies are:

  • None - The default option. The entire parsed output is returned as one entity.
  • Page - Output is split by page and returned as a dictionary indexed by page number.
  • Section - The parsing model tries to detect a logical section and returns outputs separated by section.
  • Fragment - The parsing model uses detected fragments (TextBox, Table or Figure) to separate the parsed output.

Switching between Models

Attribute: parseMode

You can trade off between speed and accuracy while parsing.

  • fast: Uses smaller, faster models for low-latency parsing.
  • accurate: Uses a combination of vision models and VLMs. Slightly slower, but more accurate.

Webhook Delivery

Attribute: deliverWebhook

You can configure the API to deliver a webhook when the parsing job finishes. Learn more about Webhooks.
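
The attributes described above (outputMode, chunkStrategy, parseMode, deliverWebhook) can be assembled and checked before sending a request. The following is a minimal sketch; build_parse_payload is a hypothetical helper, not part of the API. The lowercase spellings for chunkStrategy are an assumption based on the "page" value used in the Quick Start, and the helper's defaults for parseMode and deliverWebhook are choices of this sketch, not documented API defaults.

```python
# Allowed values taken from the attribute descriptions. The lowercase
# chunkStrategy spellings are assumed from the Quick Start's "page" value.
OUTPUT_MODES = {"markdown", "json"}
CHUNK_STRATEGIES = {"none", "page", "section", "fragment"}
PARSE_MODES = {"fast", "accurate"}


def build_parse_payload(file, output_mode="markdown", chunk_strategy="none",
                        parse_mode="fast", pages=None, deliver_webhook=False):
    """Validate the options and return a dict for the parse_async JSON body."""
    checks = [
        ("outputMode", output_mode, OUTPUT_MODES),
        ("chunkStrategy", chunk_strategy, CHUNK_STRATEGIES),
        ("parseMode", parse_mode, PARSE_MODES),
    ]
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    payload = {
        "file": file,
        "outputMode": output_mode,
        "chunkStrategy": chunk_strategy,
        "parseMode": parse_mode,
        "deliverWebhook": deliver_webhook,
    }
    if pages is not None:
        payload["pages"] = pages  # a page range string such as "1-2"
    return payload
```

The returned dictionary can be passed as the json argument of the POST request to the parse_async endpoint. Validating locally catches a misspelled option before a job is submitted.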

Retrieving the Result

You can retrieve the result of the parsing using the jobId returned from the parse_async endpoint.

status: PROCESSING, SUCCESSFUL or FAILED.

chunks: List of chunks returned from the parsing step.
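
Since a job reports PROCESSING until it finishes, a client that does not use webhooks needs to poll for the result. The following is a minimal polling sketch; wait_for_job and its interval and timeout parameters are hypothetical, and fetch_job is a caller-supplied function that performs the GET request and returns the job's JSON as a dictionary.

```python
import time


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job leaves PROCESSING or the timeout elapses.

    Returns the list of chunks on SUCCESSFUL, raises RuntimeError on FAILED,
    and raises TimeoutError if the job is still running at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "SUCCESSFUL":
            return job.get("chunks", [])
        if status == "FAILED":
            raise RuntimeError(f"parsing job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout} seconds")
        sleep(interval)
```

With the requests library, fetch_job could wrap the GET call from the Quick Start and return response.json(). The sleep parameter exists only so the wait can be replaced in tests.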

Listing and Deleting Jobs

Listing Jobs

You can list all the jobs in a project using the following API call:

curl -X GET https://api.tensorlake.ai/documents/v1/jobs \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "cursor": "xxx", 
    "limit": 10
}'

Deleting Jobs

You can delete a job and its associated extracted data once you have downloaded the data.

curl -X DELETE https://api.tensorlake.ai/documents/v1/jobs/job-XXX \
-H "Authorization: Bearer YOUR_API_KEY"
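
Because deleting a job also removes its extracted data, one cautious pattern is to save the result locally first and delete only after the save succeeds. The following is a minimal sketch of that ordering; archive_then_delete is a hypothetical helper, and delete_job is a caller-supplied function that issues the DELETE request shown above. Refusing to delete jobs that are not SUCCESSFUL is a choice of this sketch, not an API rule.

```python
import json
from pathlib import Path


def archive_then_delete(job, out_dir, delete_job):
    """Write the job result to <out_dir>/<jobId>.json, then call delete_job(jobId).

    delete_job runs only after the file has been written, so a failed write
    leaves the job and its data on the server.
    """
    if job.get("status") != "SUCCESSFUL":
        raise ValueError("refusing to delete a job that has not completed successfully")
    out_path = Path(out_dir) / f"{job['jobId']}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(job, indent=2), encoding="utf-8")
    delete_job(job["jobId"])
    return out_path
```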