The Document Parsing API parses a document and returns:

  • A Markdown form of the document, optionally chunked into sections or fragments.
  • A JSON form of the document, which carries more detail about the document's layout:
    • Bounding boxes for each text block, table, figure and page.
    • A dictionary of the document indexed by page number.
    • The layout type of each text, table, figure or other element on a page.
  • Tables, encoded as LaTeX, CSV or Markdown.
  • Table summaries, in addition to the raw data.
  • Figures, handled by extracting any text they contain or by summarizing non-textual visual content.

Supported file types: PDF, JPEG and PNG.

Quick Start

1. Upload the Document

Upload the document to the API.

The response contains a file ID (the tensorlake:// URL in the filename field), which you use to parse the document.

{
    "filename":"tensorlake://ff5e96bf-XXX",
    "message":"file uploaded successfully"
}

2. Parse the Document

Parse the document using the parse_async endpoint.

This returns a job ID, which is used to get the result of the parsing.

{
    "jobId":"job-XXX",
    "status":"PROCESSING"
}

3. Get the Result

Request the result from the API using the job ID.

Once the status is SUCCESSFUL, the response contains the chunks produced by the parsing step.

{
    "jobId":"job-XXX",
    "chunks": ["chunk1","chunk2","chunk3"],
    "status":"SUCCESSFUL"
}

File Upload API

You can upload a file before parsing it. The upload response returns a tensorlake:// URL for the file, which you can pass to the parse endpoint.

Instead of uploading, you can also give the parse endpoint a pre-signed S3 URL or a publicly accessible URL.

Listing Uploaded Files

You can list all the documents in a project using the following API call:

curl -X GET https://api.tensorlake.ai/documents/v1/files \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "cursor": "xxx",
    "limit": 10
}'
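
The same request can be sketched in Python. This is a minimal sketch that only builds the request and does not send it; the function name build_list_files_request is hypothetical, and leaving the cursor out for the first page is an assumption rather than documented behavior.

```python
import json
import urllib.request

BASE_URL = "https://api.tensorlake.ai/documents/v1"


def build_list_files_request(api_key, cursor=None, limit=10):
    """Build (but do not send) the GET request that lists uploaded files."""
    body = {"limit": limit}
    if cursor is not None:
        # Assumption: the cursor is simply omitted when requesting the first page.
        body["cursor"] = cursor
    return urllib.request.Request(
        url=f"{BASE_URL}/files",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="GET",
    )


request = build_list_files_request("YOUR_API_KEY", cursor="xxx", limit=10)
print(request.get_method(), request.full_url)
```

Passing the request object to urllib.request.urlopen would send it; building it separately keeps the example runnable without network access or a real API key.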

Parse API Reference

URL: https://api.tensorlake.ai/documents/v1/parse_async

Output Modes

Attribute: outputMode

The Document Parsing API supports two output modes:

  • markdown - Parses the document into Markdown, ideal for indexing and other text-based post-processing.
  • json - Parses the document into JSON, which adds detail about the document's layout and its page elements.

Chunking Strategy

Attribute: chunkStrategy

Documents are chunked automatically when they are parsed, according to the selected strategy. Each strategy divides the parsed output into logical units for further processing.

The supported strategies are:

  • None - This is the default option; the entire parsed output is returned as one entity.
  • Page - The output is split by page and returned as a dictionary indexed by page number.
  • Section - The parsing model tries to detect logical sections and returns the output separated by section.
  • Fragment - The parsing model uses the detected fragments (TextBox, Table or Figure) to separate the parsed output.
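
To make the difference between the strategies concrete, here is a toy illustration in Python. It runs locally on a hypothetical list of (page, type, text) fragments and only mimics how the None, Page and Fragment strategies are described above. It is not the API's implementation, and the real response shape may differ. Section is left out because it depends on what the parsing model detects.

```python
from collections import defaultdict

# Hypothetical fragments: (page number, fragment type, text). Not real API output.
fragments = [
    (1, "TextBox", "Introduction"),
    (1, "Table", "| a | b |"),
    (2, "Figure", "Chart of revenue by quarter"),
]


def chunk(fragments, strategy="None"):
    """Group fragments the way the None, Page and Fragment strategies are described."""
    texts = [text for _, _, text in fragments]
    if strategy == "None":
        # One entity holding the whole parsed output.
        return ["\n".join(texts)]
    if strategy == "Fragment":
        # One chunk per detected fragment.
        return texts
    if strategy == "Page":
        # Dictionary indexed by page number.
        pages = defaultdict(list)
        for page, _, text in fragments:
            pages[page].append(text)
        return {page: "\n".join(parts) for page, parts in pages.items()}
    raise ValueError(f"unsupported strategy: {strategy}")


print(chunk(fragments, "Page"))
```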

Switching between Models

Attribute: parseMode

You can trade speed for accuracy when parsing.

  • fast: Uses smaller, faster models for low-latency parsing.
  • accurate: Slightly slower but more accurate; uses a combination of vision models and VLMs (vision-language models).

Webhook Delivery

Attribute: deliverWebhook

You can configure the API to deliver a webhook when the parsing job finishes. Learn more about Webhooks.
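
The attributes above can be combined into a single request body for parse_async. The sketch below builds and validates such a body in Python. The attribute names outputMode, chunkStrategy, parseMode and deliverWebhook come from this page. The "file" key, the boolean type of deliverWebhook, and the exact casing of the chunk strategy values are assumptions made for illustration.

```python
OUTPUT_MODES = ("markdown", "json")
CHUNK_STRATEGIES = ("None", "Page", "Section", "Fragment")
PARSE_MODES = ("fast", "accurate")


def _check(name, value, allowed):
    """Return value if it is one of the allowed options, else raise ValueError."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    return value


def build_parse_payload(file, *, output_mode, parse_mode,
                        chunk_strategy="None", deliver_webhook=False):
    """Build a request body for parse_async (hypothetical helper)."""
    return {
        # Assumption: the key holding the tensorlake:// or https:// URL is "file".
        "file": file,
        "outputMode": _check("outputMode", output_mode, OUTPUT_MODES),
        "chunkStrategy": _check("chunkStrategy", chunk_strategy, CHUNK_STRATEGIES),
        "parseMode": _check("parseMode", parse_mode, PARSE_MODES),
        # Assumption: deliverWebhook is a boolean flag.
        "deliverWebhook": bool(deliver_webhook),
    }


payload = build_parse_payload(
    "tensorlake://ff5e96bf-XXX",
    output_mode="markdown",
    parse_mode="accurate",
    chunk_strategy="Section",
)
print(payload)
```

Validating the options client-side surfaces a mistyped value before a job is submitted.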

Retrieving the Result

You can retrieve the result of the parsing using the jobId returned from the parse_async endpoint. The response includes the following fields:

  • status - One of PROCESSING, SUCCESSFUL or FAILED.
  • chunks - The list of chunks produced by the parsing step.
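
Because parsing is asynchronous, a client typically polls until the status leaves PROCESSING. The following is a minimal polling sketch in Python. The endpoint for fetching a result is not given on this page, so the HTTP call is left to a caller-supplied fetch_result function; that function, wait_for_job, and the interval and timeout values are all hypothetical.

```python
import time


def wait_for_job(fetch_result, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the job leaves PROCESSING, then return its chunks.

    fetch_result is any callable that takes a job ID and returns the
    result as a dict with a "status" key (and "chunks" on success).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result(job_id)
        status = result["status"]
        if status == "SUCCESSFUL":
            return result["chunks"]
        if status == "FAILED":
            raise RuntimeError(f"parsing job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        sleep(interval)


# Usage with canned responses standing in for real API calls.
responses = iter([
    {"jobId": "job-XXX", "status": "PROCESSING"},
    {"jobId": "job-XXX", "status": "SUCCESSFUL", "chunks": ["chunk1", "chunk2"]},
])
print(wait_for_job(lambda job_id: next(responses), "job-XXX", sleep=lambda s: None))
```

If a webhook is configured, polling can be skipped and the result fetched once the webhook arrives.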

Listing and Deleting Jobs

Listing Jobs

You can list all the jobs in a project using the following API call:

curl -X GET https://api.tensorlake.ai/documents/v1/jobs \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "cursor": "xxx",
    "limit": 10
}'

Deleting Jobs

You can delete a job and its associated extracted data once you have downloaded the data.

curl -X DELETE https://api.tensorlake.ai/documents/v1/jobs/job-XXX \
-H "Authorization: Bearer YOUR_API_KEY"
