With the Document Parsing API, you can parse a document with a single API call. The response includes:
  • A markdown version of the document, including:
    • Tables encoded as Markdown or HTML
    • Markdown chunks based on page, section, or fragment
  • A complete document layout, including:
    • Page numbers
    • Bounding boxes for each fragment found in the document (e.g. signature, key-value pair, figure)
  • Page classes based on the page classifications you define
  • Structured data based on the schema you define

Core Document Parsing Workflow

With a file and an api_key, you can parse a document with a single API call. More importantly, you can control how the document is parsed. The complete parse flow has three steps:
1. Call the parse endpoint

The parse endpoint will create a parse job with the following request payload:
  • A file source, which can be:
  • Key options for parsing. See the parse settings below, and get a full list in the API Reference. These options are not required; the API uses default settings for any that are missing.
  • page_range: The range of pages to parse, in the format 1-2 or 1,3,5. If not specified, all pages will be parsed.
  • labels: Additional metadata to identify the parse request. The labels are returned along with the parse response.
The endpoint will return:
  • parse_id: The unique ID Tensorlake uses to reference the specific parsing job. Use this ID to get the output once the parsing job is completed and to revisit previously used settings.
  • created_at: The date and time when the parse job was created in RFC 3339 format.
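
The two page_range formats described above (1-2 and 1,3,5) can be checked on the client before a request is sent. The following is a minimal sketch rather than part of the Tensorlake SDK: expand_page_range is a hypothetical helper, and it assumes that pages are 1-indexed and that ranges and comma-separated values may be mixed in one string.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand a page_range string such as "1-2" or "1,3,5" into page numbers.

    Hypothetical helper, not part of the Tensorlake SDK. Assumes 1-indexed
    pages and allows ranges and single pages to be mixed ("1-2,5").
    """
    pages: set[int] = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            # A segment like "3-6" covers every page from 3 to 6 inclusive.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            # A segment like "5" is a single page.
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range segment: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Calling expand_page_range on a malformed string raises ValueError, which makes it easy to reject a bad page_range before a parse job is created.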

2. Query the status of the parsing job

The /parse/{parse_id} endpoint will return:
  • status: The status of the parsing job. This can be failure, pending, processing, or successful.
  • If the parsing job is pending or processing, you should wait a few seconds and then check again by re-calling the endpoint.
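
The wait-and-check-again loop described above can be sketched as a small polling function. This is an illustration under stated assumptions: wait_for_parse and its parameters are hypothetical names, and fetch_status stands for whatever function you write to call /parse/{parse_id} and return the status string. Only the four status values come from the list above.

```python
import time
from typing import Callable

# Status values listed for the /parse/{parse_id} endpoint.
TERMINAL_STATUSES = {"successful", "failure"}
IN_PROGRESS_STATUSES = {"pending", "processing"}


def wait_for_parse(
    fetch_status: Callable[[str], str],
    parse_id: str,
    interval_seconds: float = 3.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, then return that status.

    fetch_status is supplied by the caller (hypothetical here); it should
    query /parse/{parse_id} and return the status string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(parse_id)
        if status in TERMINAL_STATUSES:
            return status
        if status not in IN_PROGRESS_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"parse job {parse_id} is still {status}")
        # Wait a few seconds before checking again.
        sleep(interval_seconds)
```

Passing sleep as a parameter keeps the loop easy to test without real delays; in normal use the default time.sleep is enough.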

3. Retrieve the parsed result

When the parsing job is successful, you can retrieve the parsed result by calling the /parse/{parse_id} endpoint. The response payload will include a Response object:
  • finished_at: The date and time when the parse job was finished in RFC 3339 format.
  • chunks: An array of objects that contain a chunk number (specified by the chunk strategy) and the markdown content for that chunk.
  • document_layout: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element (text, tables, figures, signatures), and reading order.
  • page_classes: A map whose keys are the page class names provided in the parse request. The field is empty if no page classes were extracted from the document, or if page_classifications were not provided in the parse request.
  • structured_data: An array of objects, where each object contains the structured data extracted from the document based on the schema provided in the parse request. The field is empty if no structured data was extracted from the document, or if structured_extraction_options were not provided in the parse request.
  • labels: Labels associated with the parse job.
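
Because page_classes and structured_data can be empty or absent, code that reads the response benefits from defaulting them. The sketch below assumes the response has been decoded into a plain Python dict; that shape, the labels value being a dict, and the helper names summarize_result and parse_rfc3339 are assumptions for illustration. Only the field names come from the list above.

```python
from datetime import datetime
from typing import Any


def parse_rfc3339(timestamp: str) -> datetime:
    """Parse an RFC 3339 timestamp such as "2025-01-15T10:30:00Z".

    A trailing "Z" is rewritten to "+00:00" so that datetime.fromisoformat
    also accepts it on Python versions before 3.11.
    """
    if timestamp.endswith(("Z", "z")):
        timestamp = timestamp[:-1] + "+00:00"
    return datetime.fromisoformat(timestamp)


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Summarize a parse response decoded into a dict (assumed shape)."""
    # These two fields may be missing or empty, so fall back to empty values.
    page_classes = result.get("page_classes") or {}
    structured_data = result.get("structured_data") or []
    return {
        "finished_at": parse_rfc3339(result["finished_at"]),
        "chunk_count": len(result.get("chunks") or []),
        "page_class_names": sorted(page_classes),
        "has_structured_data": bool(structured_data),
        "labels": result.get("labels") or {},
    }
```

A response with no page classifications or structured extraction then produces an empty page_class_names list and has_structured_data set to False, instead of raising a KeyError.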

Options for Parsing Documents

These are the main object properties you can include in your parse request payload to customize the parsing behavior:
| Parameter | Description |
| --- | --- |
| parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See Parsing Options. |
| enrichment_options | Summarize tables and figures present in the document. See Summarization. |
| page_classifications | Defines settings for page classification. When present, the API will perform page classification on the document. See Page Classifications. |
| structured_extraction_options | Configuration for structured data extraction. When present, the API will perform structured data extraction based on provided schemas. See Structured Extraction Options. |
Get a full list of the configuration setting options on the /parse section of the API reference.
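
Since page classification and structured extraction run only when their options are present, a request builder can simply leave out anything that was not set. The following is a rough sketch, not the documented request schema: build_parse_request is a hypothetical helper, the file_id key is an assumption borrowed from the SDK example's variable name, and the value types of each option object are placeholders.

```python
from typing import Any, Optional


def build_parse_request(
    file_id: str,
    page_range: Optional[str] = None,
    labels: Optional[dict] = None,
    parsing_options: Optional[dict] = None,
    enrichment_options: Optional[dict] = None,
    page_classifications: Optional[list] = None,
    structured_extraction_options: Optional[list] = None,
) -> dict[str, Any]:
    """Assemble a parse request payload, omitting every option left unset.

    The "file_id" key is an assumption for illustration; check the API
    reference for the actual file source fields.
    """
    optional_fields = {
        "page_range": page_range,
        "labels": labels,
        "parsing_options": parsing_options,
        "enrichment_options": enrichment_options,
        "page_classifications": page_classifications,
        "structured_extraction_options": structured_extraction_options,
    }
    payload: dict[str, Any] = {"file_id": file_id}
    # Only include options the caller provided, so the API applies its defaults.
    payload.update({k: v for k, v in optional_fields.items() if v is not None})
    return payload
```

Omitting unset keys, rather than sending them as null, keeps the payload small and leaves default handling to the API.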

Parsing Options

Parsing Options include:
| Parameter | Description | Default Value |
| --- | --- | --- |
| chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
| disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
| remove_strikethrough_lines | Enable detection and removal of strikethrough text. | false |
| signature_detection | Enable detection of signatures in the document. See Signature Detection. | false |
| skew_detection | Detect and correct skewed or rotated pages. Note that this can increase the processing time. | false |
| table_output_mode | Choose between Markdown and HTML. | HTML |
| table_parsing_format | Selects how tables are parsed. See the API reference for the available values. | TSR |
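
To see how defaults and overrides combine, the parsing options can be modeled as a dict of defaults that user-supplied values are merged over. This is a sketch only: resolve_parsing_options is a hypothetical helper, the API applies its own defaults server-side, and the lowercase string spellings used for table_output_mode and table_parsing_format are assumptions. The option names and default values are taken from the table above.

```python
from typing import Any, Optional

# Defaults as listed in the Parsing Options table. The string spellings
# "html" and "tsr" are assumptions made for this illustration.
DEFAULT_PARSING_OPTIONS: dict[str, Any] = {
    "chunking_strategy": None,  # None means no chunking
    "disable_layout_detection": False,
    "remove_strikethrough_lines": False,
    "signature_detection": False,
    "skew_detection": False,
    "table_output_mode": "html",
    "table_parsing_format": "tsr",
}


def resolve_parsing_options(overrides: Optional[dict] = None) -> dict[str, Any]:
    """Merge user overrides over the defaults and reject unknown option names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_PARSING_OPTIONS)
    if unknown:
        raise KeyError(f"unknown parsing options: {sorted(unknown)}")
    # Later keys win, so overrides replace the matching defaults.
    return {**DEFAULT_PARSING_OPTIONS, **overrides}
```

Rejecting unknown names catches typos such as skew_detect early, before they are silently ignored.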

Parse a document with the SDK or API

Calling the parse endpoint creates a new document parsing job, which starts in the pending state. The job transitions to the processing state and then to the successful state once the document has been parsed.
If you are using the Python SDK, all the configuration options described above are expressed through the ParsingOptions class.
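from tensorlake.documentai import (
  DocumentAI,
  ParsingOptions,
  ChunkingStrategy,
  TableOutputMode,
  TableParsingFormat,
)

# Create a client with your API key.
doc_ai = DocumentAI(api_key="xxxx")
# ID of the file to parse.
file_id = "tensorlake-xxxx"

# Chunk the output by fragment and encode tables as Markdown.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.FRAGMENT,
    table_output_mode=TableOutputMode.MARKDOWN
)

# Start a parse job for pages 1-2; the returned ID is used to fetch the result.
parse_id = doc_ai.parse(file_id, page_range="1-2", parsing_options=parsing_options)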

Retrieve Output from the Parsing Job

The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_parsed_result SDK function.
result = doc_ai.get_parsed_result(parse_id)

Using Output from Parse Jobs

Leveraging the markdown chunks is a common next step after parsing documents.
markdown_chunks = result.chunks

for chunk_number, chunk in enumerate(markdown_chunks):
    print(f"## CHUNK NUMBER {chunk_number}\n\n")
    print(f"## Page {chunk.page_number}\n\n{chunk.content}\n\n")
See Parse Output for more details about the output.
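
When chunks are produced at the fragment or section level, it is often handy to regroup them by page before indexing or display. The sketch below is illustrative only: Chunk is a stand-in dataclass that mirrors just the page_number and content attributes used in the loop above, and chunks_by_page is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Chunk:
    """Stand-in for an SDK chunk; only the two attributes used above."""
    page_number: int
    content: str


def chunks_by_page(chunks: Iterable[Chunk]) -> dict[int, str]:
    """Join the markdown of all chunks on the same page, ordered by page."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.page_number].append(chunk.content)
    # Separate chunks with a blank line so the markdown blocks stay distinct.
    return {page: "\n\n".join(parts) for page, parts in sorted(grouped.items())}
```

With real results, passing result.chunks to chunks_by_page should work as long as each chunk exposes page_number and content, as the loop above suggests.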

Explore Advanced Capabilities

This page covered the basic reading capabilities of the Document Parsing API. You can also use Document Ingestion for more advanced parsing. Explore these options: