With the Document Parsing API, you can convert documents to Markdown and get additional layout information for use in document pre-processing workflows. The output includes:
  • A Markdown version of the document:
    • Page elements are ordered by their natural reading order
    • Section header levels are detected and preserved in the Markdown
    • Tables are encoded as Markdown or HTML
    • Figures and tables are optionally summarized
  • Bounding boxes for each page element found in the document (e.g. signature, key-value pair, figure)
Read the Overview to understand how to integrate Document Parsing into your existing workflows.

Parse a document with the SDK or API

Calling the parse endpoint creates a new document parsing job, which starts in the pending state. The job moves to the processing state and then, once the document has been parsed, to the successful state.
If you are using the Python SDK, the parsing options described below are expressed through the ParsingOptions class.
from tensorlake.documentai import (
    ChunkingStrategy,
    DocumentAI,
    ParsingOptions,
    TableOutputMode,
)

# Replace with your API key and the ID of the file to parse.
doc_ai = DocumentAI(api_key="xxxx")
file_id = "tensorlake-xxxx"

# Chunk the output by fragment and emit tables as Markdown.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.FRAGMENT,
    table_output_mode=TableOutputMode.MARKDOWN,
)

# Start a parsing job for pages 1-2; the returned ID is used to fetch the result.
parse_id = doc_ai.parse(file_id, page_range="1-2", parsing_options=parsing_options)

Retrieve Output

The parsed document output can be retrieved from the /parse/{parse_id} endpoint or with the SDK's get_parsed_result function.
result = doc_ai.get_parsed_result(parse_id)
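
Because a parsing job moves through the pending and processing states before it reaches successful, callers usually need to wait before reading the output. The following is a minimal polling sketch, not part of the SDK: `wait_for_job` and `fetch_status` are hypothetical names, the status strings mirror the states named earlier on this page, and `"failed"` is an assumed terminal state that this page does not mention.

```python
import time
from typing import Callable

# "successful" comes from the job lifecycle described on this page;
# "failed" is an assumed terminal state, not confirmed by the page.
TERMINAL_STATES = {"successful", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

In practice `fetch_status` would wrap whatever call returns the job's current state, for example a request to the /parse/{parse_id} endpoint. How the status is exposed in the response is not covered here, so check the API reference before relying on a specific field name.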

Markdown Chunks

Working with the Markdown chunks is a common next step after parsing a document.
markdown_chunks = result.chunks

for chunk in markdown_chunks:
    print(f"## Page Number: {chunk.page_number}\n")
    print(f"## Content: {chunk.content}\n")
See Parse Output for more details about the output.
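
As one example of such a next step, the chunks can be regrouped into one Markdown string per page. This is a sketch under assumptions: `Chunk` is a stand-in dataclass that models only the `page_number` and `content` fields used in the loop above, and `chunks_by_page` is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for an SDK chunk; only the two fields used above are modelled."""
    page_number: int
    content: str


def chunks_by_page(chunks):
    """Return {page_number: markdown}, joining each page's chunks in their given order."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.page_number].append(chunk.content)
    # Sort by page number so the result iterates in page order.
    return {page: "\n\n".join(parts) for page, parts in sorted(pages.items())}


sample = [
    Chunk(1, "# Title"),
    Chunk(1, "Intro paragraph."),
    Chunk(2, "| a | b |"),
]
pages = chunks_by_page(sample)
```

With real output, passing `result.chunks` in place of `sample` should work as long as each chunk exposes the same two attributes.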

Options for Parsing Documents

These are the main object properties you can include in your parse request payload to customize the parsing behavior:
| Parameter | Description |
| --- | --- |
| parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See Parsing Options. |
| enrichment_options | Summarizes tables and figures present in the document. See Summarization. |
See the /parse section of the API reference for the full list of configuration options.
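
To illustrate how these two properties might sit in a request payload, here is a rough sketch. Only the `parsing_options` and `enrichment_options` property names come from this page. The `file_id` key, the `build_parse_payload` helper, and the lowercase option values are assumptions made for illustration, so confirm the exact schema in the API reference.

```python
import json


def build_parse_payload(file_id, parsing_options=None, enrichment_options=None):
    """Assemble a parse request body, leaving out option groups that were not set."""
    payload = {"file_id": file_id}  # key name is an assumption, not from this page
    if parsing_options is not None:
        payload["parsing_options"] = parsing_options
    if enrichment_options is not None:
        payload["enrichment_options"] = enrichment_options
    return payload


payload = build_parse_payload(
    "tensorlake-xxxx",
    # Value spellings below are illustrative; the API may expect different ones.
    parsing_options={"chunking_strategy": "fragment", "table_output_mode": "markdown"},
)
print(json.dumps(payload, indent=2))
```

Omitting an option group entirely, instead of sending an empty object, leaves the service free to apply its own defaults.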

Parsing Options

Parsing Options include:
| Parameter | Description | Default Value |
| --- | --- | --- |
| chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
| table_output_mode | Choose between Markdown and HTML. | HTML |
| table_parsing_format | Choose the table parsing format. | TSR |
| disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
| skew_detection | Detect and correct skewed or rotated pages. This can increase processing time. | false |
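
The defaults listed above can be captured in a small helper that fills in any option the caller leaves out and rejects misspelled names. This is a sketch only: `DEFAULT_PARSING_OPTIONS` and `resolve_parsing_options` are hypothetical names, and the string spellings of the defaults are assumptions. With the SDK, the ParsingOptions class plays this role.

```python
# Defaults copied from the Parsing Options table; string spellings are assumed.
DEFAULT_PARSING_OPTIONS = {
    "chunking_strategy": None,  # no chunking
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
    "disable_layout_detection": False,
    "skew_detection": False,
}


def resolve_parsing_options(**overrides):
    """Merge caller overrides onto the documented defaults."""
    unknown = set(overrides) - set(DEFAULT_PARSING_OPTIONS)
    if unknown:
        raise ValueError(f"unknown parsing options: {sorted(unknown)}")
    return {**DEFAULT_PARSING_OPTIONS, **overrides}


options = resolve_parsing_options(skew_detection=True)
```

Rejecting unknown names up front catches typos such as `skew_detect` that might otherwise be ignored without any error.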

Explore Advanced Capabilities

This page covered the basic reading capabilities of the Document Parsing API. For more advanced parsing, you can also use Document Ingestion. Explore these options: