With the Document Parsing API, you can parse a document with a single API call and always receive the following in return:

  • A markdown version of the document, including:
    • Tables encoded as Markdown or HTML
    • [Optional] Markdown chunks based on page, section, or fragment
  • A complete document layout, including:
    • Page numbers
    • Bounding boxes for each fragment found in the document (e.g. signature, key-value pair, figure)
  • [Optional] Page classes based on the page classifications you define
  • [Optional] Structured data based on the schema you define

Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.). Tensorlake uses its own models to parse the document.

Core Document Parsing Workflow

With a file and an api_key, you can quickly parse a document with a single API call. More importantly, you can control how the document is parsed.

1. Call the `/parse` endpoint

The /parse endpoint will create a parse job with the following request payload:

  • file: Either the file_id returned from uploading a file to Tensorlake Cloud, or a pre-signed URL or any other HTTP URL that can be used to download the file.
  • Key options for parsing. See the parse settings below, and get the full list in the API Reference. These options are not required; the API uses default settings for any that are not present.
  • page_range: The range of pages to parse, in the format 1-2 or 1,3,5. If not specified, all pages will be parsed.
  • labels: Additional metadata to identify the parse request. The labels are returned along with the parse response.
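
The request fields listed above can be assembled as a plain dictionary before sending. The following is a minimal sketch, not the official client: `build_parse_payload` and `format_page_range` are hypothetical helper names, and the assumption that options are nested under a `parsing_options` key follows the settings described on this page rather than a confirmed wire format.

```python
from typing import Any, Dict, Iterable, Optional


def format_page_range(pages: Iterable[int]) -> str:
    """Render page numbers in the comma-separated form, e.g. "1,3,5"."""
    unique = sorted(set(pages))
    if not unique or unique[0] < 1:
        raise ValueError("page numbers must be positive integers")
    return ",".join(str(p) for p in unique)


def build_parse_payload(
    file: str,
    page_range: Optional[str] = None,
    labels: Optional[Dict[str, str]] = None,
    parsing_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the request body; optional fields are left out when unset."""
    payload: Dict[str, Any] = {"file": file}
    if page_range is not None:
        payload["page_range"] = page_range
    if labels:
        payload["labels"] = labels
    if parsing_options:
        # Assumption: options are nested under this key in the request body.
        payload["parsing_options"] = parsing_options
    return payload


payload = build_parse_payload(
    "tensorlake-xxxx",
    page_range=format_page_range([5, 1, 3]),
    labels={"source": "invoices"},
)
print(payload)
```

Leaving unset fields out of the payload lets the API apply its defaults, as described above.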

The endpoint will return:

  • parse_id: The unique ID Tensorlake uses to reference the specific parsing job. This ID can be used to get the output when the parsing job is completed and re-visit previously used settings.
2. Query the status using the `/parse/{parse_id}` endpoint

The /parse/{parse_id} endpoint will return:

  • status: The status of the parsing job. This can be FAILURE, PENDING, PROCESSING, or SUCCESSFUL.
  • If the parsing job is PENDING or PROCESSING, you should wait a few seconds and then check again by re-calling the endpoint.
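
The wait-and-recheck behavior described above can be sketched as a small polling loop. This is an illustration under stated assumptions: `wait_for_parse` is a hypothetical helper, and `fetch_status` stands in for whatever call you use to read the status from `/parse/{parse_id}`. The default interval and timeout are arbitrary choices, not documented values.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"SUCCESSFUL", "FAILURE"}
IN_PROGRESS_STATUSES = {"PENDING", "PROCESSING"}


def wait_for_parse(
    fetch_status: Callable[[str], str],
    parse_id: str,
    interval_seconds: float = 3.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, then return that status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(parse_id)
        if status in TERMINAL_STATUSES:
            return status
        if status not in IN_PROGRESS_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"parse job {parse_id} did not finish in time")
        sleep(interval_seconds)


# Stand-in for a real status lookup: a fake that finishes on the third call.
_responses = iter(["PENDING", "PROCESSING", "SUCCESSFUL"])
final_status = wait_for_parse(
    lambda _id: next(_responses), "parse-xxxx", sleep=lambda s: None
)
print(final_status)
```

Passing the status lookup and the sleep function as arguments keeps the loop independent of any particular HTTP client and makes it easy to test.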
3. Retrieve the output using the `/parse/{parse_id}` endpoint

If the /parse/{parse_id} endpoint returns a SUCCESSFUL status, the response payload will include a Response object:

  • chunks: An array of objects that contain a chunk number (specified by the chunk strategy) and the markdown content for that chunk.
  • document_layout: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element (text, tables, figures, signatures), and reading order.
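  • page_classes: A map where the keys are the page class names provided in the parse request, and the values are PageClass objects containing the class name and the page numbers where that class appears.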
  • structured_data: The structured data is a map where the keys are the names of the json schema provided in the parse request, and the values are StructuredData objects.
  • options: The options used for scheduling the parse job.
  • labels: Labels associated with the parse job.

Explore main configuration options

These are the main object properties you can include in your parse request payload to customize the parsing behavior:

| Parameter | Description |
| --- | --- |
| enrichment_options | Summarize tables and figures present in the document. |
| parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See parsing options below. |
| page_classifications | Defines settings for page classification. When present, the API will perform page classification on the document. |
| structured_extraction_options | Configuration for structured data extraction. When present, the API will perform structured data extraction based on the provided schemas. |

parsing_options

| Setting | Options | Default Value |
| --- | --- | --- |
| chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
| disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
| remove_strikethrough_lines | Enable detection and removal of strikethrough text. | false |
| signature_detection | Enable detection of signatures in the document. | false |
| skew_detection | Detect and correct skewed or rotated pages. Note that this can increase the processing time. | false |
| table_output_mode | Choose between Markdown or HTML. | HTML |
| table_parsing_format | Choose the format used to parse tables. See the API reference for the available values. | TSR |

Get a full list of the configuration setting options on the /parse section of the API reference.
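
One way to work with these settings is to start from the documented defaults and override only what you need. The sketch below is an assumption-heavy illustration: `merge_parsing_options` is a hypothetical helper, the setting names come from the table above, and the value spellings (for example `"fragment"`) are placeholders rather than confirmed API values. Only the settings whose defaults are stated unambiguously get a default here; the table settings are passed through only when you provide them.

```python
from typing import Any, Dict

# Setting names as listed in the parsing_options table.
KNOWN_SETTINGS = {
    "chunking_strategy",
    "disable_layout_detection",
    "remove_strikethrough_lines",
    "signature_detection",
    "skew_detection",
    "table_output_mode",
    "table_parsing_format",
}

# Defaults taken from the table; table settings are intentionally omitted.
DEFAULTS: Dict[str, Any] = {
    "chunking_strategy": None,
    "disable_layout_detection": False,
    "remove_strikethrough_lines": False,
    "signature_detection": False,
    "skew_detection": False,
}


def merge_parsing_options(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return the defaults with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - KNOWN_SETTINGS
    if unknown:
        raise KeyError(f"unknown parsing option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Value spellings here are assumptions, not confirmed API values.
options = merge_parsing_options(
    {"chunking_strategy": "fragment", "signature_detection": True}
)
print(options)
```

Rejecting unknown names early catches typos such as `signature_detect` before a request is sent.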

Use the /parse API

Calling the /parse endpoint creates a new document parsing job, which starts in the pending state. The job transitions to the processing state and then, once the document is parsed successfully, to the successful state.

If you are using the Python SDK, all the configuration options described above are expressed through the ParsingOptions class.

from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import ParsingOptions
from tensorlake.documentai.models.enums import (
    ChunkingStrategy,
    TableOutputMode,
    TableParsingFormat,
)

doc_ai = DocumentAI(api_key="xxxx")
file_id = "tensorlake-xxxx"

parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.FRAGMENT,
    table_output_mode=TableOutputMode.MARKDOWN,
    table_parsing_format=TableParsingFormat.TSR,
)

parse_id = doc_ai.parse(file_id, page_range="1-2", parsing_options=parsing_options)

Retrieve Output from the Parsing Job

The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_parse SDK function.

parse = doc_ai.get_parse(parse_id)

The response is a JSON object if you are using the REST API, and a ParseResult object if you are using the Python SDK.

class ParseResult(BaseModel):
    # Parsed document specific fields
    chunks: Optional[List[Chunk]] = Field(
        default=None,
        description="Chunks of layout text extracted from the document. This is a vector of `Chunk` objects, each containing a piece of text extracted from the document. The chunks are typically used for further processing, such as indexing or searching. The value will vary depending on the chunking strategy used during parsing.",
    )
    document_layout: Optional[Document] = Field(
        default=None,
        description="The layout of the document. This is a JSON object that contains the layout information of the document. It can be used to understand the structure of the document, such as the position of text, tables, figures, etc.",
    )
    page_classes: Optional[Dict[str, PageClass]] = Field(
        default=None,
        description="Page classes extracted from the document. This is a map where the keys are page class names provided in the parse request under the `page_classification_options` field, and the values are PageClass objects containing the class name and page numbers where each page class appears.",
    )
    structured_data: Optional[
        Dict[str, Union[StructuredData, List[StructuredData]]]
    ] = Field(
        default=None,
        description="Structured data extracted from the document. The structured data is a map where the keys are the names of the json schema provided in the parse request, and the values are `StructuredData` objects containing the structured data extracted from the document; formatted according to the schema. When the `structured_extraction` option uses a `chunking_strategy` of `None`, the structured data will be extracted from the entire document, and it will be represented as a single entry in the map with the schema name as the key. When the `structured_extraction` option uses a `chunking_strategy`, the structured data will be extracted from each chunk of text, and it will be represented as multiple entries in the map, with the schema name as the key and a vector of `StructuredData` objects as the value. This is used to extract structured information from the document, such as tables, forms, or other structured content.",
    )

    # ParseResult specific fields
    parse_id: str = Field(description="The unique identifier for the parse job")
    parsed_pages_count: int = Field(
        description="The number of pages that were parsed successfully.", ge=0
    )
    status: ParseStatus = Field(description="The status of the parse job.")
    created_at: str = Field(
        description="The date and time when the parse job was created in RFC 3339 format."
    )
    options: ParseRequestOptions = Field(
        description="The options used for scheduling the parse job."
    )

    # Optional fields
    errors: Optional[dict] = Field(
        None, description="Error occurred during any part of the parse execution."
    )
    finished_at: Optional[str] = Field(
        None,
        description="The date and time when the parse job was finished in RFC 3339 format.",
    )
    labels: Optional[dict] = Field(
        None, description="Labels associated with the parse job."
    )
    tasks_completed_count: Optional[int] = Field(
        None,
        description="The number of tasks that have been completed for the parse job.",
        ge=0,
    )
    tasks_total_count: Optional[int] = Field(
        None,
        description="The total number of tasks that are expected to be completed for the parse job.",
        ge=0,
    )

Understand the Parsing Output

The response contains the following fields, which together describe the parsed document:

  • chunks: An array of objects that contain the markdown content for each chunk. The number of chunks depends on the chunking strategy you chose. See more below.
  • document_layout: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element, and reading order. See more below.
  • page_classes: A map where the keys are page class names provided in the parse request, and the values are PageClass objects containing class names and page numbers where each page class appears.
  • structured_data: A map where the keys are the names of the JSON schema provided in the parse request, and the values are StructuredData objects containing the extracted structured data.
  • parse_id: The unique identifier for the parse job.
  • parsed_pages_count: An integer representing the number of pages that were parsed successfully.
  • status: The status of the parse job.
  • created_at: The date and time when the parse job was created in RFC 3339 format.
  • options: The options used for scheduling the parse job.
  • errors: Any errors encountered while parsing the document.
  • labels: Labels associated with the parse job.
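
According to the `ParseResult` type annotations, each value in `structured_data` may be either a single `StructuredData` object or a list of them, depending on whether structured extraction used a chunking strategy. A minimal sketch of normalizing both shapes into lists follows. `normalize_structured_data` is a hypothetical helper, and plain dictionaries stand in for `StructuredData` objects.

```python
from typing import Any, Dict, List, Optional


def normalize_structured_data(
    structured_data: Optional[Dict[str, Any]],
) -> Dict[str, List[Any]]:
    """Map every schema name to a list, wrapping single entries."""
    if not structured_data:
        return {}
    return {
        name: value if isinstance(value, list) else [value]
        for name, value in structured_data.items()
    }


# Plain dicts stand in for StructuredData objects in this example.
raw = {
    "Invoice": {"total": "120.00"},
    "LineItem": [{"sku": "A-1"}, {"sku": "B-2"}],
}
normalized = normalize_structured_data(raw)
print(normalized)
```

With every value normalized to a list, downstream code can iterate over the entries without checking which extraction mode was used.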

The ParseResult class is documented in the Python SDK reference and in the REST API reference.

Markdown Chunks

The markdown content of the document is available in the chunks attribute of the JSON response. The number of chunks depends on the chunking strategy you chose.

Chunking Strategy Options

  • None - The whole document is returned as a single chunk. This allows you to use your own chunking logic.
  • Page - Each page is returned as a separate chunk. You should receive as many chunks as the number of pages in the document.
  • Section - The document is split into chunks based on the section headers detected in the document.
  • Fragment - Every page fragment (e.g. table, figure, paragraph) is returned as a separate chunk. You will most likely have to merge these chunks based on your use-case.
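
With the Fragment strategy you will often want to recombine small chunks. The sketch below shows one possible approach, greedily merging consecutive chunks up to a character budget. It is not part of the API: `merge_chunks` is a hypothetical helper, and the `content` key used for each chunk's markdown is an assumed field name.

```python
from typing import Any, Dict, List


def merge_chunks(
    chunks: List[Dict[str, Any]],
    max_chars: int = 1000,
    separator: str = "\n\n",
) -> List[str]:
    """Greedily join consecutive chunks without exceeding max_chars."""
    merged: List[str] = []
    current = ""
    for chunk in chunks:
        text = chunk["content"]  # assumed key for the chunk's markdown
        if current and len(current) + len(separator) + len(text) > max_chars:
            # The next chunk would overflow the budget: start a new group.
            merged.append(current)
            current = text
        elif current:
            current = current + separator + text
        else:
            current = text
    if current:
        merged.append(current)
    return merged


fragments = [
    {"content": "# Title"},
    {"content": "First paragraph."},
    {"content": "x" * 30},
]
groups = merge_chunks(fragments, max_chars=30)
print(len(groups))
```

A single chunk that is longer than the budget is kept whole rather than split, so no fragment content is lost.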

Document Layout and Bounding Boxes

The entire document layout is available in the document_layout attribute of the JSON response. This object has a list of pages, each encoded as a JSON object. Each document_layout.pages[x] contains the following attributes:

  • page_number - The page number of the page.
  • dimensions - The width and height of the page in pixels.
  • page_fragments - The list of objects on the page. Each page fragment has the following attributes:
    • fragment_type - The type of the object: section_header, title, text, table, figure, formula, form, key_value_region, document_index, list_item, table_caption, figure_caption, formula_caption, page_footer, page_header, page_number, signature, strikethrough
    • reading_order - The reading order of the page fragments. This is the order in which the fragment would be read by a human.
    • bbox - The bounding box of the page fragment, in the format [x1, y1, x2, y2].
    • content - The actual content that is found on that fragment of the page.
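
The page attributes listed above can be traversed with ordinary dictionary access once the JSON response is decoded. The following sketch uses hypothetical helper names and a hand-written sample page. It assumes that `dimensions` is an object with `width` and `height` keys and that `(x1, y1)` and `(x2, y2)` are opposite corners of the bounding box.

```python
from typing import Any, Dict, List, Tuple


def fragments_in_reading_order(page: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the page's fragments sorted by their reading_order."""
    return sorted(page["page_fragments"], key=lambda f: f["reading_order"])


def bbox_size(bbox: List[float]) -> Tuple[float, float]:
    """Width and height of a [x1, y1, x2, y2] bounding box."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


def find_fragments(
    pages: List[Dict[str, Any]], fragment_type: str
) -> List[Tuple[int, Dict[str, Any]]]:
    """Collect (page_number, fragment) pairs of the given fragment_type."""
    return [
        (page["page_number"], frag)
        for page in pages
        for frag in page["page_fragments"]
        if frag["fragment_type"] == fragment_type
    ]


# Hand-written sample; real responses may carry additional fields.
page = {
    "page_number": 1,
    "dimensions": {"width": 612, "height": 792},
    "page_fragments": [
        {"fragment_type": "text", "reading_order": 2,
         "bbox": [50, 120, 560, 300], "content": "Body text."},
        {"fragment_type": "title", "reading_order": 1,
         "bbox": [50, 40, 560, 90], "content": "Agreement"},
        {"fragment_type": "signature", "reading_order": 3,
         "bbox": [400, 700, 560, 750], "content": ""},
    ],
}

ordered_types = [f["fragment_type"] for f in fragments_in_reading_order(page)]
signatures = find_fragments([page], "signature")
print(ordered_types, bbox_size(signatures[0][1]["bbox"]))
```

Filtering by `fragment_type` is a simple way to locate signatures or tables and then use their bounding boxes to crop or highlight the corresponding region of the page.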

Explore Advanced Capabilities

This page covered the basic reading capabilities of the Document Parsing API. In addition to this, you can also use Document Ingestion for more advanced parsing. Explore these options: