With the Document Parsing API, you can parse a document with a single API call. The response includes:

  • A markdown version of the document, including:
    • Tables encoded as Markdown or HTML
    • [Optional] Markdown chunks based on page, section, or fragment
  • A complete document layout, including:
    • Page numbers
    • Bounding boxes for each fragment found in the document (e.g. signature, key-value pair, figure)
  • [Optional] Structured data based on the schema you define

Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.). Documents are parsed with our own models.

Core Document Parsing Workflow

With a file_id and an api_key, you can parse a document with a single API call. More importantly, you can control how the document is parsed.

1. Call the `/parse` endpoint

The /parse endpoint will create a parse job with the following request payload:

The endpoint will return:

  • parse_id: The unique ID Tensorlake uses to reference the specific parsing job. This ID can be used to get the output when the parsing job is completed and re-visit previously used settings.
2. Query the status using the `/parse/{parse_id}` endpoint

The /parse/{parse_id} endpoint will return:

  • status: The status of the parsing job. This can be FAILURE, PENDING, PROCESSING, or SUCCESSFUL.
  • If the parsing job is PENDING or PROCESSING, you should wait a few seconds and then check again by re-calling the endpoint.
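
The wait-and-retry step above can be sketched as a small polling helper. This is a minimal sketch, not part of the SDK: `wait_for_job` and its `fetch_status` callable are hypothetical names, and the interval and timeout values are arbitrary choices.

```python
import time
from typing import Callable

# Statuses that mean the job is still running (from the status list above).
IN_PROGRESS = {"pending", "processing"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Poll until the job leaves the pending/processing states.

    `fetch_status` is a hypothetical callable that returns the current status
    string, for example by calling GET /parse/{parse_id} and reading `status`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status().lower()  # accept "PENDING" as well as "pending"
        if status not in IN_PROGRESS:
            return status  # "successful" or "failure"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        time.sleep(interval_seconds)
```

With the Python SDK, the callable could plausibly be `lambda: doc_ai.get_job(parse_id).status.value`, since `JobStatus` is a string enum. Treat that wiring as an assumption and check it against the SDK version you use.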
3. Retrieve the output using the `/parse/{parse_id}` endpoint

If the /parse/{parse_id} endpoint returns a SUCCESSFUL status, the response payload will include an Output object:

  • outputs.chunks: An array of objects that contain a chunk number and the markdown content for that chunk.
  • outputs.document: A JSON representation of the entire document, including any errors that happened while parsing.
  • outputs.num_pages: An integer representing the number of pages that were parsed.
  • [optionally] outputs.structured_data: An array of objects that contain the structured data as JSON and the page number where it was found.
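
As an illustration of reading these fields, the sketch below joins the markdown of every chunk in a finished response. The sample payload is hypothetical and only mirrors the field list above. The `content` and `page_number` keys are assumed from the SDK's `Chunk` model, and `markdown_from_response` is an illustrative helper, not an SDK function.

```python
from typing import Any


def markdown_from_response(payload: dict[str, Any]) -> str:
    """Join the markdown of every chunk in a finished job's response."""
    status = str(payload.get("status", "")).lower()
    if status != "successful":
        raise ValueError(f"job did not finish successfully (status={status!r})")
    outputs = payload.get("outputs") or {}
    chunks = outputs.get("chunks") or []
    # Assumption: each chunk object stores its markdown under "content".
    return "\n\n".join(chunk["content"] for chunk in chunks)


# Hypothetical response, shaped after the field list above.
example = {
    "status": "successful",
    "outputs": {
        "num_pages": 2,
        "chunks": [
            {"page_number": 1, "content": "# Title"},
            {"page_number": 2, "content": "Body text"},
        ],
    },
}
print(markdown_from_response(example))
```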

Explore Parse Configuration Settings

The main configuration settings for the parsing job are:

| Setting | Options | Default Value |
| --- | --- | --- |
| pages | Select the range of pages to parse. | None (all pages) |
| tableOutputMode | Choose between Markdown, HTML, or JSON. | "markdown" |
| tableParsingMode | Choose the table parsing mode, such as TSR. | "tsr" |
| chunkStrategy | Choose between None, Page, Section, or Fragment. | None (a single md document) |
| detectStrikethrough | Remove lines that have a strikethrough. | false |
| deliverWebhook | Deliver a webhook when the parsing job finishes. See the webhooks documentation to configure them. | false |

For the full list of configuration options, see the /parse section of the API reference.
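
To make the defaults concrete, here is a minimal sketch that keeps them in a plain dict and applies overrides on top. The setting names and default values come from the settings table above. The `build_settings` helper, the lowercase `"page"` value, and the `"1-2"` page-range format are illustrative assumptions, and this does not claim to be the exact REST request payload.

```python
# Defaults copied from the settings table above.
DEFAULT_SETTINGS = {
    "pages": None,                 # None means all pages
    "tableOutputMode": "markdown",
    "tableParsingMode": "tsr",
    "chunkStrategy": None,         # None means a single markdown document
    "detectStrikethrough": False,
    "deliverWebhook": False,
}


def build_settings(**overrides):
    """Return the defaults with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_SETTINGS)
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    return {**DEFAULT_SETTINGS, **overrides}


# Assumed value formats: a "1-2" page range and a lowercase strategy name.
settings = build_settings(pages="1-2", chunkStrategy="page")
```

Rejecting unknown names up front catches typos such as `chunkstrategy` before a job is created.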

Use the /parse API

Calling the /parse endpoint creates a new document parsing job, which starts in the pending state. The job transitions to the processing state and then, once the document is parsed successfully, to the successful state.

If you are using the Python SDK, all the configuration options described above are expressed through the ParsingOptions class.

from tensorlake.documentai import DocumentAI
from tensorlake.documentai.parse import (
    ParsingOptions,
    TableOutputMode,
    TableParsingStrategy,
)

doc_ai = DocumentAI(api_key="xxxx")
options = ParsingOptions(
    table_output_mode=TableOutputMode.MARKDOWN,
    table_parsing_strategy=TableParsingStrategy.TSR,
    # Omit page_range to parse the whole document.
    page_range="1-2",
)

# file_id is the ID of the document to parse.
# parse() returns the ID of the new parsing job.
parse_id = doc_ai.parse(file_id, options)

Retrieve Output from the Parsing Job

The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_job SDK function.

job = doc_ai.get_job(parse_id)

The response is a JSON object if you are using the REST API, and a Job object if you are using the Python SDK.

class Job(BaseModel):
    id: str
    status: JobStatus
    outputs: Optional[Output] = None

class Output(BaseModel):
    num_pages: Optional[int] = 0
    document: Optional[Document] = None
    chunks: List[Chunk] = Field(alias="chunks", default_factory=list)
    structured_data: Optional[StructuredData] = None
    error_message: Optional[str] = Field(alias="errorMessage", default="")

class Document(BaseModel):
    pages: List[Page]

class Page(BaseModel):
    page_number: int
    page_fragments: Optional[List[PageFragment]] = []
    layout: Optional[dict] = {}

class PageFragment(BaseModel):
    fragment_type: PageFragmentType
    content: Union[Text, Table, Figure, Signature]
    reading_order: Optional[int] = None
    page_number: Optional[int] = None
    bbox: Optional[dict[str, float]] = None

class JobStatus(str, Enum):
    PROCESSING = "processing"
    SUCCESSFUL = "successful"
    FAILURE = "failure"
    PENDING = "pending"

class Chunk(BaseModel):
    page_number: int
    content: str

Understand the Parsing Output

The outputs attribute of the response contains the following fields, which together describe the parsed document.

  • outputs.num_pages: An integer representing the number of pages that were parsed.
  • outputs.chunks: An array of objects that contain a chunk number and the markdown content for that chunk. See more below.
  • outputs.document: A JSON representation of the entire document, including any errors that happened while parsing. See more below.
  • [optionally] outputs.structured_data: An array of objects that contain the structured data as JSON and the page number where it was found.
  • outputs.errors: The errors encountered while parsing the document.

The Output class is documented in the Python SDK and REST API references.

Markdown Chunks

The markdown content of the document is available in the outputs.chunks attribute of the JSON response. The number of chunks depends on the chunking strategy you chose.

Chunking Strategy Options

  • None - The whole document is returned as a single chunk. This allows you to use your own chunking logic.
  • Page - Each page is returned as a separate chunk. You should receive as many chunks as the number of pages in the document.
  • Section - The document is split into chunks based on the section headers detected in the document.
  • Fragment - Every page fragment (e.g. table, figure, paragraph) is returned as a separate chunk. You will most likely have to merge these chunks based on your use case.
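
One possible way to merge Fragment chunks is to group them by page. The sketch below assumes each chunk carries a `page_number` and `content`, as in the SDK's `Chunk` model. The `Chunk` dataclass here is a local stand-in and `merge_chunks_by_page` is a hypothetical helper, so adapt the grouping key to your use case.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for the SDK's Chunk model (page_number and content)."""
    page_number: int
    content: str


def merge_chunks_by_page(chunks: list[Chunk]) -> dict[int, str]:
    """Concatenate chunk contents page by page, preserving input order."""
    by_page: dict[int, list[str]] = defaultdict(list)
    for chunk in chunks:
        by_page[chunk.page_number].append(chunk.content)
    return {page: "\n\n".join(parts) for page, parts in sorted(by_page.items())}


# Hypothetical fragment-level chunks for a two-page document.
fragments = [
    Chunk(1, "## Invoice"),
    Chunk(1, "| item | price |"),
    Chunk(2, "Signature block"),
]
merged = merge_chunks_by_page(fragments)
```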

Document Layout and Bounding Boxes

The entire document layout is available in the outputs.document attribute of the JSON response. This object has a list of Pages, each encoded as a JSON object. Each outputs.document.pages[x] contains the following attributes:

  • page_number - The page number of the page.
  • dimensions - The width and height of the page in pixels.
  • page_fragments - The list of objects on the page. Each page fragment has the following attributes:
    • fragment_type - The type of the object. Currently we detect the following types:
      • Section Header
      • Text
      • Table
      • Form
      • Formula
      • Figure
      • Signature
    • reading_order - The reading order of the page fragments. This is the order in which the fragment would be read by a human.
    • bounding_box - The bounding box of the page fragment, in the format [x1, y1, x2, y2].
    • content - The actual content that is found on that fragment of the page.
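
The layout described above can be walked with a few lines of code. This is a sketch over a hypothetical document dict. The serialized type names (`"table"`, `"text"`) are assumptions, and the `bounding_box` key follows the attribute list above, whereas the SDK's `PageFragment` model exposes the box as `bbox`. `iter_fragments` and `bbox_size` are illustrative helpers.

```python
from typing import Any, Iterator


def iter_fragments(document: dict[str, Any], fragment_type: str) -> Iterator[dict[str, Any]]:
    """Yield fragments of one type, page by page, in reading order."""
    for page in document.get("pages", []):
        fragments = page.get("page_fragments") or []
        # Fragments without a reading_order sort last.
        ordered = sorted(
            fragments,
            key=lambda f: (f.get("reading_order") is None, f.get("reading_order") or 0),
        )
        for fragment in ordered:
            if fragment.get("fragment_type") == fragment_type:
                yield fragment


def bbox_size(bounding_box: list[float]) -> tuple[float, float]:
    """Return (width, height) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bounding_box
    return (x2 - x1, y2 - y1)


# Hypothetical document; the serialized type names are assumed.
document = {
    "pages": [
        {
            "page_number": 1,
            "page_fragments": [
                {"fragment_type": "table", "reading_order": 2,
                 "bounding_box": [50.0, 200.0, 550.0, 400.0], "content": {}},
                {"fragment_type": "text", "reading_order": 1,
                 "bounding_box": [50.0, 50.0, 550.0, 180.0], "content": {}},
            ],
        }
    ]
}
tables = list(iter_fragments(document, "table"))
```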

Explore Advanced Capabilities

This page covered the basic reading capabilities of the Document Parsing API. In addition to this, you can also use Document Ingestion for more advanced parsing. Explore these options: