Extracting Content

With the Document Parsing API, you can parse a document with a single API call. The response always includes:

- A markdown version of the document, including:
  - Tables encoded as Markdown or HTML
  - Markdown chunks based on page, section, or fragment
- A complete document layout, including:
  - Page numbers
  - Bounding boxes for each fragment found in the document (e.g. signature, key-value pair, figure)
- Page classes based on the page classifications you define
  - See the Page Classification documentation for more information.
- Structured data based on the schema you define
  - See the Structured Data Extraction documentation for more information.

Core Document Parsing Workflow

With a file and an api_key, you can quickly parse a document with a single API call. More importantly, you can control how the document is parsed. The complete parse flow is described below.

Call the parse endpoint

The parse endpoint will create a parse job with the following request payload:

A file source, which can be:
- A file_id returned from uploading a file to Tensorlake Cloud.
- A file_url that points to a publicly accessible file.
- A raw_text string.
Key options for parsing. See the parse settings below, and get a full list in the API Reference. These options are not required; the API uses default settings for any option you omit.
page_range: The range of pages to parse, in the format 1-2 or 1,3,5. If not specified, all pages will be parsed.
labels: Additional metadata to identify the parse request. The labels are returned along with the parse response.
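
As a rough illustration of the payload described above, the sketch below assembles a request body as a plain Python dictionary. The helper name `build_parse_request` is hypothetical and is not part of the Tensorlake SDK. Only the field names (`file_id`, `file_url`, `raw_text`, `page_range`, `labels`) come from this page, and the rule that a request carries exactly one file source is an assumption.

```python
from typing import Optional


def build_parse_request(
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    raw_text: Optional[str] = None,
    page_range: Optional[str] = None,
    labels: Optional[dict] = None,
) -> dict:
    """Assemble a parse request body (hypothetical helper, not an SDK call)."""
    sources = {"file_id": file_id, "file_url": file_url, "raw_text": raw_text}
    provided = {key: value for key, value in sources.items() if value is not None}
    # Assumption: a request carries exactly one file source.
    if len(provided) != 1:
        raise ValueError("provide exactly one of file_id, file_url, or raw_text")

    payload = dict(provided)
    # Optional fields are left out when unset so the API can apply its defaults.
    if page_range is not None:
        payload["page_range"] = page_range
    if labels:
        payload["labels"] = labels
    return payload


request_body = build_parse_request(
    file_id="tensorlake-xxxx",
    page_range="1-2",
    labels={"source": "invoices"},  # illustrative label, not a required key
)
print(request_body)
```

Leaving unset options out of the dictionary, rather than sending nulls, keeps the request aligned with the documented behavior that missing options fall back to defaults.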

The endpoint will return:

parse_id: The unique ID Tensorlake uses to reference the specific parsing job. This ID can be used to get the output when the parsing job is completed and re-visit previously used settings.
created_at: The date and time when the parse job was created in RFC 3339 format.

Query the status of the parsing job

The /parse/{parse_id} endpoint will return:

status: The status of the parsing job. This can be failure, pending, processing, or successful.
If the parsing job is pending or processing, you should wait a few seconds and then check again by re-calling the endpoint.
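
The wait-and-check-again loop described above could be sketched as follows. This is a minimal illustration, not the SDK's implementation: `fetch_status` is a hypothetical callable standing in for a request to /parse/{parse_id}, and the interval and attempt limit are arbitrary values chosen for the example. Only the four status strings come from this page.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"successful", "failure"}
IN_PROGRESS_STATUSES = {"pending", "processing"}


def wait_for_parse(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or attempts run out."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if status not in IN_PROGRESS_STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        # Still pending or processing: wait before asking again.
        sleep(interval_s)
    raise TimeoutError("parse job did not finish within the attempt limit")


# Simulated status sequence so the sketch runs without network access.
_statuses = iter(["pending", "processing", "successful"])
final_status = wait_for_parse(lambda: next(_statuses), sleep=lambda _seconds: None)
print(final_status)
```

Passing the status lookup and the sleep function as parameters keeps the loop independent of any HTTP client and makes it easy to exercise in tests.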

Retrieve the parsed result

When the parsing job is successful, you can retrieve the parsed result by calling the /parse/{parse_id} endpoint. The response payload will include a Response object:

finished_at: The date and time when the parse job was finished in RFC 3339 format.
chunks: An array of objects that contain a chunk number (specified by the chunk strategy) and the markdown content for that chunk.
document_layout: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element (text, tables, figures, signatures), and reading order.
page_classes: This is a map where the keys are page class names provided in the parse request. This value is only present if the page_classifications were provided in the parse request.

The page_classes field will be empty if no page classes were extracted from the document, or if the page_classifications were not provided in the parse request.

structured_data: The structured data is an array of objects, where each object contains the structured data extracted from the document based on the schema provided in the parse request.

The structured_data field will be empty if no structured data was extracted from the document, or if the structured_extraction_options were not provided in the parse request.

labels: Labels associated with the parse job.
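
Because `page_classes` and `structured_data` may be empty or absent, code that reads the response usually needs to tolerate both cases. The sketch below shows one way to do that on a response already decoded into a dictionary. The function name `summarize_parse_response` is hypothetical, and the sample values (including the keys inside the chunk object) are illustrative placeholders rather than real API output.

```python
def summarize_parse_response(response: dict) -> dict:
    """Summarize a decoded response, treating missing fields as empty."""
    chunks = response.get("chunks") or []
    page_classes = response.get("page_classes") or {}
    structured_data = response.get("structured_data") or []
    return {
        "finished_at": response.get("finished_at"),
        "chunk_count": len(chunks),
        "page_class_names": sorted(page_classes),
        "structured_record_count": len(structured_data),
        "labels": response.get("labels") or {},
    }


# Illustrative payload: no page classifications or schemas were requested.
example_response = {
    "finished_at": "2025-01-01T12:00:00Z",
    "chunks": [{"content": "# Title"}],  # chunk keys are an assumption
    "page_classes": {},
    "labels": {"source": "invoices"},
}
summary = summarize_parse_response(example_response)
print(summary)
```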

Options for Parsing Documents

These are the main object properties you can include in your parse request payload to customize the parsing behavior:

Parameter	Description
`parsing_options`	Customizes the document parsing process, including table parsing, chunking strategies, and more. See Parsing Options.
`enrichment_options`	Summarize tables and figures present in the document. See Summarization.
`page_classifications`	Defines settings for page classification. When present, the API will perform page classification on the document. See Page Classifications.
`structured_extraction_options`	Configuration for structured data extraction. When present, the API will perform structured data extraction based on provided schemas. See Structured Extraction Options

Get a full list of the configuration options in the /parse section of the API reference.

Parsing Options

Parsing Options include:

Parameter	Description	Default Value
`chunking_strategy`	Choose between None, Page, Section, or Fragment.	`None` (no chunking)
`disable_layout_detection`	Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images.	`false`
`remove_strikethrough_lines`	Enable detection and removal of strikethrough text.	`false`
`signature_detection`	Enable detection of signatures in the document. See Signature Detection.	`false`
`skew_detection`	Detect and correct skewed or rotated pages. Please note this can increase the processing time.	`false`
`table_output_mode`	Choose between Markdown or HTML.	`HTML`
`table_parsing_format`	Selects the format used to parse tables. See the API Reference for the available values.	`TSR`
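
To make the defaults in the table concrete, the sketch below merges caller overrides onto a dictionary of default values. It is an illustration only: `resolve_parsing_options` is a hypothetical helper, and the value spellings (`"HTML"`, `"TSR"`, `"Fragment"`) simply mirror the table above, so the strings the API expects on the wire may differ.

```python
from typing import Optional

# Defaults copied from the table above; value spellings are an assumption.
DEFAULT_PARSING_OPTIONS = {
    "chunking_strategy": None,
    "disable_layout_detection": False,
    "remove_strikethrough_lines": False,
    "signature_detection": False,
    "skew_detection": False,
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
}


def resolve_parsing_options(overrides: Optional[dict] = None) -> dict:
    """Return the effective options: defaults updated with any overrides."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_PARSING_OPTIONS)
    # Reject misspelled option names instead of silently ignoring them.
    if unknown:
        raise KeyError(f"unknown parsing options: {sorted(unknown)}")
    return {**DEFAULT_PARSING_OPTIONS, **overrides}


effective = resolve_parsing_options(
    {"chunking_strategy": "Fragment", "signature_detection": True}
)
print(effective)
```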

Parse a document with the SDK or API

Calling the parse endpoint will create a new document parsing job, starting in the pending state. It will transition to the processing state and then to the successful state when it’s parsed successfully.

If you are using the Python SDK, all the configuration options described above are expressed through the ParsingOptions class.

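from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    ChunkingStrategy,
    TableOutputMode,
    TableParsingFormat,
)

# Replace with your API key and the file_id returned from uploading a file.
doc_ai = DocumentAI(api_key="xxxx")
file_id = "tensorlake-xxxx"

# Chunk by fragment and emit tables as Markdown instead of the default HTML.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.FRAGMENT,
    table_output_mode=TableOutputMode.MARKDOWN
)

# Create the parse job for pages 1-2; the returned ID is used to fetch the output.
parse_id = doc_ai.parse(file_id, page_range="1-2", parsing_options=parsing_options)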

Retrieve Output from the Parsing Job

The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_parsed_result SDK function.

result = doc_ai.get_parsed_result(parse_id)

Using Output from Parse Jobs

Leveraging the markdown chunks is a common next step after parsing documents.

Python

markdown_chunks = result.chunks

for chunk_number, chunk in enumerate(markdown_chunks):
    print(f"## CHUNK NUMBER {chunk_number}\n\n")
    print(f"## Page {chunk.page_number}\n\n{chunk.content}\n\n")
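
Another common step is to regroup chunks by page before indexing or display. The sketch below assumes each chunk exposes `page_number` and `content`, as in the loop above. The `Chunk` dataclass and the `group_chunks_by_page` helper are stand-ins defined here so the example runs on its own, not SDK types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Chunk:
    """Stand-in for a parsed chunk; not the SDK's own class."""
    page_number: int
    content: str


def group_chunks_by_page(chunks: Iterable[Chunk]) -> Dict[int, str]:
    """Join the markdown of all chunks on each page, in page order."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.page_number].append(chunk.content)
    # Separate chunks with a blank line so markdown blocks stay distinct.
    return {page: "\n\n".join(parts) for page, parts in sorted(pages.items())}


sample_chunks = [
    Chunk(page_number=2, content="| a | b |"),
    Chunk(page_number=1, content="# Title"),
    Chunk(page_number=1, content="Intro paragraph."),
]
pages = group_chunks_by_page(sample_chunks)
for page_number, markdown in pages.items():
    print(f"## Page {page_number}\n\n{markdown}\n")
```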

See Parse Output for more details about the output.

Explore Advanced Capabilities

This page covered the basic reading capabilities of the Document Parsing API. You can also use Document Ingestion for more advanced parsing. Explore these options:

Structured Data Extraction

By simply specifying a schema, you can extract exactly the data you need from any document, all with the same API call as basic parsing.

Summarization

By setting a few extra settings, you can ensure all tables, figures, and charts are summarized.

Signature Detection

Setting signature_detection to true will ensure all signatures are detected throughout your document.

Page Classification

Specify on what types of pages certain structured data can be found for more accurate data retrieval.
