Understand how to convert PDF Documents to Markdown for use in AI Agents
With the Document Parsing API, you can parse a document with a single API call and you will always get in return:
Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.); Tensorlake uses its own models to parse the document.
With a `file` and an `api_key`, you can quickly parse a document with a single API call. More importantly, you can control how the document is parsed.
Call the `/parse` endpoint
The `/parse` endpoint will create a parse job with the following request payload:
- `file`: Either the `file_id` returned from uploading a file to Tensorlake Cloud, or a pre-signed URL or any HTTP URL that can be used to download the file.
- `page_range`: The range of pages to parse, in the format `1-2` or `1,3,5`. If not specified, all pages will be parsed.
- `labels`: Additional metadata to identify the parse request. The labels are returned along with the parse response.

The endpoint will return:
- `parse_id`: The unique ID Tensorlake uses to reference the specific parsing job. This ID can be used to get the output when the parsing job is completed and to revisit previously used settings.

Query the status using the `/parse/{parse_id}` endpoint
The `/parse/{parse_id}` endpoint will return:
- `status`: The status of the parsing job. This can be `FAILURE`, `PENDING`, `PROCESSING`, or `SUCCESSFUL`.

If the status is `PENDING` or `PROCESSING`, wait a few seconds and then check again by calling the endpoint once more.

Retrieve the output using the `/parse/{parse_id}` endpoint
If the `/parse/{parse_id}` endpoint returns a `SUCCESSFUL` status, the response payload will include a `Response` object:
- `chunks`: An array of objects that contain a chunk number (specified by the chunk strategy) and the markdown content for that chunk.
- `document_layout`: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element (text, tables, figures, signatures), and reading order.
- `page_classes`: A map where the keys are the page class names provided in the parse request.
- `structured_data`: A map where the keys are the names of the JSON schemas provided in the parse request, and the values are `StructuredData` objects.
- `options`: The options used for scheduling the parse job.
- `labels`: Labels associated with the parse job.

These are the main object properties you can include in your parse request payload to customize the parsing behavior:
Parameter | Description |
---|---|
enrichment_options | Summarize tables and figures present in the document. |
parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See parsing options below. |
page_classifications | Defines settings for page classification. When present, the API will perform page classification on the document. |
structured_extraction_options | Configuration for structured data extraction. When present, the API will perform structured data extraction based on provided schemas. |
parsing_options
Setting | Options | Default Value |
---|---|---|
chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
remove_strikethrough_lines | Enable detection and removal of strikethrough text. | false |
signature_detection | Enable detection of signatures in the document. | false |
skew_detection | Detect and correct skewed or rotated pages. Please note this can increase the processing time. | false |
table_output_mode | Choose between Markdown or HTML. | HTML |
table_parsing_format | Selects the table parsing format; see the API reference for the accepted values. | TSR |
Get a full list of the configuration setting options in the `/parse` section of the API reference.
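As a rough illustration of how the table above maps onto a request, the following Python sketch builds a `parsing_options` object from the listed defaults and applies overrides. The helper name `make_parsing_options` is hypothetical, and the exact JSON spelling of each value (for example `"Page"` or `"HTML"`) is an assumption taken from the table, not confirmed wire format.

```python
# Defaults copied from the parsing_options table above.
# ASSUMPTION: the JSON spelling of each value matches the table text.
DEFAULT_PARSING_OPTIONS = {
    "chunking_strategy": None,  # None means no chunking
    "disable_layout_detection": False,
    "remove_strikethrough_lines": False,
    "signature_detection": False,
    "skew_detection": False,
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
}


def make_parsing_options(**overrides):
    """Return the table defaults with the given settings overridden.

    Hypothetical helper: it only guards against misspelled setting names.
    """
    unknown = set(overrides) - set(DEFAULT_PARSING_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown parsing option(s): {sorted(unknown)}")
    return {**DEFAULT_PARSING_OPTIONS, **overrides}


# Chunk by page and correct skewed pages; everything else keeps its default.
options = make_parsing_options(chunking_strategy="Page", skew_detection=True)
```

Rejecting unknown keys early keeps a typo such as `skew_detect` from being silently ignored.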
`/parse` API
Calling the `/parse` endpoint will create a new document parsing job, starting in the `pending` state. It will transition to the `processing` state and then to the `successful` state once the document is parsed successfully.
If you are using the Python SDK, all the configuration options described above are expressed through the `ParsingOptions` class.
The HTTP API for parsing is thoroughly documented here. Here is an example of how to initiate a parsing job:
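A minimal Python sketch of such a request, using only the standard library. Several details are assumptions rather than documented behavior: `BASE_URL` is a placeholder host, the `Authorization: Bearer` header is a guess at how the `api_key` is passed, `labels` is assumed to be a JSON object, and the helper names are hypothetical. Check the API reference for the real values before using it.

```python
import json
import urllib.request

# ASSUMPTION: placeholder host, not the real API base URL.
BASE_URL = "https://api.example.com"


def build_parse_payload(file, page_range=None, labels=None, parsing_options=None):
    """Assemble the /parse request payload, omitting unset optional fields."""
    payload = {"file": file}
    if page_range is not None:
        payload["page_range"] = page_range  # e.g. "1-2" or "1,3,5"
    if labels is not None:
        payload["labels"] = labels
    if parsing_options is not None:
        payload["parsing_options"] = parsing_options
    return payload


def build_parse_request(api_key, payload, base_url=BASE_URL):
    """Build (but do not send) the POST request for the /parse endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # ASSUMPTION: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_parse_job(api_key, payload, base_url=BASE_URL):
    """Send the request and return the parse_id from the JSON response."""
    request = build_parse_request(api_key, payload, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["parse_id"]


# Build a request for pages 1-2 of a previously uploaded file (placeholder ID).
payload = build_parse_payload(
    "file_id_placeholder", page_range="1-2", labels={"source": "docs-example"}
)
request = build_parse_request("YOUR_API_KEY", payload)
```

Calling `create_parse_job("YOUR_API_KEY", payload)` would send the request; it is left uncalled here because the host is a placeholder.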
The parsed document output can be retrieved using the `/parse/{parse_id}` endpoint, or using the `get_job` SDK function.
The response is a JSON object if you are using the REST API, and a `ParseResult` object if you are using the Python SDK.
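The wait-and-retry pattern for retrieving a finished job can be sketched as below. `wait_for_parse` is a hypothetical helper, and `fetch_job` stands in for whatever performs the `GET /parse/{parse_id}` call and returns the decoded JSON; only the four status values come from this page. Statuses are compared case-insensitively, since both upper- and lower-case spellings appear in this documentation.

```python
import time


def wait_for_parse(fetch_job, parse_id, interval_seconds=2.0, max_attempts=30):
    """Poll until the job leaves PENDING/PROCESSING, then return the response.

    fetch_job: callable taking a parse_id and returning the response as a dict.
    """
    for _ in range(max_attempts):
        job = fetch_job(parse_id)
        status = str(job.get("status", "")).upper()
        if status == "SUCCESSFUL":
            return job
        if status == "FAILURE":
            raise RuntimeError(f"Parse job {parse_id} failed")
        if status not in ("PENDING", "PROCESSING"):
            raise ValueError(f"Unexpected status: {status!r}")
        time.sleep(interval_seconds)  # still running: wait, then check again
    raise TimeoutError(f"Parse job {parse_id} did not finish in time")


# Demonstration with canned responses instead of real HTTP calls.
_responses = iter(
    [
        {"status": "PENDING"},
        {"status": "PROCESSING"},
        {"status": "SUCCESSFUL", "chunks": []},
    ]
)
result = wait_for_parse(lambda parse_id: next(_responses), "parse_123", interval_seconds=0)
```

Passing the fetch function in as an argument keeps the polling logic independent of the HTTP client and easy to test.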
The response contains the following fields, which return the parsed document:
The `Outputs` class is documented in the Python SDK and in the REST API.
The markdown content of the document is available in the `chunks` attribute of the JSON response. The number of chunks depends on the chunking strategy you chose.
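To turn the chunks back into a single Markdown document, something like the following sketch may work. The key names `chunk_number` and `content` are assumptions based on the description of each chunk (a chunk number plus its markdown content); verify them against a real response. `markdown_from_chunks` is a hypothetical helper.

```python
def markdown_from_chunks(response, separator="\n\n"):
    """Join chunk markdown in chunk-number order.

    ASSUMPTION: each chunk is a dict with "chunk_number" and "content" keys.
    """
    chunks = response.get("chunks") or []
    ordered = sorted(chunks, key=lambda chunk: chunk["chunk_number"])
    return separator.join(chunk["content"] for chunk in ordered)


# Sample response shaped like the assumed payload, with chunks out of order.
sample = {
    "chunks": [
        {"chunk_number": 1, "content": "## Terms"},
        {"chunk_number": 0, "content": "# Invoice"},
    ]
}
markdown = markdown_from_chunks(sample)
```

With no chunking strategy set, the response would presumably hold a single chunk, and the function simply returns its content.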
Chunking Strategy Options
The entire document layout is available in the `outputs.document` attribute of the JSON response. This object has a list of pages, each encoded as a JSON object. Each `outputs.document.pages[x]` contains the following attributes:
- `page_number`: The page number of the page.
- `dimensions`: The width and height of the page in pixels.
- `page_fragments`: The list of objects on the page. Each page fragment has the following attributes:
  - `fragment_type`: The type of the object: section_header, title, text, table, figure, formula, form, key_value_region, document_index, list_item, table_caption, figure_caption, formula_caption, page_footer, page_header, page_number, signature, strikethrough.
  - `reading_order`: The reading order of the page fragments. This is the order in which the fragment would be read by a human.
  - `bbox`: The bounding box of the page fragment, in the format `[x1, y1, x2, y2]`.
  - `content`: The actual content that is found on that fragment of the page.

This page covered the basic reading capabilities of the Document Parsing API. In addition to this, you can also use Document Ingestion for more advanced parsing. Explore these options:
By specifying a schema, you can extract exactly the data you need from any document, all with the same API call as basic parsing.
By setting a few extra settings, you can ensure all tables, figures, and charts are summarized.
Setting `detect_signatures` to `true` will ensure all signatures are detected throughout your document.
Specify which types of pages contain certain structured data for more accurate data retrieval.