Understand how to convert PDF Documents to Markdown for use in AI Agents
With a `file` and an `api_key`, you can quickly parse a document with a single API call. More importantly, you can control how the document is parsed.
Call the parse endpoint

- A `file_id` returned from uploading a file to Tensorlake Cloud.
- A `file_url` that points to a publicly accessible file.
- A `raw_text` string.
- `page_range`: The range of pages to parse, in the format `1-2` or `1,3,5`. If not specified, all pages will be parsed.
- `labels`: Additional metadata to identify the parse request. The labels are returned along with the parse response.

- `parse_id`: The unique ID Tensorlake uses to reference the specific parsing job. This ID can be used to get the output when the parsing job is completed and to revisit previously used settings.
- `created_at`: The date and time when the parse job was created, in RFC 3339 format.

Query the status of the parsing job
The `/parse/{parse_id}` endpoint will return:

- `status`: The status of the parsing job. This can be `failure`, `pending`, `processing`, or `successful`.

If the status is `pending` or `processing`, wait a few seconds and then check again by calling the endpoint again.

Retrieve the parsed result
When the status is `successful`, you can retrieve the parsed result by calling the `/parse/{parse_id}` endpoint.
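The check, wait, and re-check cycle described above can be sketched as a small polling helper. This is a minimal sketch under stated assumptions, not the SDK's implementation: `wait_for_parse` and the injected `fetch_job` callable are hypothetical names, and `fetch_job(parse_id)` is assumed to return the decoded JSON from `/parse/{parse_id}` as a dict with a `status` key.

```python
import time
from typing import Any, Callable, Dict


def wait_for_parse(
    fetch_job: Callable[[str], Dict[str, Any]],
    parse_id: str,
    interval_s: float = 3.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a parse job until it is `successful`.

    `fetch_job` is a hypothetical callable that wraps a GET on
    /parse/{parse_id} and returns the response as a dict.
    Raises RuntimeError on `failure` and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(parse_id)
        status = job.get("status")
        if status == "successful":
            return job
        if status == "failure":
            raise RuntimeError(f"parse job {parse_id} failed")
        if status not in ("pending", "processing"):
            # Guard against a status this sketch does not know about.
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"parse job {parse_id} still {status}")
        # Still pending or processing: wait a few seconds, then re-check.
        sleep(interval_s)
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP client, and makes it easy to exercise with a stub before pointing it at the real endpoint.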
The response payload will include a `Response` object:

- `finished_at`: The date and time when the parse job finished, in RFC 3339 format.
- `chunks`: An array of objects that contain a chunk number (specified by the chunk strategy) and the markdown content for that chunk.
- `document_layout`: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element (text, tables, figures, signatures), and reading order.
- `page_classes`: A map whose keys are the page class names provided in the parse request. This value is only present if `page_classifications` were provided in the parse request. The `page_classes` field will be empty if no page classes were extracted from the document, or if `page_classifications` were not provided in the parse request.
- `structured_data`: An array of objects, where each object contains the structured data extracted from the document based on the schema provided in the parse request. The `structured_data` field will be empty if no structured data was extracted from the document, or if `structured_extraction_options` were not provided in the parse request.
- `labels`: Labels associated with the parse job.

Parameter | Description |
---|---|
parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See Parsing Options. |
enrichment_options | Summarize tables and figures present in the document. See Summarization. |
page_classifications | Defines settings for page classification. When present, the API will perform page classification on the document. See Page Classifications. |
structured_extraction_options | Configuration for structured data extraction. When present, the API will perform structured data extraction based on provided schemas. See Structured Extraction Options. |
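To show how the input fields and the optional parameters in the table above might fit together, here is a minimal sketch that assembles a request body as a plain dict. `build_parse_request` is a hypothetical helper, not part of the Tensorlake SDK. It assumes that exactly one of `file_id`, `file_url`, or `raw_text` identifies the input, and it treats every option as an opaque value because the nested shapes are defined in the API reference, not here.

```python
import json
from typing import Any, Dict, Optional


def build_parse_request(
    *,
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    raw_text: Optional[str] = None,
    page_range: Optional[str] = None,
    labels: Optional[Dict[str, str]] = None,
    parsing_options: Any = None,
    enrichment_options: Any = None,
    page_classifications: Any = None,
    structured_extraction_options: Any = None,
) -> Dict[str, Any]:
    """Assemble a parse request body (hypothetical helper)."""
    sources = {"file_id": file_id, "file_url": file_url, "raw_text": raw_text}
    given = {k: v for k, v in sources.items() if v is not None}
    # Assumption: the request names exactly one input source.
    if len(given) != 1:
        raise ValueError("provide exactly one of file_id, file_url, raw_text")

    optional = {
        "page_range": page_range,
        "labels": labels,
        "parsing_options": parsing_options,
        "enrichment_options": enrichment_options,
        "page_classifications": page_classifications,
        "structured_extraction_options": structured_extraction_options,
    }
    body: Dict[str, Any] = dict(given)
    # Leave out anything the caller did not set so server defaults apply.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


# Example: parse the first two pages of an uploaded file.
# "file_123" is a placeholder ID, not a real file.
example = build_parse_request(
    file_id="file_123",
    page_range="1-2",
    labels={"team": "finance"},
)
print(json.dumps(example, indent=2))
```

Omitting unset keys, instead of sending explicit nulls, is a design choice of this sketch; check the API reference for how the service treats missing versus null fields.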
See the `/parse` section of the API reference.

Parameter | Description | Default Value |
---|---|---|
chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
remove_strikethrough_lines | Enable detection and removal of strikethrough text. | false |
signature_detection | Enable detection of signatures in the document. See Signature Detection. | false |
skew_detection | Detect and correct skewed or rotated pages. Please note this can increase the processing time. | false |
table_output_mode | Choose between Markdown and HTML. | HTML |
table_parsing_format | Choose between or . | TSR |
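The defaults in the table above can be mirrored in a small dataclass, which makes it easy to see which settings a request overrides. This is an illustrative stand-in only: `ParsingOptionsSketch` and `to_payload` are hypothetical names, the SDK's own options class may look different, and the exact spelling and casing of the string values is an assumption taken from how the table prints them.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsingOptionsSketch:
    """Hypothetical mirror of the parsing options table and its defaults."""

    chunking_strategy: Optional[str] = None  # None means no chunking
    disable_layout_detection: bool = False
    remove_strikethrough_lines: bool = False
    signature_detection: bool = False
    skew_detection: bool = False  # may increase processing time
    table_output_mode: str = "HTML"  # value casing is an assumption
    table_parsing_format: str = "TSR"  # value casing is an assumption

    def to_payload(self) -> Dict[str, Any]:
        # Drop unset (None) fields so only concrete values are sent.
        return {k: v for k, v in asdict(self).items() if v is not None}


# Example: chunk by page and correct skewed scans, keep other defaults.
options = ParsingOptionsSketch(chunking_strategy="Page", skew_detection=True)
print(options.to_payload())
```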
A new parse job starts in the `pending` state. It transitions to the `processing` state and then to the `successful` state once the document is parsed successfully.
`ParsingOptions` class.
`/parse/{parse_id}` endpoint, or using the `get_job` SDK function.
Setting `detect_signatures` to `true` will ensure all signatures are detected throughout your document.