/parse/{parse_id} endpoint, or using the get_parsed_result SDK function.
The result is a ParseResult object if you are using the Python SDK.
Output Response Fields
The response contains the following fields, which describe the parsed document:
- parse_id: The unique identifier for the parse job.
- parsed_pages_count: An integer representing the number of pages that were parsed successfully.
- total_pages: An integer representing the total number of pages in the document.
- status: The status of the parse job.
- created_at: The date and time when the parse job was created in RFC 3339 format.
- finished_at: The date and time when the parse job was finished in RFC 3339 format.
- error: Any errors encountered while parsing the document.
- labels: Labels associated with the parse job.
- chunks: An array of objects that contain the markdown content for each chunk. The number of chunks depends on the chunking strategy you chose. See more below.
- pages: A comprehensive JSON representation of the document’s visual structure, including page dimensions, bounding boxes for each element, and reading order. See more below.
- page_classes: A map where the keys are page class names provided in the parse request, and the values are PageClass objects containing class names and page numbers where each page class appears. See more below.
- structured_data: A map where the keys are the names of the JSON schema provided in the parse request, and the values are StructuredData objects containing the extracted structured data. See more below.
- usage: An object containing usage statistics for the parse job, including token counts and parsed page counts for various extraction tasks.
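As a quick sketch, once the JSON response has been decoded into a Python dict, the top-level fields can be read directly. The sample values below are hypothetical; only the field names come from the list above:

```python
import json

# Hypothetical parse response, shaped after the fields listed above.
sample_response = json.loads("""
{
  "parse_id": "parse_123",
  "parsed_pages_count": 2,
  "total_pages": 2,
  "status": "successful",
  "created_at": "2024-01-01T00:00:00Z",
  "finished_at": "2024-01-01T00:00:10Z",
  "error": null,
  "labels": {},
  "chunks": [],
  "pages": []
}
""")

def summarize(response: dict) -> str:
    """Return a one-line summary of a parse job response."""
    return (f"{response['parse_id']}: {response['status']}, "
            f"{response['parsed_pages_count']}/{response['total_pages']} pages parsed")

print(summarize(sample_response))
# parse_123: successful, 2/2 pages parsed
```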
Markdown Chunks
The markdown content of the document is available in the chunks attribute of the JSON response. The number of chunks
depends on the chunking strategy you chose.
Chunking Strategy Options
- None - The whole document is returned as a single chunk. This allows you to use your own chunking logic.
- Page - Each page is returned as a separate chunk. You should receive as many chunks as the number of pages in the document.
- Section - The document is split into chunks based on the section headers detected in the document.
- Fragment - Every page fragment (e.g. table, figure, paragraph) is returned as a separate chunk. You will most likely have to merge these chunks based on your use-case.
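With the Fragment strategy in particular, you will usually recombine chunks yourself. A minimal sketch, assuming each chunk carries a page number and its markdown content (the field names here are illustrative, not the exact SDK schema):

```python
from collections import defaultdict

# Hypothetical fragment-level chunks; field names are assumptions,
# not the exact Tensorlake chunk schema.
chunks = [
    {"page_number": 1, "content": "# Title"},
    {"page_number": 1, "content": "First paragraph."},
    {"page_number": 2, "content": "| a | b |"},
]

def merge_by_page(chunks):
    """Re-merge fragment-level chunks into one markdown string per page."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk["page_number"]].append(chunk["content"])
    return {page: "\n\n".join(parts) for page, parts in pages.items()}

merged = merge_by_page(chunks)
print(merged[1])
# # Title
#
# First paragraph.
```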
DOCX Tracked Changes and Comments
When parsing DOCX files that contain tracked changes or comments, Tensorlake preserves this collaboration metadata in the HTML output. This enables workflows that need to process document revisions, review comments, or extract specific change history. Tracked changes and comments are preserved using semantic HTML markup:

Tracked Changes:
- Insertions: `<ins>inserted text</ins>` - Text that was added to the document
- Deletions: `<del>deleted text</del>` - Text that was removed or struck through

Comments:
- Comment ranges: `<span class="comment" data-note="comment text">highlighted text</span>` - Comments anchored to selected text
- Comment references: `<!-- Comment: comment text -->` - Comments at cursor positions without highlighted text
Example Output
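A sketch of what the preserved markup can look like in the markdown output. The content below is invented for illustration, not actual Tensorlake output:

```markdown
The committee <ins>unanimously</ins> approved the <del>draft</del> proposal.

<span class="comment" data-note="Confirm the vote count">approved</span>
<!-- Comment: Add the meeting date -->
```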
Extracting Change Data Programmatically
Use these HTML patterns to extract specific content types:
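A sketch using Python's standard re module against the markup patterns above; the sample text is invented for illustration:

```python
import re

# Sample output containing the markup patterns described above (invented text).
sample = (
    'The quarterly report shows <ins>strong</ins> growth. '
    '<del>Revenue declined.</del> '
    '<span class="comment" data-note="verify this figure">12% increase</span> '
    '<!-- Comment: needs legal review -->'
)

# One capture group -> list of strings; two groups -> list of (note, text) tuples.
insertions = re.findall(r"<ins>(.*?)</ins>", sample)
deletions = re.findall(r"<del>(.*?)</del>", sample)
comment_ranges = re.findall(
    r'<span class="comment" data-note="(.*?)">(.*?)</span>', sample
)
comment_refs = re.findall(r"<!-- Comment: (.*?) -->", sample)

print(insertions)      # ['strong']
print(deletions)       # ['Revenue declined.']
print(comment_ranges)  # [('verify this figure', '12% increase')]
print(comment_refs)    # ['needs legal review']
```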
Tracked changes are only preserved when parsing DOCX files that contain Microsoft Word’s revision history. Regular text formatting (bold, italic) is handled separately through standard HTML markup.
Document Layout and Bounding Boxes
The entire document layout is available in the pages attribute of the JSON response. This attribute holds a list of Pages, each
encoded as a JSON object. Each pages[x] contains the following attributes:
- page_number - The page number of the page.
- dimensions - The width and height of the page in pixels.
- page_fragments - The list of objects on the page. Each page fragment has the following attributes:
  - fragment_type - The type of the object: section_header, title, text, table, figure, formula, form, key_value_region, document_index, list_item, table_caption, figure_caption, formula_caption, page_footer, page_header, page_number, signature, strikethrough
  - reading_order - The reading order of the page fragments. This is the order in which the fragment would be read by a human.
  - bbox - The bounding box of the page fragment, in the format [x1, y1, x2, y2].
  - content - The actual content that is found on that fragment of the page.
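As a sketch, these attributes can be used to walk a page in reading order and filter by fragment type. The page dict below is hypothetical sample data shaped after the attributes above; the exact encoding of dimensions may differ:

```python
# Hypothetical page object, shaped after the documented attributes.
page = {
    "page_number": 1,
    "dimensions": {"width": 612, "height": 792},  # encoding assumed
    "page_fragments": [
        {"fragment_type": "text", "reading_order": 2,
         "bbox": [72, 200, 540, 260], "content": "Body paragraph."},
        {"fragment_type": "section_header", "reading_order": 1,
         "bbox": [72, 100, 540, 140], "content": "Introduction"},
        {"fragment_type": "table", "reading_order": 3,
         "bbox": [72, 300, 540, 500], "content": "| a | b |"},
    ],
}

def fragments_in_reading_order(page, fragment_types=None):
    """Return fragment contents sorted by reading order, optionally filtered by type."""
    fragments = page["page_fragments"]
    if fragment_types is not None:
        fragments = [f for f in fragments if f["fragment_type"] in fragment_types]
    return [f["content"] for f in sorted(fragments, key=lambda f: f["reading_order"])]

print(fragments_in_reading_order(page))
# ['Introduction', 'Body paragraph.', '| a | b |']
print(fragments_in_reading_order(page, {"table"}))
# ['| a | b |']
```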
Page Classifications
Page classifications are also returned as a list of PageClass objects, which contain the following attributes:
- page_class: The classification name you provided. This will match the name field in your PageClassConfig.
- page_numbers: An array of page numbers (1-indexed) that match this classification.
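For instance, the list can be inverted to look up which classifications matched a given page. The objects below are hypothetical sample data shaped after the attributes above:

```python
# Hypothetical page classification results.
page_classes = [
    {"page_class": "invoice", "page_numbers": [1, 3]},
    {"page_class": "terms", "page_numbers": [2]},
]

def classes_for_page(page_classes, page_number):
    """Return every classification that matched the given (1-indexed) page."""
    return [pc["page_class"] for pc in page_classes
            if page_number in pc["page_numbers"]]

print(classes_for_page(page_classes, 3))  # ['invoice']
```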
Structured Extraction
Structured data is returned as a list whose length depends on the partition strategy (e.g. one StructuredData object for each partition of the document). Each object contains:
- data: The JSON object representing the extracted data that matches the input schema.
- page_numbers: A list of page numbers where the structured data was searched for.
- schema_name: The name of the schema provided by the user.
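Because one object is returned per partition, results for a schema often need to be gathered across partitions. A minimal sketch over hypothetical sample data shaped after the attributes above:

```python
# Hypothetical StructuredData objects, one per partition.
structured_data = [
    {"schema_name": "invoice_schema",
     "data": {"invoice_number": "INV-001", "total": 125.0},
     "page_numbers": [1]},
    {"schema_name": "invoice_schema",
     "data": {"invoice_number": "INV-002", "total": 80.0},
     "page_numbers": [2]},
]

def collect(structured_data, schema_name):
    """Gather extracted data for one schema across all partitions."""
    return [item["data"] for item in structured_data
            if item["schema_name"] == schema_name]

invoices = collect(structured_data, "invoice_schema")
print(sum(inv["total"] for inv in invoices))  # 205.0
```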
Usage
The usage attribute contains usage statistics for the parse job, including token counts and parsed page counts for various extraction tasks. The fields include:
- pages_parsed: The number of pages that were parsed.
- signature_detected_pages: The number of pages where signatures were detected. This is only applicable if signature detection was enabled.
- strikethrough_detected_pages: The number of pages where strikethroughs were detected. This is only applicable if strikethrough detection was enabled.
- ocr_input_tokens_used: The number of input tokens used for OCR processing.
- ocr_output_tokens_used: The number of output tokens generated from OCR processing.
- extraction_input_tokens_used: The number of input tokens used for text extraction. This is only applicable if structured extraction options were enabled.
- extraction_output_tokens_used: The number of output tokens generated from text extraction. This is only applicable if structured extraction options were enabled.
- summarization_input_tokens_used: The number of input tokens used for summarization.
- summarization_output_tokens_used: The number of output tokens generated from summarization.
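For cost tracking, the token counters can be summed directly. A sketch over a hypothetical usage object shaped after the fields above:

```python
# Hypothetical usage object; values are invented sample data.
usage = {
    "pages_parsed": 10,
    "ocr_input_tokens_used": 5000,
    "ocr_output_tokens_used": 2000,
    "extraction_input_tokens_used": 1200,
    "extraction_output_tokens_used": 300,
    "summarization_input_tokens_used": 0,
    "summarization_output_tokens_used": 0,
}

def total_tokens(usage: dict) -> int:
    """Sum every *_tokens_used counter in the usage object."""
    return sum(v for k, v in usage.items() if k.endswith("_tokens_used"))

print(total_tokens(usage))  # 8500
```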