- Markdown representation of pages, with the elements on each page ordered by their natural reading order
- Tables encoded as Markdown or HTML
- Summary of tables and figures guided by custom prompts
- Bounding boxes for each page element (e.g. signature, key-value pair, figure)
Read the Overview to understand how to integrate Document Parsing into your existing workflows.
API Usage Guide
Calling the parse endpoint creates a new document parsing job, which starts in the pending state. The job transitions to the processing state and then to the successful state once the document has been parsed.
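The job lifecycle above lends itself to a simple polling loop. The following is a minimal sketch, not the SDK's own method: `wait_for_job` and `fetch_status` are hypothetical names, and the `failed` state is an assumed terminal state that this guide does not name. In practice `fetch_status` would wrap whatever call returns the job's current state.

```python
import time
from typing import Callable

# "successful" comes from this guide; "failed" is an assumed terminal state.
TERMINAL_STATES = {"successful", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> str:
    """Poll fetch_status() until the job reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in for a real status call, replaying the lifecycle described above.
_states = iter(["pending", "processing", "successful"])
final_state = wait_for_job(lambda: next(_states), interval_s=0.0)
```

Passing the status call in as a function keeps the loop independent of whether the SDK or the REST API is used.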
If you are using the Python SDK, all the configuration options described below are expressed through the ParsingOptions class.
Options for Parsing Documents
Document Parsing can be customized by providing parsing_options and enrichment_options in your request.
Parameter | Description |
---|---|
parsing_options | Customizes the OCR and table parsing process and chunking strategies. See Parsing Options. |
enrichment_options | Enables and configures table and figure summarization. See Summarization. |
For the full list of configuration options, see the /parse section of the API reference.
Parsing Options
Parameter | Description | Default Value |
---|---|---|
chunking_strategy | Choose between , , , or . | None |
table_output_mode | Choose between Markdown and HTML. | HTML |
table_parsing_format | Choose between or . | TSR |
disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
skew_detection | Detect and correct skewed or rotated pages. This can increase processing time. | false |
signature_detection | Detect signatures in the document. This can increase processing time and incurs additional cost. | false |
remove_strikethrough_lines | Remove strikethrough lines from the document. This can increase processing time and incurs additional cost. | false |
ignore_sections | A set of document fragments to ignore during parsing. This can be useful for excluding irrelevant sections from the output. | [] |
cross_page_header_detection | A boolean flag to enable header hierarchy detection across pages. This can improve the accuracy of header extraction in multi-page documents. | false |
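One way to work with the table above is to start from the documented defaults and override only what a given document needs. This is a sketch under stated assumptions: `build_parsing_options` is a hypothetical helper rather than part of the SDK, and the exact string values the API accepts (for example the casing of `HTML` and `TSR`) should be confirmed against the API reference.

```python
from typing import Any

# Defaults copied from the table above. The exact string casing the API
# expects ("HTML", "TSR") is an assumption.
PARSING_DEFAULTS: dict[str, Any] = {
    "chunking_strategy": None,
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
    "disable_layout_detection": False,
    "skew_detection": False,
    "signature_detection": False,
    "remove_strikethrough_lines": False,
    "ignore_sections": [],
    "cross_page_header_detection": False,
}


def build_parsing_options(**overrides: Any) -> dict[str, Any]:
    """Return the defaults with overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(PARSING_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parsing option(s): {sorted(unknown)}")
    # Copy list defaults so callers never mutate the shared template.
    options = {
        key: (list(value) if isinstance(value, list) else value)
        for key, value in PARSING_DEFAULTS.items()
    }
    options.update(overrides)
    return options


# Example: a scanned multi-page report with rotated pages.
parsing_options = build_parsing_options(
    skew_detection=True,
    cross_page_header_detection=True,
)
```

Rejecting unknown keys locally catches typos such as `skew_detect` before a request is sent.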
Retrieve Output
The parsed document output can be retrieved using the /parse/{parse_id} endpoint or the get_job SDK function.
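For the REST route, the request can be sketched with the standard library alone. Only the `/parse/{parse_id}` path comes from this guide; the base URL, the bearer-token header, and the `build_result_request` helper are placeholders for illustration.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical base URL and auth scheme; only the /parse/{parse_id} path
# comes from this guide.
BASE_URL = "https://api.example.com"


def build_result_request(parse_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a parse job's output."""
    url = f"{BASE_URL}/parse/{quote(parse_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


request = build_result_request("example-parse-id", api_key="YOUR_API_KEY")
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Percent-encoding the identifier keeps an unexpected character in `parse_id` from altering the request path.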
Markdown Chunks
Using the Markdown chunks is a common next step after parsing documents. See Parse Output for more details about the output.
Bounding Boxes
Each page fragment includes bounding box coordinates that specify the exact location of the content on the page. This is useful for creating citations, highlighting source content in a UI, or debugging extraction quality.
Accessing Bounding Boxes
Coordinate System
Bounding boxes use the following coordinate system:
- x1, y1: Top-left corner of the bounding box
- x2, y2: Bottom-right corner of the bounding box
- Origin (0,0): Top-left corner of the page
- Units: Pixels
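The coordinate system above can be illustrated with a small helper type. This is a sketch: the `BoundingBox` class and the page dimensions passed to `normalized` are assumptions for illustration, and where the page width and height come from in the real output is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    x1: float  # left edge, in pixels from the left of the page
    y1: float  # top edge, in pixels from the top of the page
    x2: float  # right edge
    y2: float  # bottom edge

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    def normalized(
        self, page_width: float, page_height: float
    ) -> tuple[float, float, float, float]:
        """Express the box as fractions of the page size (0.0 to 1.0)."""
        return (
            self.x1 / page_width,
            self.y1 / page_height,
            self.x2 / page_width,
            self.y2 / page_height,
        )


# Made-up values: a 400 x 60 pixel fragment on a 1000 x 2000 pixel page.
box = BoundingBox(x1=100, y1=200, x2=500, y2=260)
fractions = box.normalized(page_width=1000, page_height=2000)
```

Because the origin is the top-left corner, y grows downward, so `y2` is always larger than `y1` for a valid box. Normalized coordinates are convenient when highlighting a fragment on a page image rendered at a different resolution.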
Table and Figure Summarization
The Document Ingestion API can be used to summarize tables and figures in documents.
Parameter | Description | Default Value |
---|---|---|
table_summarization | Enable summarization of tables present in the document. This will generate a summary of the table content, including key insights and trends. | false |
figure_summarization | Enable summarization of figures present in the document. This will generate a summary of the figure content, including key insights and trends. | false |
table_summarization_prompt | A custom prompt to use for table summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
figure_summarization_prompt | A custom prompt to use for figure summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
include_full_page_image | Include the full page image as additional context when summarizing tables and figures, which can improve accuracy by capturing surrounding headers, captions, and related content. | false |
Tables
Tables can be summarized by setting table_summarization to true in the enrichment_options JSON object when calling the parse API.
Figures
Figures can be summarized by setting figure_summarization to true in the enrichment_options JSON object when calling the parse API.
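The enrichment_options object for both cases might be assembled as follows. The key names come from the parameter table above; `build_enrichment_options` is a hypothetical helper, the prompt text is made up, and how the resulting JSON is attached to the request (form field or JSON body) is not specified here.

```python
import json
from typing import Any, Optional


def build_enrichment_options(
    summarize_tables: bool = False,
    summarize_figures: bool = False,
    table_prompt: Optional[str] = None,
    figure_prompt: Optional[str] = None,
    include_full_page_image: bool = False,
) -> dict[str, Any]:
    """Assemble enrichment_options, omitting prompts that were not provided."""
    options: dict[str, Any] = {
        "table_summarization": summarize_tables,
        "figure_summarization": summarize_figures,
        "include_full_page_image": include_full_page_image,
    }
    # Leaving a prompt out lets the service fall back to its default prompt.
    if table_prompt is not None:
        options["table_summarization_prompt"] = table_prompt
    if figure_prompt is not None:
        options["figure_summarization_prompt"] = figure_prompt
    return options


enrichment_options = build_enrichment_options(
    summarize_tables=True,
    summarize_figures=True,
    table_prompt="Summarize the key trends in this table in two sentences.",
)
enrichment_json = json.dumps(enrichment_options)
```

Omitting an unset prompt, rather than sending an empty string, matches the table's note that the default prompt is used when none is specified.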