- Markdown representation of pages, with the elements on each page ordered by their natural reading order
- Tables encoded as Markdown or HTML
- Summary of tables and figures guided by custom prompts
- Bounding boxes for each page element (e.g. signature, key-value pair, figure)
Read the Overview to understand how to
integrate Document Parsing into your existing workflows.
API Usage Guide
Calling the read endpoint creates a new document parsing job, which starts in the `pending` state. The job transitions to the `processing` state and then to the `successful` state once the document has been parsed successfully.
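The job lifecycle described above can be followed with a simple polling loop. This is a minimal sketch rather than the SDK's own mechanism: `wait_for_job` and its `fetch_state` callback are hypothetical names, and the `failed` state is an assumption, since this guide only names `pending`, `processing`, and `successful`.

```python
import time
from typing import Callable

# "successful" comes from this guide; "failed" is an ASSUMED terminal state.
TERMINAL_STATES = {"successful", "failed"}


def wait_for_job(
    fetch_state: Callable[[], str],
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> str:
    """Call fetch_state() until the job reaches a terminal state.

    fetch_state is a hypothetical callback that returns the current job
    state as a string, for example by calling the SDK or the REST API.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Job is still "pending" or "processing"; wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")
```

Passing the state lookup in as a callback keeps the loop independent of how the state is actually fetched, so the same function works with either the SDK or plain HTTP calls.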
- Python SDK
- REST API
If you are using the Python SDK, all the configuration options described in this guide are expressed through the `ParsingOptions` class.
Options for Parsing Documents
Document Parsing can be customized by providing `parsing_options` and `enrichment_options` in your request.
Parameter | Description |
---|---|
parsing_options | Customizes the OCR and table parsing process and chunking strategies. See Parsing Options. |
enrichment_options | Enables and configures table and figure summarization. See Summarization. |
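As a rough illustration of how the two objects sit side by side in a request, the sketch below assembles a JSON body. Only the two top-level keys come from the table above; the helper name `build_parse_request` is hypothetical, and whatever field identifies the document to parse is deliberately left out because this guide does not describe it.

```python
import json
from typing import Any, Optional


def build_parse_request(
    parsing_options: Optional[dict[str, Any]] = None,
    enrichment_options: Optional[dict[str, Any]] = None,
) -> str:
    """Serialize the optional configuration objects into a JSON body.

    Hypothetical helper: the document/file field of the real request
    is not covered by this guide and is omitted here.
    """
    body: dict[str, Any] = {}
    # Include each object only when the caller supplied it, so the
    # service-side defaults apply otherwise.
    if parsing_options is not None:
        body["parsing_options"] = parsing_options
    if enrichment_options is not None:
        body["enrichment_options"] = enrichment_options
    return json.dumps(body)


payload = build_parse_request(
    parsing_options={"skew_detection": True},
    enrichment_options={"table_summarization": True},
)
```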
The full list of configuration options is available in the `/parse` section of the API reference.
Parsing Options
Parameter | Description | Default Value |
---|---|---|
chunking_strategy | Choose one of the four available chunking strategies. | None |
table_output_mode | Choose between Markdown and HTML. | HTML |
table_parsing_format | Choose one of the two available table parsing formats. | TSR |
disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
skew_detection | Detect and correct skewed or rotated pages. This can increase processing time. | false |
signature_detection | Detect signatures in the document. This can increase processing time and incurs additional costs. | false |
remove_strikethrough_lines | Remove strikethrough lines from the document. This can increase processing time and incurs additional costs. | false |
ignore_sections | A set of document fragments to ignore during parsing. This can be useful for excluding irrelevant sections from the output. | [] |
cross_page_header_detection | A boolean flag to enable header hierarchy detection across pages. This can improve the accuracy of header extraction in multi-page documents. | false |
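The defaults in the table above can be captured in a plain dictionary and overridden selectively. This is a sketch only: `make_parsing_options` is a hypothetical helper, not the SDK's `ParsingOptions` class, and the exact spelling of string values such as `"HTML"` and `"TSR"` on the wire is an assumption taken from the table.

```python
import copy
from typing import Any

# Defaults transcribed from the Parsing Options table.
# The literal spellings "HTML" and "TSR" are ASSUMED to match the API.
DEFAULT_PARSING_OPTIONS: dict[str, Any] = {
    "chunking_strategy": None,
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
    "disable_layout_detection": False,
    "skew_detection": False,
    "signature_detection": False,
    "remove_strikethrough_lines": False,
    "ignore_sections": [],
    "cross_page_header_detection": False,
}


def make_parsing_options(**overrides: Any) -> dict[str, Any]:
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_PARSING_OPTIONS)
    if unknown:
        # Fail early on typos instead of sending an unrecognized key.
        raise KeyError(f"unknown parsing option(s): {sorted(unknown)}")
    # Deep copy so the mutable default for ignore_sections is never shared.
    options = copy.deepcopy(DEFAULT_PARSING_OPTIONS)
    options.update(overrides)
    return options


# Scanned contract: straighten pages and look for signatures.
scan_options = make_parsing_options(skew_detection=True, signature_detection=True)
```

Rejecting unknown keys locally is a design choice for the sketch; it catches misspelled option names before a request is made.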
Retrieve Output
The parsed document output can be retrieved using the `/parse/{parse_id}` endpoint, or using the `get_job` SDK function.
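For the REST route, the retrieval URL is the `/parse/{parse_id}` path appended to the API's base URL. The sketch below only builds that URL; `parse_result_url` is a hypothetical helper, and `https://api.example.com` is a placeholder, since this guide does not state the real base URL or the authentication scheme.

```python
from urllib.parse import quote


def parse_result_url(base_url: str, parse_id: str) -> str:
    """Build the retrieval URL for a parse job.

    base_url is a placeholder supplied by the caller; the path template
    /parse/{parse_id} is the one named in this guide.
    """
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{base_url.rstrip('/')}/parse/{quote(parse_id, safe='')}"


url = parse_result_url("https://api.example.com/", "abc123")
```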
Markdown Chunks
Using the Markdown chunks is a common next step after parsing documents. See Parse Output for more details
about the output.
Table and Figure Summarization
The Document Ingestion API can be used to summarize tables and figures in documents.
Parameter | Description | Default Value |
---|---|---|
table_summarization | Enable summarization of tables present in the document. This will generate a summary of the table content, including key insights and trends. | false |
figure_summarization | Enable summarization of figures present in the document. This will generate a summary of the figure content, including key insights and trends. | false |
table_summarization_prompt | A custom prompt to use for table summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
figure_summarization_prompt | A custom prompt to use for figure summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
Tables
Tables can be summarized by setting `table_summarization` to `true` in the `enrichment_options` JSON object when calling the `parse` API.
Figures
Figures can be summarized by setting `figure_summarization` to `true` in the `enrichment_options` JSON object when calling the `parse` API.
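Putting the table and figure settings together, an `enrichment_options` object might be assembled as follows. The four keys are the ones listed in the summarization table; the helper `make_enrichment_options` and the sample prompt text are hypothetical and shown only for illustration.

```python
from typing import Any, Optional


def make_enrichment_options(
    summarize_tables: bool = False,
    summarize_figures: bool = False,
    table_prompt: Optional[str] = None,
    figure_prompt: Optional[str] = None,
) -> dict[str, Any]:
    """Build an enrichment_options object from the documented keys."""
    options: dict[str, Any] = {
        "table_summarization": summarize_tables,
        "figure_summarization": summarize_figures,
    }
    # Leave the prompt keys out when unset so the default prompt is used.
    if table_prompt is not None:
        options["table_summarization_prompt"] = table_prompt
    if figure_prompt is not None:
        options["figure_summarization_prompt"] = figure_prompt
    return options


enrichment = make_enrichment_options(
    summarize_tables=True,
    summarize_figures=True,
    table_prompt="Summarize the totals in each table.",  # illustrative prompt
)
```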