- A markdown version of the document:
  - The page elements are ordered by their natural reading order
  - Levels of section headers are detected and preserved in the markdown
  - Tables are encoded as Markdown or HTML
  - Figures and tables are optionally summarized
- Bounding boxes for each page element found in the document (e.g. signature, key-value pair, figure)
Read the Overview to understand how to integrate Document Parsing into your existing workflows.
Parse a document with the SDK or API
Calling the parse endpoint creates a new document parsing job, which starts in the pending state. It transitions to the processing state and then to the successful state once the document is parsed successfully.
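The job lifecycle described here (pending, then processing, then successful) suggests a simple polling loop. The following is a minimal sketch, not the SDK's own implementation: `wait_for_parse` and its `fetch_status` argument are hypothetical names, and the caller is assumed to supply a function that returns the job's current state as a lowercase string.

```python
import time
from typing import Callable


def wait_for_parse(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the pending/processing states.

    fetch_status is a caller-supplied function (hypothetical) that returns
    the current job state as a string. Any state other than "pending" or
    "processing" is treated as final and returned to the caller.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status not in ("pending", "processing"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s
```

In practice `fetch_status` would wrap whatever call returns the job state, for example a lookup through the SDK. The shape of that response is not shown on this page, so the wrapper is left to the caller.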
If you are using the Python SDK, all the configuration options described on this page are expressed through the ParsingOptions class.
Retrieve Output
The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_job SDK function.
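As a rough illustration of the endpoint route, the sketch below builds (but does not send) a GET request for `/parse/{parse_id}` using only the standard library. The base URL is a placeholder, and the Bearer-token `Authorization` header is an assumption. Neither appears on this page, so check the API reference for the real values. `build_get_parse_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

# Placeholder only: substitute the base URL given in the API reference.
API_BASE = "https://api.example.com"


def build_get_parse_request(
    parse_id: str, api_key: str, base_url: str = API_BASE
) -> urllib.request.Request:
    """Build a GET request for /parse/{parse_id} without sending it."""
    # Percent-encode the id so it cannot alter the path.
    safe_id = urllib.parse.quote(parse_id, safe="")
    url = f"{base_url.rstrip('/')}/parse/{safe_id}"
    headers = {
        # Assumption: the API accepts a Bearer token.
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

Passing the returned object to `urllib.request.urlopen` would perform the call. The structure of the response body is described in Parse Output rather than here.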
Markdown Chunks
Using the markdown chunks is a common next step after parsing documents. See Parse Output for more details about the output.
Options for Parsing Documents
These are the main object properties you can include in your parse request payload to customize the parsing behavior:

Parameter | Description |
---|---|
parsing_options | Customizes the document parsing process, including table parsing, chunking strategies, and more. See Parsing Options. |
enrichment_options | Summarize tables and figures present in the document. See Summarization. |
For the full list of configuration options, see the /parse section of the API reference.
Parsing Options
Parsing Options include:

Parameter | Description | Default Value |
---|---|---|
chunking_strategy | Choose between None, Page, Section, or Fragment. | None (no chunking) |
table_output_mode | Choose between Markdown or HTML. | HTML |
table_parsing_format | Choose between or . | TSR |
disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
skew_detection | Detect and correct skewed or rotated pages. Please note this can increase the processing time. | false |
signature_detection | Detect signatures in the document. Please note this can increase the processing time, and incurs additional costs. | false |
remove_strikethrough_lines | Remove strikethrough lines from the document. Please note this can increase the processing time, and incurs additional costs. | false |
ignore_sections | A set of document fragments to ignore during parsing. This can be useful for excluding irrelevant sections from the output. | [] |
cross_page_header_detection | A boolean flag to enable header hierarchy detection across pages. This can improve the accuracy of header extraction in multi-page documents. | false |
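
To make the table concrete, here is a minimal sketch that assembles the `parsing_options` part of a request payload, starting from the defaults listed above and applying overrides. `build_parsing_options` is a hypothetical helper, not part of the SDK. The exact spelling of enum values on the wire (for example `"Page"` versus `"page"`) is an assumption, so confirm it against the API reference.

```python
import copy
import json

# Defaults copied from the Parsing Options table.
PARSING_DEFAULTS = {
    "chunking_strategy": None,  # None means no chunking
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
    "disable_layout_detection": False,
    "skew_detection": False,
    "signature_detection": False,
    "remove_strikethrough_lines": False,
    "ignore_sections": [],
    "cross_page_header_detection": False,
}


def build_parsing_options(**overrides):
    """Return the table defaults with overrides applied.

    Rejects names that are not in the table, so typos fail early.
    """
    unknown = set(overrides) - set(PARSING_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parsing option(s): {sorted(unknown)}")
    options = copy.deepcopy(PARSING_DEFAULTS)  # avoid sharing the default list
    options.update(overrides)
    return options


# Value spellings such as "Page" are assumed, not confirmed by this page.
payload = {
    "parsing_options": build_parsing_options(
        chunking_strategy="Page",
        skew_detection=True,
    )
}
print(json.dumps(payload, indent=2))
```

Sending every default explicitly is a design choice for readability. A real request could likely omit any option left at its default value.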
Explore Advanced Capabilities
This page covered the basic reading capabilities of the Document Parsing API. You can also use Document Ingestion for more advanced parsing. Explore these options:
Structured Data Extraction
By simply specifying a schema, you can extract exactly the data you need from any document, all with the same API call as basic parsing.
Summarization
By setting a few extra options, you can ensure all tables, figures, and charts are summarized.
Signature Detection
Setting detect_signatures to true will ensure all signatures are detected throughout your document.
Page Classification
Specify which types of pages contain certain structured data for more accurate data retrieval.