Structured data extraction is requested through the structured_extraction_options parameter of the parse endpoint.
The structured_extraction_options parameter is an array of objects, where each object contains the schema name and the JSON Schema to use for structured extraction.
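
As a minimal sketch, the array could be assembled as shown below. The schema name invoice and its fields are hypothetical, and the remaining fields of the parse request are omitted; only structured_extraction_options, schema_name, and json_schema come from the description above.

```python
import json

# Hypothetical JSON Schema describing the fields to extract from an invoice.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}

# structured_extraction_options is an array: one object per schema.
request_body = {
    "structured_extraction_options": [
        {"schema_name": "invoice", "json_schema": invoice_schema},
    ],
    # Other parse request fields (such as the file reference) are omitted here.
}

print(json.dumps(request_body, indent=2))
```

Adding a second schema means appending another object with its own schema_name and json_schema to the same array.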
The extracted data is returned in the structured_data field of the Get Parse Job endpoint response.
The structured_data field is an array of objects. Each object contains the extracted data, the page numbers from which the data was extracted, and the schema name used for extraction.
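
The sketch below shows one way a client might group the entries of structured_data by schema. The response fragment is made up for illustration, and the key names inside each entry (schema_name, page_numbers, data) are assumptions; check the Get Parse Job response reference for the exact names.

```python
from collections import defaultdict

# Hypothetical response fragment; the key names inside each entry are assumptions.
sample_response = {
    "structured_data": [
        {"schema_name": "invoice", "page_numbers": [1, 2],
         "data": {"invoice_number": "INV-001"}},
        {"schema_name": "signature", "page_numbers": [5],
         "data": {"signed": True}},
    ]
}


def group_by_schema(response):
    """Group extracted entries by the schema name that produced them."""
    grouped = defaultdict(list)
    for entry in response.get("structured_data", []):
        grouped[entry["schema_name"]].append(
            {"pages": entry["page_numbers"], "data": entry["data"]}
        )
    return dict(grouped)


results = group_by_schema(sample_response)
for name, entries in results.items():
    for item in entries:
        print(name, item["pages"], item["data"])
```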
Parameter | Description | Optional | Default Value |
---|---|---|---|
schema_name | The name of the schema to use for structured data extraction. This will be used as the key in the structured_data field of the response. | No | - |
json_schema | The JSON Schema that defines the structure of the data to extract from the document, such as tables, forms, or other structured content. It must be a valid JSON Schema object. | No | - |
partition_strategy | How the document is partitioned for structured data extraction: none, page, or fragment. With page, structured data is extracted from every page of the document. With fragment, structured data is extracted from every fragment of the document, which is useful for documents with multiple sections or tables. | Yes | none |
page_classes | An array of page class names to limit the structured data extraction to specific page types. This is useful for documents where structured data is only present on certain pages, such as signature pages or form pages. If not specified, structured data will be extracted from all pages of the document. | Yes | - |
skip_ocr | A boolean flag that skips OCR processing for structured data extraction. If set to true, the API does not perform OCR and extracts structured data only from the text already present in the document. This is useful for documents that are already machine-readable, such as PDFs with embedded text. | Yes | false |
prompt | A custom prompt that gives the AI model additional context or instructions for extraction. This is useful for documents with complex structures or specific extraction requirements. If not specified, the default prompt is used. | Yes | - |
model-provider | The LLM used to perform structured extraction. Supported values: tensorlake (proprietary model specifically trained for structured data extraction), gpt_4o_mini (OpenAI model), sonnet (Anthropic model). | Yes | tensorlake |
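
To make the table concrete, here is a sketch of a small client-side helper that builds one entry of structured_extraction_options and rejects values the table does not list. The helper name build_extraction_option is hypothetical and not part of any SDK; the defaults mirror the Default Value column.

```python
ALLOWED_PARTITION_STRATEGIES = {"none", "page", "fragment"}
ALLOWED_MODEL_PROVIDERS = {"tensorlake", "gpt_4o_mini", "sonnet"}


def build_extraction_option(schema_name, json_schema, *, partition_strategy="none",
                            page_classes=None, skip_ocr=False, prompt=None,
                            model_provider="tensorlake"):
    """Build one entry of structured_extraction_options (hypothetical helper)."""
    if not schema_name:
        raise ValueError("schema_name is required")
    if not isinstance(json_schema, dict):
        raise ValueError("json_schema must be a JSON Schema object")
    if partition_strategy not in ALLOWED_PARTITION_STRATEGIES:
        raise ValueError(f"unknown partition_strategy: {partition_strategy!r}")
    if model_provider not in ALLOWED_MODEL_PROVIDERS:
        raise ValueError(f"unknown model provider: {model_provider!r}")

    option = {
        "schema_name": schema_name,
        "json_schema": json_schema,
        "partition_strategy": partition_strategy,
        "skip_ocr": skip_ocr,
        # Key spelled as in the table above; verify the exact name against the API reference.
        "model-provider": model_provider,
    }
    # Optional parameters without a default are only sent when provided.
    if page_classes is not None:
        option["page_classes"] = list(page_classes)
    if prompt is not None:
        option["prompt"] = prompt
    return option


option = build_extraction_option(
    "invoice",
    {"type": "object", "properties": {"total_amount": {"type": "number"}}},
    partition_strategy="page",
    prompt="Amounts are in USD.",
)
print(option)
```

Validating locally is a design choice in this sketch: it surfaces a misspelled strategy or provider before a request is sent, but the API remains the authority on which values are accepted.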
Each object in the structured_extraction_options parameter can specify how the document should be partitioned for structured data extraction. To do this, set the partition_strategy parameter in the structured extraction request object.
Note that partition_strategy is separate from the chunking_strategy parameter in the parse_options property, which controls how the document is chunked for markdown generation.
The supported partition strategies are:
none (Default) - Extract structured data from the whole document at once.
page - Extract structured data from every page of the document.
fragment - Extract structured data from every fragment of the document. This is useful for documents with multiple sections or tables.
To limit extraction to specific page types, set the page_classes parameter in each structured data extraction request object.
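
Because page_classes is set per request object, different schemas can target different page types. The sketch below illustrates this with two hypothetical schemas; the class names signature_page and form_page are assumptions and would need to match page classes defined elsewhere in the request. The helper schemas_for_class is illustrative only and mirrors the documented rule that an object without page_classes applies to all pages.

```python
# Two hypothetical extraction requests, each limited to a different page class.
structured_extraction_options = [
    {
        "schema_name": "signatures",
        "json_schema": {
            "type": "object",
            "properties": {"signer_name": {"type": "string"}},
        },
        "page_classes": ["signature_page"],
    },
    {
        "schema_name": "form_fields",
        "json_schema": {
            "type": "object",
            "properties": {"applicant_name": {"type": "string"}},
        },
        "page_classes": ["form_page"],
        "partition_strategy": "page",
    },
]


def schemas_for_class(options, page_class):
    """Return the schema names that would run on pages of the given class.

    An option without page_classes is treated as applying to all pages.
    """
    return [
        opt["schema_name"]
        for opt in options
        if "page_classes" not in opt or page_class in opt["page_classes"]
    ]


print(schemas_for_class(structured_extraction_options, "signature_page"))
print(schemas_for_class(structured_extraction_options, "form_page"))
```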
Specifying a page_range will limit all parsing, classification, and data extraction capabilities to only those pages.