- No limits on the number of fields you can extract.
- Extraction is guided by a JSON Schema you provide (or a Pydantic model with the Python SDK).
- You can submit multiple schemas in a single API call.
Try this out using this Colab Notebook.
## Structured Extraction Request
Structured outputs from documents can be generated by specifying one or more JSON Schemas in the `structured_extraction_options` parameter of the `parse` endpoint.

The `structured_extraction_options` parameter is an array of objects, where each object contains the schema name and the JSON Schema to use for structured extraction.
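For example, a minimal request with the Python SDK might look like the sketch below. The `structured_extraction_options` parameter and the `StructuredExtractionOptions` fields are documented on this page; the `DocumentAI` client and the exact `parse` call signature are assumptions, so check the SDK reference for the precise names.

```python
# A minimal sketch; the client and call signature are assumptions.
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

parse_id = doc_ai.parse(
    "path/to/invoice.pdf",  # file path, file ID, or URL, depending on your setup
    structured_extraction_options=[
        StructuredExtractionOptions(
            schema_name="invoice",
            json_schema=invoice_schema,
        )
    ],
)
```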
## Structured Extraction Response
Structured data extracted from the document is returned in the `structured_data` field of the Get Parse Job endpoint response.

The `structured_data` field is an array of objects, where each object contains the extracted data, the page numbers from which the data was extracted, and the schema name used for extraction.
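As an illustration, one `structured_data` entry might look like the following. The `schema_name` value comes from the request; the exact field names inside each entry (`data`, `page_numbers`) are assumptions based on the description above.

```python
# Illustrative shape of the `structured_data` array in the
# Get Parse Job response; entry field names are assumptions.
structured_data = [
    {
        "schema_name": "invoice",        # the schema used for extraction
        "data": {
            "invoice_number": "INV-1042",
            "total_amount": 1834.50,
        },                               # the extracted data
        "page_numbers": [1, 2],          # pages the data was extracted from
    }
]
```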
## JSON Schema for Structured Extraction
Both the Python SDK and HTTP API support JSON Schema for structured extraction. If you are using the Python SDK, you can pass in a Python dictionary or a JSON Schema encoded as a string.
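For example, the same schema can be passed either way. This sketch assumes the `StructuredExtractionOptions` object described below.

```python
import json

from tensorlake.documentai import StructuredExtractionOptions

contact_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name"],
}

# Pass the schema as a Python dict...
as_dict = StructuredExtractionOptions(schema_name="contact", json_schema=contact_schema)

# ...or as a JSON-encoded string.
as_string = StructuredExtractionOptions(
    schema_name="contact", json_schema=json.dumps(contact_schema)
)
```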
## Pydantic Models for Structured Extraction

Pydantic models are supported only in the Python SDK. In many cases, we transform the Pydantic model to make sure it is compatible with our LLM.
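As a sketch, a Pydantic model can stand in for a hand-written schema. Whether the SDK accepts the model class directly may vary by version; converting it explicitly with Pydantic v2's `model_json_schema()` works in either case.

```python
from pydantic import BaseModel, Field
from tensorlake.documentai import StructuredExtractionOptions

class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice identifier")
    total_amount: float = Field(description="The grand total on the invoice")

# Convert the model to a JSON Schema dict; the SDK may also accept
# the model class directly.
options = StructuredExtractionOptions(
    schema_name="invoice",
    json_schema=Invoice.model_json_schema(),
)
```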
## All Structured Extraction Options

The `structured_extraction_options` parameter is a list of objects, where each object contains:

Parameter | Description | Optional | Default Value |
---|---|---|---|
`schema_name` | The name of the schema to use for structured data extraction. This is used as the key in the `structured_data` field of the response. | No | - |
`json_schema` | The JSON Schema that defines the structure of the data to be extracted from the document. It must be a valid JSON Schema object and can describe tables, forms, or other structured content. | No | - |
`partition_strategy` | The strategy for partitioning the document for structured data extraction: `none`, `page`, or `fragment`. With `page`, structured data is extracted from every page of the document; with `fragment`, from every fragment. This is useful for documents with multiple sections or tables. | Yes | `none` |
`page_classes` | An array of page class names that limits structured data extraction to specific page types, such as signature pages or form pages. If not specified, structured data is extracted from all pages of the document. | Yes | - |
`skip_ocr` | A boolean flag to skip OCR during structured data extraction. Useful for documents that are already machine-readable, such as PDFs with embedded text. If set to `true`, the API will not perform OCR and will extract structured data only from the text already present in the document. | Yes | `false` |
`prompt` | A custom prompt that provides additional context or instructions to the AI model for structured data extraction. Useful for documents with complex structures or specific extraction requirements. If not specified, the default prompt is used. | Yes | - |
`model_provider` | The LLM that performs structured extraction. Currently supported: `tensorlake` (proprietary model trained specifically for structured data extraction), `gpt_4o_mini` (OpenAI), and `sonnet` (Anthropic). | Yes | `tensorlake` |
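Putting the options together, a request object combining several of the parameters above might look like this sketch. The parameter names follow the table; the schema and the page class name are illustrative.

```python
from tensorlake.documentai import StructuredExtractionOptions

line_items_schema = {
    "type": "object",
    "properties": {
        "line_items": {"type": "array", "items": {"type": "object"}},
    },
}

options = StructuredExtractionOptions(
    schema_name="line_items",
    json_schema=line_items_schema,
    partition_strategy="page",            # extract from every page
    page_classes=["invoice_page"],        # hypothetical page class name
    skip_ocr=True,                        # the PDF already has embedded text
    prompt="Extract every line item, including discounts.",
    model_provider="tensorlake",          # or "gpt_4o_mini", "sonnet"
)
```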
## Partitioning the Document
You can extract structured data from the whole document at once, or from every page of the document. Each object in the `structured_extraction_options` parameter can specify how the document should be partitioned for structured data extraction. For this, use the `partition_strategy` parameter of the structured extraction request object; the available strategies are listed below, followed by a short sketch.
Not to be confused with the `chunking_strategy` parameter in the `parse_options` property, which controls how the document is chunked for markdown generation.

- `none` (default): Extract structured data from the whole document at once.
- `page`: Extract structured data from every page of the document.
- `fragment`: Extract structured data from every fragment of the document.
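As a sketch, the whole-document and per-page strategies differ only in the `partition_strategy` value (the schemas here are illustrative):

```python
from tensorlake.documentai import StructuredExtractionOptions

summary_schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
table_schema = {
    "type": "object",
    "properties": {"rows": {"type": "array", "items": {"type": "object"}}},
}

whole_document = StructuredExtractionOptions(
    schema_name="summary",
    json_schema=summary_schema,
    # partition_strategy defaults to "none": one extraction over the whole document
)

per_page = StructuredExtractionOptions(
    schema_name="page_tables",
    json_schema=table_schema,
    partition_strategy="page",  # one extraction per page of the document
)
```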
## Field-Level Citations
You can ask Tensorlake to return per-field citations for structured outputs. When enabled, each extracted field includes citation data pointing to where that value came from: page numbers and, when available, bounding boxes you can use to highlight the source region in your UI. Enable this by setting `provide_citations: True` on each `StructuredExtractionOptions` object.
Citations add a small latency and payload overhead. We recommend enabling them for review flows, compliance use cases, and any UI where you highlight “where this came from.”
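For example (the schema is illustrative; `provide_citations` is the documented flag):

```python
from tensorlake.documentai import StructuredExtractionOptions

invoice_schema = {"type": "object", "properties": {"total_amount": {"type": "number"}}}

options = StructuredExtractionOptions(
    schema_name="invoice",
    json_schema=invoice_schema,
    provide_citations=True,  # per-field page numbers and, when available, bounding boxes
)
```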
## Working with Citations
- Render highlights: Use `page_number` + `bbox` to draw overlays in a document viewer so reviewers can verify values quickly (see the sketch after this list).
- Store provenance: Persist citation anchors alongside your extracted JSON so downstream systems can trace and audit how a value was produced.
- Disambiguate fields: If a field appears multiple times (e.g., “Total” on multiple pages), citations help confirm which instance was used.
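As a sketch of the highlight-rendering idea: the anchor names `page_number` and `bbox` come from the list above, but how citations are attached to each extracted field in the response is an assumption here, so adapt the traversal to the actual payload.

```python
# Sketch: turn per-field citation anchors into highlight rectangles
# for a document viewer. The mapping of field name -> anchor dict is
# an assumed shape; `page_number` and `bbox` are from the docs above.
def collect_highlights(citations: dict) -> list[dict]:
    highlights = []
    for field_name, anchor in citations.items():
        bbox = anchor.get("bbox")
        if bbox is not None:  # bbox may be unavailable for some fields
            highlights.append({
                "field": field_name,
                "page": anchor["page_number"],
                "rect": bbox,  # e.g. [x1, y1, x2, y2] in page coordinates
            })
    return highlights
```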
## Filtering By Page Classes
You can specify a subset of pages to extract structured data from by using the `page_classes` parameter in each structured data extraction request object.
The top-level `page_range` parameter will limit all parsing, classification, and data extraction capabilities to only those pages.
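For example, to extract a signature schema only from pages classified as signature pages (the page class name is hypothetical and depends on your page classification setup):

```python
from tensorlake.documentai import StructuredExtractionOptions

signature_schema = {
    "type": "object",
    "properties": {
        "signer_name": {"type": "string"},
        "date_signed": {"type": "string"},
    },
}

options = StructuredExtractionOptions(
    schema_name="signatures",
    json_schema=signature_schema,
    page_classes=["signature_page"],  # hypothetical class name
)
```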