Structured Data Extraction
Tensorlake can extract structured data from documents, letting you pull specific fields out of a document. Key features of structured extraction:
- No limits on the number of fields you can extract.
- Extraction is guided by a JSON Schema you provide (or Pydantic models with the Python SDK).
- Structured data can be extracted along with markdown representation of the document in a single API call, without having to parse the document twice.
- You can submit multiple schemas in a single API call, and the model will extract data from the document according to each schema.
Structured Extraction Request
The same Parse endpoint is used for structured extraction. You specify the schema in the `structured_extraction_options` parameter of the parse endpoint.
The `structured_extraction_options` parameter is an array of objects, where each object contains a schema name and the JSON Schema to use for structured extraction.
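As a sketch, a request payload with two schemas might look like the following. The top-level field names (`file_id`, `schema_name`, `json_schema`) are illustrative assumptions; check the API reference for exact spelling.

```python
import json

# Hypothetical parse-request payload. Each entry in
# structured_extraction_options pairs a schema name with a JSON Schema
# describing the fields to extract.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

payload = {
    "file_id": "file_abc123",  # placeholder document reference
    "structured_extraction_options": [
        {
            "schema_name": "invoice",
            "json_schema": invoice_schema,
        },
        # Multiple schemas can be submitted in a single call:
        {
            "schema_name": "vendor",
            "json_schema": {
                "type": "object",
                "properties": {"vendor_name": {"type": "string"}},
            },
        },
    ],
}

print(json.dumps(payload, indent=2))
```

Because each object carries its own schema name, the results for each schema can be told apart in the response.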
Response
Structured data extracted from the document is returned in the `outputs` field of the parse response, and in the `structured_data` field of the Get Parse Job endpoint response.
The `structured_data` field is a JSON object where each key is a schema name you provided in the `structured_extraction_options` parameter, making it easy to access the structured data for each schema.
It includes the extracted data and the pages from which the data was extracted.
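A minimal sketch of reading the result, assuming the response has already been decoded into a dict. The inner field names (`data`, `pages`) are assumptions for illustration; only `structured_data` and keying by schema name come from the description above.

```python
# Simulated Get Parse Job response body, decoded from JSON.
result = {
    "structured_data": {
        "invoice": {
            "data": {"invoice_number": "INV-42", "total_amount": 199.0},
            "pages": [1],  # pages the data was extracted from
        }
    }
}

# Access results by the schema name supplied in the request.
invoice = result["structured_data"]["invoice"]
print(invoice["data"]["invoice_number"], invoice["pages"])
```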
Chunking
You can extract structured data from the whole document at once, or from every page of the document.
Specify the `chunking_strategy` parameter in the `structured_extraction_options` object to control how the document is chunked for structured data extraction.
Not to be confused with the `chunking_strategy` parameter in the `parse_options` property, which controls how the document is chunked for markdown generation.
- `none` (default) - Extract structured data from the whole document at once.
- `page` - Extract structured data from every page of the document.
- `fragment` - Extract structured data from every fragment of the document. This is useful for documents with multiple sections or tables.
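A sketch of a request object with an explicit chunking strategy. The `schema_name` and `json_schema` field names follow the pattern above; the exact spelling should be confirmed against the API reference.

```python
# Structured extraction request object with per-page chunking.
line_item_options = {
    "schema_name": "line_items",
    "json_schema": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"},
        },
    },
    # "none" (default): whole document, "page": per page, "fragment": per fragment
    "chunking_strategy": "page",
}

print(line_item_options["chunking_strategy"])
```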
Extracting from a Subset of Pages
By default, structured data extraction is performed on all pages of the document.
You can restrict extraction to a subset of pages with the `page_classes` parameter in each structured extraction request object.
The top-level `page_range` parameter limits all parsing, classification, and data extraction to only those pages.
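To sketch the difference: `page_classes` scopes one extraction request, while the top-level `page_range` scopes the whole parse. The class label and the `"1-3"` range syntax below are assumptions for illustration.

```python
extraction_request = {
    "schema_name": "totals",
    "json_schema": {
        "type": "object",
        "properties": {"grand_total": {"type": "number"}},
    },
    # Only extract from pages classified with these labels
    # (label name is illustrative).
    "page_classes": ["invoice_summary"],
}

payload = {
    # Top-level: limits ALL parsing, classification, and extraction.
    # The "1-3" range syntax is an assumption -- check the API reference.
    "page_range": "1-3",
    "structured_extraction_options": [extraction_request],
}

print(payload["page_range"], extraction_request["page_classes"])
```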
Model
Structured extraction is performed using an LLM. The following models are currently supported:
- `tensorlake` - Our own model, trained specifically for structured data extraction.
- `gpt-4o-mini` - If you want to use OpenAI's models for structured extraction.
- `claude-3-5-sonnet-20240620` - If you want to use Anthropic's models for structured extraction.
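A sketch of selecting a model per extraction request. The `model` field name is an assumption (the source does not name the parameter); only the three model identifiers come from the list above.

```python
# Model identifiers documented above.
SUPPORTED_MODELS = {"tensorlake", "gpt-4o-mini", "claude-3-5-sonnet-20240620"}

options = {
    "schema_name": "invoice",
    "json_schema": {
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
    },
    "model": "tensorlake",  # assumed field name for model selection
}

print(options["model"])
```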
Tips
Skip OCR
Sometimes document parsing doesn't work well on certain documents, which leads to poor structured data extraction. If you only care about structured data extraction, we recommend skipping the OCR step; extraction will then use a vision language model trained to extract JSON directly from document images.
Try this if you are seeing poor accuracy in structured data extraction.
Describe the Fields
Adding descriptions to the fields in the schema always improves the accuracy of the structured data extraction. Help the model understand the context of the fields you are extracting, and if possible mention what text or visual cues to look for in the document for each field.
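For example, a schema with per-field descriptions might look like the following. The field names and cue text are illustrative; the point is that every property carries a `description` telling the model what to look for.

```python
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier, usually printed near the top "
                           "right of the first page, often prefixed with 'INV-'.",
        },
        "total_amount": {
            "type": "number",
            "description": "Grand total after tax, in the bottom row of the "
                           "charges table.",
        },
    },
}

# Every property carries a description with textual or visual cues.
print(schema["properties"]["invoice_number"]["description"])
```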
Don’t compute new data in the schema
We don't recommend making the LLM derive new information while performing structured extraction. For example, if you ask the model to sum up all the rows in a table and return the result in a new field, the model will likely hallucinate.
We recommend doing this in your application code in a downstream task.
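The advice above can be sketched as follows: extract the raw rows with the schema, then compute derived values in application code, where the arithmetic is deterministic. The row data here is made up for illustration.

```python
# Rows as they might come back from structured extraction.
extracted_rows = [
    {"description": "Widget", "amount": 120.0},
    {"description": "Gadget", "amount": 79.0},
]

# Compute the total downstream instead of asking the model for it.
total = sum(row["amount"] for row in extracted_rows)
print(total)  # 199.0
```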