Structured Data Extraction
Tensorlake can extract structured data from documents, letting you pull specific fields out of a document. Key features of structured extraction:
- No limits on the number of fields you can extract.
- Extraction is guided by a JSON Schema you provide (or Pydantic models with the Python SDK).
- Structured data can be extracted along with markdown representation of the document in a single API call, without having to parse the document twice.
- You can submit multiple schemas in a single API call, and the model will extract data from the document according to each schema.
Structured Extraction Request
The same Parse endpoint is used for structured extraction. You specify the schema in the `structured_extraction_options` parameter of the parse endpoint.
The `structured_extraction_options` parameter is an array of objects, where each object contains a schema name and the JSON Schema to use for structured extraction.
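As a sketch, a request payload with two schemas might look like the following. The top-level field names (`file_id`, `schema_name`, `json_schema`) are illustrative assumptions; check the API reference for exact spelling.

```python
import json

# Hypothetical parse-request payload. Each entry in
# structured_extraction_options pairs a schema name with a JSON Schema
# describing the fields to extract.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

payload = {
    "file_id": "file_abc123",  # placeholder document reference
    "structured_extraction_options": [
        {
            "schema_name": "invoice",
            "json_schema": invoice_schema,
        },
        # Multiple schemas can be submitted in a single call:
        {
            "schema_name": "vendor",
            "json_schema": {
                "type": "object",
                "properties": {"vendor_name": {"type": "string"}},
            },
        },
    ],
}

print(json.dumps(payload, indent=2))
```

Because each object carries its own schema name, the results for each schema can be told apart in the response.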
Response
Structured data extracted from the document is returned in the `outputs` field of the parse response, and in the `structured_data` field of the Get Parse Job endpoint response.
The `structured_data` field is a JSON object where each key is a schema name you provided in the `structured_extraction_options` parameter, making it easy to access the structured data for each schema.
It includes the extracted data and the pages from which the data was extracted.
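A minimal sketch of reading the result, assuming the response has already been decoded into a dict. The inner field names (`data`, `pages`) are assumptions for illustration; only `structured_data` and keying by schema name come from the description above.

```python
# Simulated Get Parse Job response body, decoded from JSON.
result = {
    "structured_data": {
        "invoice": {
            "data": {"invoice_number": "INV-42", "total_amount": 199.0},
            "pages": [1],  # pages the data was extracted from
        }
    }
}

# Access results by the schema name supplied in the request.
invoice = result["structured_data"]["invoice"]
print(invoice["data"]["invoice_number"], invoice["pages"])
```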
Chunking
You can extract structured data from the whole document at once, or from every page of the document.
Specify the `chunking_strategy` parameter in the `structured_extraction_options` object to control how the document is chunked for structured data extraction.
Not to be confused with the `chunking_strategy` parameter in the `parse_options` property, which controls how the document is chunked for markdown generation.
- `none` (default) - Extract structured data from the whole document at once.
- `page` - Extract structured data from every page of the document.
- `fragment` - Extract structured data from every fragment of the document. This is useful for documents with multiple sections or tables.
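A sketch of a request object with an explicit chunking strategy. The `schema_name` and `json_schema` field names follow the pattern above; the exact spelling should be confirmed against the API reference.

```python
# Structured extraction request object with per-page chunking.
line_item_options = {
    "schema_name": "line_items",
    "json_schema": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"},
        },
    },
    # "none" (default): whole document, "page": per page, "fragment": per fragment
    "chunking_strategy": "page",
}

print(line_item_options["chunking_strategy"])
```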
Extracting from a Subset of Pages
By default, structured data extraction is performed on all pages of the document.
You can restrict extraction to a subset of pages with the `page_classes` parameter in each structured extraction request object.
The top-level `page_range` parameter limits all parsing, classification, and data extraction to only those pages.
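To sketch the difference: `page_classes` scopes one extraction request, while the top-level `page_range` scopes the whole parse. The class label and the `"1-3"` range syntax below are assumptions for illustration.

```python
extraction_request = {
    "schema_name": "totals",
    "json_schema": {
        "type": "object",
        "properties": {"grand_total": {"type": "number"}},
    },
    # Only extract from pages classified with these labels
    # (label name is illustrative).
    "page_classes": ["invoice_summary"],
}

payload = {
    # Top-level: limits ALL parsing, classification, and extraction.
    # The "1-3" range syntax is an assumption -- check the API reference.
    "page_range": "1-3",
    "structured_extraction_options": [extraction_request],
}

print(payload["page_range"], extraction_request["page_classes"])
```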
Model
Structured extraction is performed using an LLM. The following models are currently supported:
- `tensorlake` - Our own model, trained specifically for structured data extraction.
- `gpt-4o-mini` - If you want to use OpenAI's models for structured extraction.
- `claude-3-5-sonnet-20240620` - If you want to use Anthropic's models for structured extraction.
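A sketch of selecting a model per extraction request. The `model` field name is an assumption (the source does not name the parameter); only the three model identifiers come from the list above.

```python
# Model identifiers documented above.
SUPPORTED_MODELS = {"tensorlake", "gpt-4o-mini", "claude-3-5-sonnet-20240620"}

options = {
    "schema_name": "invoice",
    "json_schema": {
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
    },
    "model": "tensorlake",  # assumed field name for model selection
}

print(options["model"])
```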
Tips
Skip OCR
Sometimes document parsing doesn't work well on certain documents, which leads to poor structured data extraction. If you only care about structured data extraction, we recommend skipping the OCR step; extraction will then use a vision language model trained to extract JSON directly from document images.
Try this if you are seeing poor accuracy in structured data extraction.
Describe the Fields
Adding descriptions to the fields in the schema always improves the accuracy of the structured data extraction. Help the model understand the context of the fields you are extracting, and if possible mention what text or visual cues to look for in the document for each field.
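For example, a schema with per-field descriptions might look like the following. The field names and cue text are illustrative; the point is that every property carries a `description` telling the model what to look for.

```python
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier, usually printed near the top "
                           "right of the first page, often prefixed with 'INV-'.",
        },
        "total_amount": {
            "type": "number",
            "description": "Grand total after tax, in the bottom row of the "
                           "charges table.",
        },
    },
}

# Every property carries a description with textual or visual cues.
print(schema["properties"]["invoice_number"]["description"])
```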
Don’t compute new data in the schema
We don't recommend making the LLM derive new information while performing structured extraction. For example, if you ask the model to sum up all the rows in a table and return the result in a new field, the model will likely hallucinate.
We recommend doing this in your application code in a downstream task.
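The advice above can be sketched as follows: extract the raw rows with the schema, then compute derived values in application code, where the arithmetic is deterministic. The row data here is made up for illustration.

```python
# Rows as they might come back from structured extraction.
extracted_rows = [
    {"description": "Widget", "amount": 120.0},
    {"description": "Gadget", "amount": 79.0},
]

# Compute the total downstream instead of asking the model for it.
total = sum(row["amount"] for row in extracted_rows)
print(total)  # 199.0
```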