Tensorlake can extract structured data from documents, letting you pull specific fields out of any document. Key features of structured extraction:
  • No limits on the number of fields you can extract.
  • Extraction is guided by JSON Schema you provide (or Pydantic models with the Python SDK).
  • You can submit multiple schemas in a single API call.

Structured Extraction Request

Structured outputs can be generated by specifying one or more JSON Schemas in the structured_extraction_options parameter of the parse endpoint.
from pydantic import BaseModel, Field

from tensorlake.documentai import (
  DocumentAI,
  StructuredExtractionOptions,
)

doc_ai = DocumentAI(api_key="YOUR_API_KEY")

file_id = "file_XXX"  # Replace with your file ID or URL

class DriverLicense(BaseModel):
    first_name: str = Field(description="Name next to FN")
    last_name: str = Field(description="Name next to LN")
    id: str = Field(description="ID number")
    address: str = Field(description="Address of the ID holder")
    dob: str = Field(description="Date of birth of the ID holder")


driver_license_extraction = StructuredExtractionOptions(
    schema_name="DriverLicense", json_schema=DriverLicense
)

parse_id = doc_ai.extract(
    file_id=file_id, structured_extraction_options=[driver_license_extraction]
)

parsed_result = doc_ai.wait_for_completion(parse_id=parse_id)
The structured_extraction_options parameter is an array of objects, where each object contains the schema name and the JSON Schema to use for structured extraction.
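Because it is an array, you can submit multiple schemas in a single API call and get back one structured_data entry per schema. A minimal sketch (the InsurancePolicy model is a hypothetical second schema used purely for illustration):
from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

doc_ai = DocumentAI(api_key="YOUR_API_KEY")

class DriverLicense(BaseModel):
    id: str = Field(description="ID number")

class InsurancePolicy(BaseModel):
    # Hypothetical second schema, for illustration only
    policy_number: str = Field(description="Policy number printed on the card")

parse_id = doc_ai.extract(
    file_id="file_XXX",  # Replace with your file ID or URL
    structured_extraction_options=[
        StructuredExtractionOptions(schema_name="DriverLicense", json_schema=DriverLicense),
        StructuredExtractionOptions(schema_name="InsurancePolicy", json_schema=InsurancePolicy),
    ],
)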

Structured Extraction Response

Structured data extracted from the document is returned in the structured_data field of the Get Parse Job endpoint response. The structured_data field is an array of objects; each object contains the extracted data, the page numbers the data was extracted from, and the schema name used for extraction.
{
  // ... other fields ...
  "structured_data": [
    {
      "data": {
        "first_name": "John",
        "last_name": "Doe",
        "id": "D1234567",
        "address": "123 Main St, Springfield, IL 62701",
        "dob": "1990-01-01"
      },
      "page_numbers": [1, 2, 3],
      "schema_name": "DriverLicense",
    }
  ]
}
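In the Python SDK, the result returned by wait_for_completion exposes the same data. A sketch, assuming the result object mirrors the response shape above (verify the exact attribute names against the SDK reference):
parsed_result = doc_ai.wait_for_completion(parse_id=parse_id)

# One entry per schema submitted in structured_extraction_options (assumed attributes)
for item in parsed_result.structured_data:
    print(item.schema_name, item.page_numbers)
    print(item.data)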

JSON Schema for Structured Extraction

Both the Python SDK and the HTTP API support JSON Schema for structured extraction. With the Python SDK, you can pass a Python dictionary or a JSON Schema encoded as a string.

import json

from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

doc_ai = DocumentAI(api_key="YOUR_API_KEY")
file_id = "file_XXX"  # Replace with your file ID or URL

schema = {
  "type": "object",
  "properties": {
    "first_name": { "type": "string", "description": "Name next to FN" },
    "last_name": { "type": "string", "description": "Name next to LN" },
  }
}

# or schema = json.dumps(schema)

driver_license_extraction = StructuredExtractionOptions(
    schema_name="DriverLicense", json_schema=schema
)

parse_id = doc_ai.extract(
    file_id=file_id, structured_extraction_options=[driver_license_extraction]
)

parsed_result = doc_ai.wait_for_completion(parse_id=parse_id)
The HTTP API accepts JSON Schemas as well. Make sure the schema is a valid JSON object, not a JSON-encoded string.
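A minimal sketch of the equivalent HTTP request, using the parse endpoint shown later on this page:
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/parse \
  --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
  --header 'Content-Type: application/json' \
  --data '{
  "file_id": "file_XXX",
  "structured_extraction_options": [
    {
      "schema_name": "DriverLicense",
      "json_schema": {
        "type": "object",
        "properties": {
          "first_name": { "type": "string", "description": "Name next to FN" },
          "last_name": { "type": "string", "description": "Name next to LN" }
        }
      }
    }
  ]
}'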

Pydantic Models for Structured Extraction

Pydantic models are supported only in the Python SDK. In many cases the SDK transforms the Pydantic model to make sure it is compatible with our LLM.
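If you want to inspect the schema your model produces before submitting it, Pydantic can generate it directly. This sketch uses Pydantic's own model_json_schema(); any transformations the SDK applies on top of this are internal:
import json
from pydantic import BaseModel, Field

class DriverLicense(BaseModel):
    first_name: str = Field(description="Name next to FN")
    last_name: str = Field(description="Name next to LN")

# The JSON Schema Pydantic generates for this model
print(json.dumps(DriverLicense.model_json_schema(), indent=2))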

All Structured Extraction Options

The structured_extraction_options parameter is a list of objects, where each object contains:
  • schema_name (required) - The name of the schema to use for structured data extraction. This is used as the key in the structured_data field of the response.
  • json_schema (required) - The JSON Schema that defines the structure of the data to extract. It must be a valid JSON Schema object, and can describe tables, forms, or other structured content.
  • partition_strategy (optional, default: none) - How the document is partitioned for structured data extraction: none, page, or fragment. With page, structured data is extracted from every page of the document; with fragment, from every fragment. This is useful for documents with multiple sections or tables.
  • page_classes (optional) - An array of page class names that limits structured data extraction to specific page types, such as signature pages or form pages. If not specified, structured data is extracted from all pages of the document.
  • skip_ocr (optional, default: false) - Skip OCR processing for structured data extraction. This is useful for documents that are already machine-readable, such as PDFs with embedded text. If set to true, the API does not perform OCR and extracts structured data from the text already present in the document.
  • prompt (optional) - A custom prompt providing additional context or instructions to the AI model. If not specified, the default prompt is used. This is useful for documents with complex structures or specific extraction requirements.
  • model_provider (optional, default: tensorlake) - The LLM that performs structured extraction. Supported models: tensorlake (proprietary model trained specifically for structured data extraction), gpt_4o_mini (OpenAI), and sonnet (Anthropic).
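A sketch combining several of these options in the Python SDK, assuming the SDK accepts keyword arguments that mirror the HTTP parameter names above:
from tensorlake.documentai import StructuredExtractionOptions

extraction = StructuredExtractionOptions(
    schema_name="DriverLicense",
    json_schema={"type": "object", "properties": {"id": {"type": "string"}}},
    partition_strategy="page",     # assumed string form; extract once per page
    page_classes=["front_of_dl"],  # only pages classified as front_of_dl
    skip_ocr=True,                 # document is already machine-readable
    prompt="Extract fields exactly as printed on the license.",
)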

Partitioning the Document

You can extract structured data from the whole document at once, or from every page of the document. Each object in the structured_extraction_options parameter can specify how the document should be partitioned for structured data extraction via its partition_strategy parameter, as shown in the sketch after this list.
Not to be confused with the chunking_strategy parameter in the parse_options property, which controls how the document is chunked for markdown generation.
  • none (Default) - Extract structured data from the whole document at once.
  • page - Extract structured data from every page of the document.
  • fragment - Extract structured data from every fragment of the document.
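For example, to run the extraction once per page rather than once for the whole document (a sketch; partition_strategy is shown as a plain string, an assumption about the form the SDK accepts):
from tensorlake.documentai import StructuredExtractionOptions

per_page_extraction = StructuredExtractionOptions(
    schema_name="PageFields",  # hypothetical schema name
    json_schema={"type": "object", "properties": {"title": {"type": "string"}}},
    partition_strategy="page",  # one extraction per page
)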

Field‑Level Citations

You can ask Tensorlake to return per-field citations for structured outputs. When enabled, each extracted field is accompanied by provenance data: the page number the value came from and, when available, a bounding box you can use to highlight the source region in your UI. Enable this by setting provide_citations=True on each StructuredExtractionOptions object.
Citations add a small latency and payload overhead. We recommend enabling them for review flows, compliance use cases, and any UI where you highlight “where this came from.”

Working with Citations

  • Render highlights: Use page_number + bbox to draw overlays in a document viewer so reviewers can verify values quickly.
  • Store provenance: Persist citation anchors alongside your extracted JSON so downstream systems can trace and audit how a value was produced.
  • Disambiguate fields: If a field appears multiple times (e.g., “Total” on multiple pages), citations help confirm which instance was used.
from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

doc_ai = DocumentAI(api_key="YOUR_API_KEY")
file_id = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/dl_pen.jpeg" 

class DriverLicense(BaseModel):
    first_name: str = Field(description="Name next to FN")
    last_name: str = Field(description="Name next to LN")
    id: str = Field(description="ID number")
    address: str = Field(description="Address of the ID holder")
    dob: str = Field(description="Date of birth of the ID holder")

driver_license_extraction = StructuredExtractionOptions(
    schema_name="DriverLicense",
    json_schema=DriverLicense,
    provide_citations=True,
)

parse_id = doc_ai.extract(
    file_id=file_id, structured_extraction_options=[driver_license_extraction]
)
result = doc_ai.wait_for_completion(parse_id=parse_id)
Running this against the sample driver's license image used above would yield results like these:
[
  {
    "data": {
      "address": "123 MAIN STREET APT. 1 HARRISBURG, PA 17101-0000",
      "address_citation": [
        {
          "page_number": 1,
          "x1": 337,
          "x2": 714,
          "y1": 144,
          "y2": 371
        }
      ],
      "dob": "01/07/1973",
      "dob_citation": [
        {
          "page_number": 1,
          "x1": 337,
          "x2": 714,
          "y1": 144,
          "y2": 371
        }
      ],
      "first_name": "ANDREW",
      "first_name_citation": [
        {
          "page_number": 1,
          "x1": 337,
          "x2": 714,
          "y1": 144,
          "y2": 371
        }
      ],
      "id": "99 999 999",
      "id_citation": [
        {
          "page_number": 1,
          "x1": 337,
          "x2": 714,
          "y1": 144,
          "y2": 371
        }
      ],
      "last_name": "SAMPLE",
      "last_name_citation": [
        {
          "page_number": 1,
          "x1": 337,
          "x2": 714,
          "y1": 144,
          "y2": 371
        }
      ]
    },
    "page_numbers": [
      1
    ],
    "schema_name": "summary"
  }
]
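To render a highlight from a citation, draw its bounding box over the corresponding page image. A minimal sketch using Pillow, assuming the citation coordinates are pixel coordinates on your rendered page image (verify against your own page renders):
from PIL import Image, ImageDraw

# One citation entry from the response above
citation = {"page_number": 1, "x1": 337, "y1": 144, "x2": 714, "y2": 371}

page = Image.open("page_1.png").convert("RGB")  # your rendered page image
draw = ImageDraw.Draw(page)
draw.rectangle(
    (citation["x1"], citation["y1"], citation["x2"], citation["y2"]),
    outline="red",
    width=3,
)
page.save("page_1_highlighted.png")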

Filtering By Page Classes

You can specify a subset of pages to extract structured data from by using the page_classes parameter in each structured data extraction request object.
The top-level page_range will limit all parsing, classification, and data extraction capabilities to only those pages.
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/parse \
  --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
  --header 'Content-Type: application/json' \
  --data '{
  "page_range": "1-3",
  "file_id": "file_XXX",
  "page_classifications": [
    {
      "name": "front_of_dl",
      "description": "Pages that have a photo of a person."
    },
    {
      "name": "back_of_dl",
      "description": "Pages that have a barcode."
    }
  ],
  "structured_extraction_options": [
    {
      "schema_name": "DriverLicense",
      "page_classes": [ "front_of_dl" ],
      "json_schema": {
        "title": "DriverLicense",
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Name of the ID holder" },
          "age": { "type": "integer", "description": "Age of the ID holder" },
          "address": { "type": "string", "description": "Address of the ID holder" },
          "dob": { "type": "string", "description": "Date of birth of the ID holder" }
        }
      }
    }
  ]
}'
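The same request from Python using the requests library; the payload mirrors the curl call above (replace file_XXX with your file ID):
import os
import requests

payload = {
    "page_range": "1-3",
    "file_id": "file_XXX",  # Replace with your file ID
    "page_classifications": [
        {"name": "front_of_dl", "description": "Pages that have a photo of a person."},
        {"name": "back_of_dl", "description": "Pages that have a barcode."},
    ],
    "structured_extraction_options": [
        {
            "schema_name": "DriverLicense",
            "page_classes": ["front_of_dl"],
            "json_schema": {
                "title": "DriverLicense",
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Name of the ID holder"}
                },
            },
        }
    ],
}

response = requests.post(
    "https://api.tensorlake.ai/documents/v2/parse",
    headers={"Authorization": f"Bearer {os.environ['TENSORLAKE_API_KEY']}"},
    json=payload,
)
print(response.json())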

Tips

Skip OCR

Sometimes document parsing doesn't work well on a particular document, which leads to poor structured data extraction. If you only care about structured data extraction, we recommend skipping the OCR step: extraction then uses a Vision Language Model trained to extract JSON directly from document images. Try this if you are seeing poor accuracy in structured data extraction.
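A sketch of enabling this with the skip_ocr option, assuming the SDK accepts it as a keyword argument mirroring the HTTP parameter:
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

doc_ai = DocumentAI(api_key="YOUR_API_KEY")

extraction = StructuredExtractionOptions(
    schema_name="DriverLicense",
    json_schema={"type": "object", "properties": {"id": {"type": "string"}}},
    skip_ocr=True,  # bypass OCR; extract JSON directly from the document image
)

parse_id = doc_ai.extract(
    file_id="file_XXX", structured_extraction_options=[extraction]
)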

Describe the Fields

Adding descriptions to the fields in the schema consistently improves the accuracy of structured data extraction. Help the model understand the context of each field you are extracting, and where possible mention the text or visual cues to look for in the document.
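For example, compare a bare field with one that names the label the model should look for (the Invoice model is hypothetical, for illustration only):
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    # Weak: gives the model no guidance
    vendor: str
    # Better: names the label and location to look for
    total: str = Field(
        description="Amount next to the 'Total Due' label at the bottom of the invoice"
    )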

Don’t compute new data in the schema

We don't recommend making the LLM derive new information while performing structured extraction. For example, if you ask the model to sum all the rows in a table and return the result in a new field, it will likely hallucinate. Do this in your application code in a downstream task instead.
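For instance, extract the raw rows and compute the total in your own code (the LineItem and Invoice models and their fields are hypothetical):
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str = Field(description="Line item description")
    amount: float = Field(description="Line item amount in dollars")

class Invoice(BaseModel):
    line_items: list[LineItem] = Field(description="All rows of the line items table")

# Compute derived values in application code, not in the extraction schema
def total_amount(invoice: Invoice) -> float:
    return sum(item.amount for item in invoice.line_items)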