# Structured Data Extraction

> Extract structured fields from documents using one or more JSON Schemas — no field limits, and multiple schemas in a single API call.

Tensorlake extracts structured data from documents, letting you pull specific fields into a shape you define. Key features
of structured extraction:

* No limits on the number of fields you can extract.
* Extraction is guided by JSON Schema you provide (or Pydantic models with the Python SDK).
* You can submit multiple schemas in a single API call.

<Tip>
  Try this out using this [Colab Notebook](https://tlake.link/parse-bank-statements).
</Tip>

## Structured Extraction Request

Generate structured output from a document by specifying one or more JSON Schemas in the `structured_extraction_options` parameter of the `parse` endpoint.

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel, Field

  from tensorlake.documentai import (
    DocumentAI,
    StructuredExtractionOptions,
  )

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")

  file_id = "file_XXX"  # Replace with your file ID or URL

  class DriverLicense(BaseModel):
      first_name: str = Field(description="Name next to FN")
      last_name: str = Field(description="Name next to LN")
      id: str = Field(description="ID number")
      address: str = Field(description="Address of the ID holder")
      dob: str = Field(description="Date of birth of the ID holder")


  driver_license_extraction = StructuredExtractionOptions(
      schema_name="DriverLicense", json_schema=DriverLicense
  )

  parse_id = doc_ai.extract(
      file_id=file_id, structured_extraction_options=[driver_license_extraction]
  )

  parsed_result = doc_ai.wait_for_completion(parse_id=parse_id)
  ```

  ```bash curl theme={null}
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/parse \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "file_url": "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/california_id.jpg",
      "structured_extraction_options": [
        {
          "schema_name": "DriverLicense",
          "json_schema": {
            "type": "object",
            "properties": {
              "first_name": { "type": "string", "description": "Name next to FN" },
              "last_name": { "type": "string", "description": "Name next to LN" },
              "id": { "type": "string", "description": "ID number" },
              "address": { "type": "string", "description": "Address of the ID holder" },
              "dob": { "type": "string", "description": "Date of birth of the ID holder" }
            }
          }
        }
      ]
    }'
  ```

  ```javascript Node.js theme={null}
  async function parseFile(fileUrl, tensorlakeApiKey) {
    const driversSchema = {
      title: "DriverLicense",
      type: "object",
      properties: {
        first_name: { type: "string", description: "Name next to FN" },
        last_name: { type: "string", description: "Name next to LN" },
        id: { type: "string", description: "ID number" },
        address: { type: "string", description: "Address of the ID holder" },
        dob: { type: "string", description: "Date of birth of the ID holder" },
      },
    };

    const driversExtractionOptions = {
      schema_name: "DriverLicense",
      json_schema: driversSchema,
    };

    const body = {
      file_url: fileUrl,
      structured_extraction_options: [driversExtractionOptions],
    };

    const options = {
      method: "POST",
      headers: {
        Authorization: `Bearer ${tensorlakeApiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    };

    const response = await fetch(
      "https://api.tensorlake.ai/documents/v2/parse",
      options
    );

    const result = await response.json();
    console.log("result:", result);
    return result.jobId;
  }

  const fileId =
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/california_id.jpg";
  const tensorlakeApiKey = "your-tensorlake-api-key";

  const jobId = await parseFile(fileId, tensorlakeApiKey);
  ```
</CodeGroup>

The `structured_extraction_options` parameter is an array of objects, where each object contains the schema name and the JSON Schema
to use for structured extraction.
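
Because the parameter is an array, several schemas can travel in one request. The sketch below builds such a request body with plain Python dictionaries; the `VehicleRegistration` schema, the `build_parse_body` helper, and the example URL are hypothetical and only illustrate the shape described above.

```python
import json

# Two independent JSON Schemas; the field names are illustrative only.
driver_license_schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string", "description": "Name next to FN"},
        "last_name": {"type": "string", "description": "Name next to LN"},
    },
}

vehicle_registration_schema = {  # hypothetical second schema
    "type": "object",
    "properties": {
        "plate_number": {"type": "string", "description": "License plate number"},
    },
}


def build_parse_body(file_url, schemas):
    """Build a request body with one extraction option per named schema."""
    return {
        "file_url": file_url,
        "structured_extraction_options": [
            {"schema_name": name, "json_schema": schema}
            for name, schema in schemas.items()
        ],
    }


body = build_parse_body(
    "https://example.com/document.pdf",  # placeholder URL
    {
        "DriverLicense": driver_license_schema,
        "VehicleRegistration": vehicle_registration_schema,
    },
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the `--data` payload of the curl request shown earlier.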

## Structured Extraction Response

Structured data extracted from the document is returned in the `structured_data` field of the [Get Parse Job](/api-reference/v2/parse/get)
endpoint response.

The `structured_data` field is an array of objects. Each object contains the extracted data, the page numbers the data was extracted from,
and the name of the schema used for extraction.

<CodeGroup>
  ```json JSON theme={null}
  {
    // ... other fields ...
    "structured_data": [
      {
        "data": {
          "first_name": "John",
          "last_name": "Doe",
          "id": "D1234567",
          "address": "123 Main St, Springfield, IL 62701",
          "dob": "1990-01-01"
        },
        "page_numbers": [1, 2, 3],
        "schema_name": "DriverLicense"
      }
    ]
  }
  ```
</CodeGroup>
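
When a request carries several schemas, it can help to group the response entries by `schema_name`. The following is a minimal sketch that works on an already-decoded response dictionary shaped like the example above; `group_by_schema` is a hypothetical helper, and it keeps a list per schema on the assumption that one schema may yield more than one entry (for example, with `page` partitioning).

```python
def group_by_schema(response):
    """Map each schema name to the list of entries extracted with it."""
    grouped = {}
    for entry in response.get("structured_data", []):
        grouped.setdefault(entry["schema_name"], []).append(entry)
    return grouped


# Sample response, trimmed to the fields used here.
response = {
    "structured_data": [
        {
            "data": {"first_name": "John", "last_name": "Doe"},
            "page_numbers": [1, 2, 3],
            "schema_name": "DriverLicense",
        }
    ]
}

grouped = group_by_schema(response)
for entry in grouped["DriverLicense"]:
    print(entry["page_numbers"], entry["data"]["first_name"])
```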

## JSON Schema for Structured Extraction

Both the Python SDK and the HTTP API support JSON Schema for structured extraction. If you are using the Python SDK,
you can pass in a Python dictionary or a JSON Schema encoded as a string.

<CodeGroup>
  ```python Python theme={null}

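  from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")
  file_id = "file_XXX"  # Replace with your file ID or URL

  schema = {
    "type": "object",
    "properties": {
      "first_name": { "type": "string", "description": "Name next to FN" },
      "last_name": { "type": "string", "description": "Name next to LN" },
    }
  }

  # Alternatively, pass the schema as a JSON string:
  # schema = json.dumps(schema)  # requires `import json`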

  driver_license_extraction = StructuredExtractionOptions(
      schema_name="DriverLicense", json_schema=schema
  )

  parse_id = doc_ai.extract(
      file_id=file_id, structured_extraction_options=[driver_license_extraction]
  )

  parsed_result = doc_ai.wait_for_completion(parse_id=parse_id)
  ```
</CodeGroup>

The HTTP API accepts a JSON Schema as well. Make sure the schema is sent as a JSON **object**, not encoded as a JSON string.
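
If your schemas are stored as text (for example, loaded from a file), one way to avoid sending a string by accident is to decode them before building the request body. This is a sketch under that assumption; `as_schema_object` is a hypothetical helper, not part of the SDK.

```python
import json


def as_schema_object(schema):
    """Return the schema as a dict, decoding it first if it is a JSON string."""
    if isinstance(schema, str):
        schema = json.loads(schema)
    if not isinstance(schema, dict):
        raise TypeError("json_schema must decode to a JSON object")
    return schema


schema_text = '{"type": "object", "properties": {"first_name": {"type": "string"}}}'

option = {
    "schema_name": "DriverLicense",
    "json_schema": as_schema_object(schema_text),  # embedded as an object
}
payload = json.dumps(
    {
        "file_url": "https://example.com/id.jpg",  # placeholder URL
        "structured_extraction_options": [option],
    }
)
print(payload)
```

Serializing the whole body once at the end keeps `json_schema` nested as an object rather than as an escaped string.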

## Pydantic Models for Structured Extraction

Pydantic models are supported only in the Python SDK. In many cases we transform the Pydantic model to make sure
the resulting schema is compatible with our LLM.

## All Structured Extraction Options

The Structured Extraction Options parameter is a list of objects, where each object contains:

| Parameter            | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Optional | Default Value |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | ------------- |
| `schema_name`        | The name of the schema to use for structured data extraction. It is returned as the `schema_name` of the matching entry in the `structured_data` field of the response. | No       | -             |
| `json_schema`        | The JSON Schema to use for structured data extraction. This schema will define the structure of the data to be extracted from the document. It should be a valid JSON Schema object. The schema can be used to extract structured data from the document, such as tables, forms, or other structured content.                                                                                                                                                                                                                       | No       | -             |
| `partition_strategy` | The strategy for partitioning the document before structured data extraction: `none`, `page`, `section`, `fragment`, or a pattern-based strategy. With `page`, structured data is extracted from every page of the document; with `fragment`, from every fragment. This is useful for documents with multiple sections or tables. | Yes      | `none`        |
| `page_classes`       | An array of page class names to limit the structured data extraction to specific page types. This is useful for documents where structured data is only present on certain pages, such as signature pages or form pages. If not specified, structured data will be extracted from all pages of the document.                                                                                                                                                                                                                        | Yes      | -             |
| `skip_ocr`           | A boolean flag to skip OCR processing for the structured data extraction. This is useful for documents that are already in a machine-readable format, such as PDFs with embedded text. If set to `true`, the API will not perform OCR on the document and will only extract structured data from the text present in the document.                                                                                                                                                                                                  | Yes      | `false`       |
| `prompt`             | A custom prompt to use for structured data extraction. This can be used to provide additional context or instructions to the AI model for extracting structured data from the document. If not specified, the default prompt will be used. This is useful for documents with complex structures or specific extraction requirements.                                                                                                                                                                                                | Yes      | -             |
| `model_provider`     | The LLM used for structured extraction. Supported values: `tensorlake` (proprietary model trained specifically for structured data extraction), `gpt_4o_mini` (OpenAI), `sonnet` (Anthropic), and `gemini3` (Google Gemini 3). | Yes      | `tensorlake`  |

## Partitioning the Document

You can extract structured data from the whole document at once, from every page, or from other partitions of the document.

Each object in the `structured_extraction_options` parameter can specify how the document is partitioned for structured data extraction.
Set the `partition_strategy` parameter on the structured extraction request object, alongside `schema_name` and `json_schema`.

<Note>
  Not to be confused with the `chunking_strategy` parameter in the
  `parse_options` property, which controls how the document is chunked for
  markdown generation.
</Note>

* `none` (*Default*) - Extract structured data from the whole document at once.
* `page` - Extract structured data from every page of the document.
* `section` - Extract structured data from each section of the document.
* `fragment` - Extract structured data from every fragment of the document.
* `patterns` (`PatternPartitionStrategy`) - Extract structured data from each block delimited by the start and end patterns you specify.

### Pattern-Based Partitioning

Pattern-based partitioning uses regex patterns to partition documents for structured extraction, regardless of page boundaries or document layout. This is ideal when target data is consistently marked by text patterns but appears in different locations across similar documents.
Instead of processing entire documents or fixed page ranges, you define start and end patterns that isolate extraction zones. For example, extract financial data between "Property Summary" and "Total Property" sections, or contract clauses between "Section 4.2" and "Section 4.3" headers.

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel, Field
  from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")
  file_id = "file_XXX"  # Replace with your file ID or URL

  class FinancialSummary(BaseModel):
      property_value: str = Field(description="Total property value")
      assessment_date: str = Field(description="Date of assessment")

  extraction_options = StructuredExtractionOptions(
      schema_name="FinancialSummary",
      json_schema=FinancialSummary,
      partition_strategy={
          "strategy": "patterns",
          "patterns": {
              # Raw strings keep \b and \s as regex escapes, not Python string escapes
              "start_patterns": [r"\bProperty\s+Summary\b"],
              "end_patterns": [r"\bTotal\s+Property\b"],
          },
      },
  )

  parse_id = doc_ai.extract(
      file_id=file_id,
      structured_extraction_options=[extraction_options],
  )
  ```

  ```bash curl theme={null}
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/parse \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "file_url": "https://example.com/financial-report.pdf",
      "structured_extraction_options": [
        {
          "schema_name": "FinancialSummary",
          "partition_strategy": {
            "strategy": "patterns",
            "patterns": {
              "start_patterns": ["\\bProperty\\s+Summary\\b"],
              "end_patterns": ["\\bTotal\\s+Property\\b"]
            }
          },
          "json_schema": {
            "type": "object",
            "properties": {
              "property_value": { "type": "string", "description": "Total property value" },
              "assessment_date": { "type": "string", "description": "Date of assessment" }
            }
          }
        }
      ]
    }'
  ```
</CodeGroup>

Pattern Configuration:

* `start_patterns`: Array of regex patterns that mark the beginning of extraction zones
* `end_patterns`: Array of regex patterns that mark the end of extraction zones
* Use both to extract data between markers, or use only `start_patterns` to extract from marker to document end
* Patterns are case-sensitive regular expressions; use `\b` for word boundaries and `\s+` for flexible whitespace matching. In JSON, escape each backslash (`\\b`, `\\s+`); in Python, use raw strings.

This approach eliminates brittle page-based extraction and focuses on content structure, making your extraction pipeline resilient to document layout variations.
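
Before submitting a job, you may want to check your patterns against text you already have. The sketch below uses Python's standard `re` module to preview the zones a start/end pair would isolate. It is a local approximation only: `preview_zones` is a hypothetical helper, and details such as whether the end marker is included in the zone are assumptions, not a description of Tensorlake's implementation.

```python
import re


def preview_zones(text, start_patterns, end_patterns=()):
    """Return the spans from each start match through the next end match.

    With no end patterns, the first zone runs to the end of the text.
    """
    start_re = re.compile("|".join(f"(?:{p})" for p in start_patterns))
    end_re = (
        re.compile("|".join(f"(?:{p})" for p in end_patterns))
        if end_patterns
        else None
    )
    zones = []
    pos = 0
    while True:
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end()) if end_re else None
        stop = end.end() if end else len(text)
        zones.append(text[start.start():stop])
        if end is None:
            break  # the zone already reached the end of the text
        pos = stop
    return zones


sample = (
    "Cover page\n"
    "Property Summary\n"
    "Lot 12, value 450,000\n"
    "Total Property 450,000\n"
    "Appendix\n"
)

zones = preview_zones(
    sample,
    [r"\bProperty\s+Summary\b"],
    [r"\bTotal\s+Property\b"],
)
print(zones)
```

If the preview returns no zones, or zones that run to the end of the text, the patterns probably need adjusting before you use them in a request.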

## Field‑Level Citations

You can ask Tensorlake to return per-field citations for structured outputs. When enabled, each extracted field
includes citations pointing to where its value came from: the page number and, when available, a bounding box
you can use to highlight the source region in your UI.

<Info>
  [Google Colab Notebook](https://tlake.link/notebooks/citations)
</Info>

Enable this by setting `provide_citations: True` on each `StructuredExtractionOptions` object.

<Note> Citations add a small latency and payload overhead. We recommend enabling them for review flows, compliance
use cases, and any UI where you highlight “where this came from.” </Note>

### Working with Citations

* **Render highlights**: Use `page_number` and the bounding-box coordinates (`x1`, `y1`, `x2`, `y2`) to draw overlays in a document viewer so reviewers can verify values quickly.
* **Store provenance**: Persist citation anchors alongside your extracted JSON so downstream systems can trace and audit how a value was produced.
* **Disambiguate fields**: If a field appears multiple times (e.g., “Total” on multiple pages), citations help confirm which instance was used.

<CodeGroup>
  ```python Python theme={null}
  from pydantic import BaseModel, Field
  from tensorlake.documentai import DocumentAI, StructuredExtractionOptions

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")
  file_id = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/dl_pen.jpeg" 

  class DriverLicense(BaseModel):
      first_name: str = Field(description="Name next to FN")
      last_name: str = Field(description="Name next to LN")
      id: str = Field(description="ID number")
      address: str = Field(description="Address of the ID holder")
      dob: str = Field(description="Date of birth of the ID holder")

  driver_license_extraction = StructuredExtractionOptions(
      schema_name="DriverLicense",
      json_schema=DriverLicense,
      provide_citations=True,
  )

  parse_id = doc_ai.extract(
      file_id=file_id, structured_extraction_options=[driver_license_extraction]
  )
  result = doc_ai.wait_for_completion(parse_id=parse_id)
  ```

  ```bash curl  theme={null}
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/parse \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "file_url": "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/dl_pen.jpeg",
      "structured_extraction_options": [
        {
          "schema_name": "DriverLicense",
          "provide_citations": true,
          "json_schema": {
            "type": "object",
            "properties": {
              "first_name": { "type": "string", "description": "Name next to FN" },
              "last_name": { "type": "string", "description": "Name next to LN" },
              "id": { "type": "string", "description": "ID number" },
              "address": { "type": "string", "description": "Address of the ID holder" },
              "dob": { "type": "string", "description": "Date of birth of the ID holder" }
            }
          }
        }
      ]
    }'
  ```
</CodeGroup>

Running this on [this driver's license](https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/dl_pen.jpeg), for example, would yield these results:

<CodeGroup>
  ```json JSON theme={null}
  [
    {
      "data": {
        "address": "123 MAIN STREET APT. 1 HARRISBURG, PA 17101-0000",
        "address_citation": [
          {
            "page_number": 1,
            "x1": 337,
            "x2": 714,
            "y1": 144,
            "y2": 371
          }
        ],
        "dob": "01/07/1973",
        "dob_citation": [
          {
            "page_number": 1,
            "x1": 337,
            "x2": 714,
            "y1": 144,
            "y2": 371
          }
        ],
        "first_name": "ANDREW",
        "first_name_citation": [
          {
            "page_number": 1,
            "x1": 337,
            "x2": 714,
            "y1": 144,
            "y2": 371
          }
        ],
        "id": "99 999 999",
        "id_citation": [
          {
            "page_number": 1,
            "x1": 337,
            "x2": 714,
            "y1": 144,
            "y2": 371
          }
        ],
        "last_name": "SAMPLE",
        "last_name_citation": [
          {
            "page_number": 1,
            "x1": 337,
            "x2": 714,
            "y1": 144,
            "y2": 371
          }
        ]
      },
      "page_numbers": [
        1
      ],
      "schema_name": "DriverLicense"
    }
  ]
  ```
</CodeGroup>
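
In the response above, each field's citations sit next to it under a `<field>_citation` key. The following sketch pairs values with their citations and converts the corner coordinates into a rectangle for drawing highlights. `fields_with_citations` and `to_rect` are hypothetical helpers, and the coordinate units and origin are not stated here, so treat the rectangle math as an assumption to verify against your viewer.

```python
CITATION_SUFFIX = "_citation"


def fields_with_citations(data):
    """Pair each extracted value with its list of citation boxes."""
    paired = {}
    for key, value in data.items():
        if key.endswith(CITATION_SUFFIX):
            continue  # citation entries are attached to their field below
        paired[key] = {
            "value": value,
            "citations": data.get(key + CITATION_SUFFIX, []),
        }
    return paired


def to_rect(citation):
    """Convert corner coordinates to (page, x, y, width, height)."""
    return (
        citation["page_number"],
        citation["x1"],
        citation["y1"],
        citation["x2"] - citation["x1"],
        citation["y2"] - citation["y1"],
    )


# Trimmed copy of the "data" object from the response above.
data = {
    "dob": "01/07/1973",
    "dob_citation": [
        {"page_number": 1, "x1": 337, "x2": 714, "y1": 144, "y2": 371}
    ],
    "first_name": "ANDREW",
}

paired = fields_with_citations(data)
for name, info in paired.items():
    rects = [to_rect(c) for c in info["citations"]]
    print(name, info["value"], rects)
```

Note that a schema field whose own name ends in `_citation` would be skipped by this helper.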

## Filtering By Page Classes

You can specify a subset of pages to extract structured data from by using the `page_classes` parameter in each structured data extraction
request object.

<Note>
  The top-level `page_range` will limit all parsing, classification, and data extraction capabilities to only those pages.
</Note>

<CodeGroup>
  ```bash curl theme={null}
  # Replace file_XXX with your file ID
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/parse \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "page_range": "1-3",
      "file_id": "file_XXX",
      "page_classifications": [
        {
          "name": "front_of_dl",
          "description": "Pages that have a photo of a person."
        },
        {
          "name": "back_of_dl",
          "description": "Pages that have a barcode."
        }
      ],
      "structured_extraction_options": [
        {
          "schema_name": "DriverLicense",
          "page_classes": [ "front_of_dl" ],
          "json_schema": {
            "title": "DriverLicense",
            "type": "object",
            "properties": {
              "name": { "type": "string", "description": "Name of the ID holder" },
              "age": { "type": "integer", "description": "Age of the ID holder" },
              "address": { "type": "string", "description": "Address of the ID holder" },
              "dob": { "type": "string", "description": "Date of birth of the ID holder" }
            }
          }
        }
      ]
    }'
  ```

  ```python Python theme={null}
  from pydantic import BaseModel, Field

  from tensorlake.documentai import (
    DocumentAI,
    PageClassConfig,
    StructuredExtractionOptions,
  )

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")

  file_id = "tensorlake-XXX" # Replace with your file ID or URL

  page_classifications = [
      PageClassConfig(
          name="front_of_dl",
          description="Pages that have a photo of a person."
      ),
      PageClassConfig(
          name="back_of_dl",
          description="Pages that have a barcode."
      ),
  ]

  class DriverLicense(BaseModel):
    first_name: str = Field(description="Name next to FN")
    last_name: str = Field(description="Name next to LN")
    id: str = Field(description="ID number")
    address: str = Field(description="Address of the ID holder")
    dob: str = Field(description="Date of birth of the ID holder")

  driver_license_extraction = StructuredExtractionOptions(
    schema_name="DriverLicense", 
    json_schema=DriverLicense,
    page_classes=["front_of_dl"]
  )

  parse_id = doc_ai.parse(
    file=file_id, 
    page_range="1-3,9-10",
    page_classifications=page_classifications,
    structured_extraction_options=[driver_license_extraction])
  ```
</CodeGroup>
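
A misspelled page class name is easy to miss. As a sketch, the check below inspects a request body before it is sent and reports any `page_classes` entry that has no matching name in `page_classifications`. `undeclared_page_classes` is a hypothetical helper, and it assumes the two lists are meant to share names; how the API treats an undeclared class is not covered here.

```python
def undeclared_page_classes(body):
    """Return {schema_name: [page class names missing from page_classifications]}."""
    declared = {c["name"] for c in body.get("page_classifications", [])}
    problems = {}
    for option in body.get("structured_extraction_options", []):
        unknown = [n for n in option.get("page_classes", []) if n not in declared]
        if unknown:
            problems[option["schema_name"]] = unknown
    return problems


body = {
    "file_id": "file_XXX",
    "page_classifications": [
        {"name": "front_of_dl", "description": "Pages that have a photo of a person."},
        {"name": "back_of_dl", "description": "Pages that have a barcode."},
    ],
    "structured_extraction_options": [
        {
            "schema_name": "DriverLicense",
            # "signature_page" is deliberately undeclared to show the check
            "page_classes": ["front_of_dl", "signature_page"],
            "json_schema": {"type": "object", "properties": {}},
        }
    ],
}

problems = undeclared_page_classes(body)
print(problems)
```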

## Advanced: Reusing OCR Output for Structured Extraction

If you're iterating on an extraction schema, you don't need to re-run OCR every time. The `read` and `extract` APIs are independent steps — `extract` operates on text, not the original PDF. By uploading your Markdown output as a file once, you get a `file_id` you can pass to any number of `extract` calls.

<CodeGroup>
  ```python Python theme={null}
  from tensorlake.documentai import (
      DocumentAI,
      ParsingOptions,
      StructuredExtractionOptions,
  )
  from pydantic import BaseModel, Field

  doc_ai = DocumentAI(api_key="YOUR_API_KEY")

  # Step 1: Run OCR once
  parse_id = doc_ai.read(file_id="file_XXX")
  result = doc_ai.wait_for_completion(parse_id=parse_id)

  # Step 2: Save the Markdown output and upload it
  markdown = "\n\n".join(chunk.content for chunk in result.chunks)
  with open("output.md", "w") as f:
      f.write(markdown)

  markdown_file_id = doc_ai.upload(path="output.md")

  # Step 3: Iterate on your schema — no OCR cost on subsequent runs
  class Invoice(BaseModel):
      vendor_name: str = Field(description="Name of the vendor")
      total_amount: str = Field(description="Total amount due")

  extract_id = doc_ai.extract(
      file_id=markdown_file_id,
      structured_extraction_options=[
          StructuredExtractionOptions(
              schema_name="Invoice",
              json_schema=Invoice,
          )
      ],
  )
  extraction_result = doc_ai.wait_for_completion(parse_id=extract_id)
  ```

  ```bash curl theme={null}
  # Step 1: Run OCR once and save the markdown from the response

  # Step 2: Upload the Markdown output as a reusable file
  curl -X POST https://api.tensorlake.ai/documents/v2/files \
    -H "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    -F "file=@output.md"
  # Returns: { "file_id": "file_XXX" }

  # Step 3: Run extraction against the Markdown file
  curl --request POST \
    --url https://api.tensorlake.ai/documents/v2/extract \
    --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
    --header 'Content-Type: application/json' \
    --data '{
      "file_id": "file_XXX",
      "structured_extraction_options": [
        {
          "schema_name": "Invoice",
          "json_schema": {
            "type": "object",
            "properties": {
              "vendor_name": { "type": "string", "description": "Name of the vendor" },
              "total_amount": { "type": "string", "description": "Total amount due" }
            }
          }
        }
      ]
    }'
  ```
</CodeGroup>

<Note>
  When extracting from a Markdown file, `page`-based partitioning is not available since page boundaries are not preserved in the text. You can still use the default `none` strategy (whole document) or [pattern-based partitioning](#pattern-based-partitioning).
</Note>

## Tips

#### Skip OCR

Sometimes document parsing doesn't work well on certain documents, which can lead to poor structured data extraction.
If you only care about structured data extraction, we recommend skipping the OCR step by setting `skip_ocr` to `true`.
Extraction then uses a Vision Language Model trained to extract JSON from document images.

Try this if you are seeing poor accuracy in structured data extraction.

#### Describe the Fields

Adding descriptions to the fields in the schema generally improves the accuracy of structured data extraction.
Help the model understand the context of the fields you are extracting, and if possible mention what text
or visual cues to look for in the document for each field.

#### Don't compute new data in the schema

We don't recommend making the LLM derive new information during structured extraction. For example, if you ask the
model to sum all the rows in a table and return the result in a new field, the model will likely hallucinate.

We recommend doing this in your application code in a downstream task.
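
As a minimal sketch of that downstream step, suppose your schema extracts each table row as a line item with an `amount` string (a hypothetical schema; the field names are assumptions). The total can then be computed exactly in application code:

```python
from decimal import Decimal


def line_item_total(line_items):
    """Sum extracted amounts exactly, ignoring dollar signs and thousands separators."""
    total = Decimal("0")
    for item in line_items:
        cleaned = item["amount"].replace("$", "").replace(",", "").strip()
        total += Decimal(cleaned)
    return total


# Stand-in for the "data" object returned by structured extraction.
extracted = {
    "line_items": [
        {"description": "Widgets", "amount": "$1,200.50"},
        {"description": "Shipping", "amount": "$35.25"},
    ]
}

total = line_item_total(extracted["line_items"])
print(total)  # 1235.75
```

Using `Decimal` instead of `float` avoids rounding drift when adding currency values.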
