Tensorlake can extract structured data from documents, pulling out the specific fields you need. Key features of structured extraction:

  • No limits on the number of fields you can extract.
  • Extraction is guided by a JSON Schema you provide (or Pydantic models with the Python SDK).
  • Structured data can be extracted along with the markdown representation of the document in a single API call, so the document does not have to be parsed twice.

Structured Extraction Request

The same parse endpoint is used for structured extraction. Specify the schema in the extraction_options parameter of the parse endpoint.

curl --request POST \
  --url https://api.tensorlake.ai/documents/v1/parse \
  --header 'Authorization: Bearer tl_apiKey_XXXX' \
  --header 'Content-Type: application/json' \
  --data '{
    "file": "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/california_id.jpg",
    "settings": {
      "modelProvider": "tensorlake",
      "jsonSchema": {
        "title": "DriverLicense",
        "type":  "object",
        "properties": {
          "name": { "type": "string",  "description": "user name" },
          "age":  { "type": "integer" }
        }
      }
    }
  }'
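
For readers who prefer Python to curl, the request above can be sketched with the standard library alone. This is only an illustration: the helper name build_parse_request is hypothetical, the URL, headers, and body fields are copied from the curl example, and the request is built but not sent.

```python
import json
import urllib.request

# Endpoint URL copied from the curl example above.
PARSE_URL = "https://api.tensorlake.ai/documents/v1/parse"


def build_parse_request(api_key: str, file_ref: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example.

    build_parse_request is a hypothetical helper, not part of any SDK.
    """
    body = {
        "file": file_ref,
        "settings": {"modelProvider": "tensorlake", "jsonSchema": schema},
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same schema as in the curl example.
driver_license_schema = {
    "title": "DriverLicense",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "user name"},
        "age": {"type": "integer"},
    },
}

request = build_parse_request(
    "tl_apiKey_XXXX",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/california_id.jpg",
    driver_license_schema,
)
# To actually send it, pass `request` to urllib.request.urlopen(...).
```

Building the body as a Python dict and serializing it with json.dumps avoids the quoting mistakes that are easy to make when hand-writing JSON inside a shell string.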

Attribute: jsonSchema

The JSON schema guides the structured extraction of data from the document. It is defined as a JSON object: each entry under properties names a field to extract, gives its type, and can carry a description.

Response

Structured data extracted from the document is returned in the outputs field of the response.

The outputs field is a JSON object with the following attribute relevant to structured extraction:

  • structured_data - The structured data extracted from the document.

It contains a list of structured data objects, extracted from every page of the document.
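
Reading the result in application code might look like the sketch below. The helper name get_structured_data is hypothetical, and the sample response is made up for illustration: only the outputs and structured_data names come from the description above, while the shape of each object inside the list is an assumption based on the DriverLicense schema.

```python
def get_structured_data(response: dict) -> list:
    """Return the list stored under outputs.structured_data.

    Falls back to an empty list if either key is missing, so callers
    can always iterate over the result.
    """
    outputs = response.get("outputs") or {}
    return outputs.get("structured_data") or []


# Made-up response for illustration only; the inner object shape is an assumption.
sample_response = {
    "outputs": {
        "structured_data": [
            {"name": "Jane Doe", "age": 30},
        ]
    }
}

for record in get_structured_data(sample_response):
    print(record)
```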

Chunking

You can extract structured data from the whole document at once, or from each page separately.

Set the chunk_strategy parameter of the parse endpoint to control this.

  • none (default) - Extract structured data from the whole document at once.
  • page - Extract structured data from every page of the document.

Extracting from a Subset of Pages

You can specify a page range to extract structured data from specific pages of the document.

  • pages (optional) - The page range to extract structured data from, e.g. "pages": "1-3". Defaults to all pages.

curl --request POST \
  --url https://api.tensorlake.ai/documents/v1/parse \
  --header 'Authorization: Bearer tl_apiKey_XXXX' \
  --header 'Content-Type: application/json' \
  --data '{
    "pages": "1-3",
    "settings": {
      "modelProvider": "tensorlake",
      "jsonSchema": {
        "title": "DriverLicense",
        "type":  "object",
        "properties": {
          "name": { "type": "string",  "description": "user name" },
          "age":  { "type": "integer" }
        }
      }
    },
    "file": "tensorlake-XXX"
  }'
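
The same request body can be assembled in Python, as in this minimal sketch. The helpers page_range and build_body are hypothetical names. The sketch assumes pages are numbered from 1 (as the "1-3" example suggests) and only produces ranges in the "first-last" form shown above; other formats are not covered here.

```python
from typing import Optional


def page_range(first: int, last: int) -> str:
    """Format a page range in the "first-last" form used by the example.

    Assumes 1-based page numbers, which is inferred from the "1-3" example.
    """
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {first}-{last}")
    return f"{first}-{last}"


def build_body(file_ref: str, schema: dict, pages: Optional[str] = None) -> dict:
    """Build the JSON body from the curl example; omit "pages" to use all pages."""
    body = {
        "file": file_ref,
        "settings": {"modelProvider": "tensorlake", "jsonSchema": schema},
    }
    if pages is not None:
        body["pages"] = pages
    return body


schema = {
    "title": "DriverLicense",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "user name"},
        "age": {"type": "integer"},
    },
}

first_three_pages = build_body("tensorlake-XXX", schema, pages=page_range(1, 3))
all_pages = build_body("tensorlake-XXX", schema)
```

Leaving the pages key out entirely, rather than sending an empty value, relies on the documented default of extracting from all pages.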

Model

Structured extraction is performed by an LLM. The following models are currently supported:

  • tensorlake - Our own model, trained specifically for structured data extraction.
  • gpt-4o-mini - Use this if you want OpenAI’s models to perform structured extraction.
  • claude-3-5-sonnet-20240620 - Use this if you want Anthropic’s models to perform structured extraction.
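
A small guard in client code can catch a mistyped model name before the request is sent. The sketch below assumes the model is selected through the modelProvider field shown in the curl examples; that mapping is inferred from the "tensorlake" value used there, and build_settings is a hypothetical helper.

```python
# Model names copied from the list above.
SUPPORTED_MODEL_PROVIDERS = (
    "tensorlake",
    "gpt-4o-mini",
    "claude-3-5-sonnet-20240620",
)


def build_settings(schema: dict, model_provider: str = "tensorlake") -> dict:
    """Return a settings object, rejecting model names not listed above.

    Assumption: the model is chosen via the "modelProvider" field,
    as in the curl examples.
    """
    if model_provider not in SUPPORTED_MODEL_PROVIDERS:
        raise ValueError(
            f"unsupported model {model_provider!r}; "
            f"expected one of {', '.join(SUPPORTED_MODEL_PROVIDERS)}"
        )
    return {"modelProvider": model_provider, "jsonSchema": schema}


settings = build_settings({"title": "DriverLicense", "type": "object"}, "gpt-4o-mini")
```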

Tips

Skip OCR

Sometimes document parsing doesn’t work well on certain documents, which can lead to poor structured data extraction. If you only care about structured data extraction, we recommend skipping the OCR step. Extraction then uses a vision language model trained to extract JSON from document images.

Try this if you are seeing poor accuracy in structured data extraction.

Describe the Fields

Adding descriptions to the fields in the schema generally improves the accuracy of structured data extraction. Help the model understand the context of the fields you are extracting, and where possible mention which text or visual cues to look for in the document for each field.
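
One way to apply this tip is to check a schema for undescribed fields before sending it. The sketch below is illustrative: fields_missing_description is a hypothetical helper that only inspects top-level properties, and the description wording in the second schema is an example, not recommended text.

```python
def fields_missing_description(schema: dict) -> list:
    """List top-level properties whose description is absent or blank."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if not str(spec.get("description", "")).strip()
    ]


# The schema from the request example: "age" has no description.
original_schema = {
    "title": "DriverLicense",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "user name"},
        "age": {"type": "integer"},
    },
}

# Same fields with fuller descriptions (wording is illustrative only).
described_schema = {
    "title": "DriverLicense",
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Full name of the license holder as printed on the card",
        },
        "age": {
            "type": "integer",
            "description": "Age of the license holder, if printed on the document",
        },
    },
}

print(fields_missing_description(original_schema))
print(fields_missing_description(described_schema))
```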

Don’t compute new data in the schema

We don’t recommend making the LLM derive new information while performing structured extraction. For example, if you ask the model to sum all the rows in a table and return the total in a new field, the model will likely hallucinate.

Instead, do this kind of computation in your application code, as a downstream step.
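
As a sketch of that downstream step, the schema would ask only for the table rows as they appear in the document, and the total would be computed afterwards in Python. The line_items and amount field names and the sample values are hypothetical; they stand in for whatever your own schema extracts.

```python
from decimal import Decimal


def total_amount(line_items: list) -> Decimal:
    """Sum the extracted amounts in application code instead of asking the model to.

    Decimal(str(...)) accepts amounts extracted as strings or numbers and
    avoids binary floating point rounding on currency values.
    """
    return sum((Decimal(str(item["amount"])) for item in line_items), Decimal("0"))


# Made-up extraction result for illustration only.
extracted = {
    "line_items": [
        {"description": "Widget", "amount": "19.99"},
        {"description": "Shipping", "amount": "5.01"},
    ]
}

total = total_amount(extracted["line_items"])
print(total)
```

Keeping the arithmetic out of the schema means the model only copies values it can see, and any error in the total can be traced to a specific extracted row.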