The structured extraction API extracts structured data from documents. It’s well suited to automating data extraction from invoices, RFPs, tax and financial statements, and other structured documents.

Data relevant to the schema is extracted from every page of the document, and conflicts between pages are then resolved to produce a single output.

Supported file types: PDF, JPEG, PNG

Quick Start

1

Define a Schema

Define a schema for the document you want to extract data from. Schemas are defined as JSON Schema.

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "string"
    }
  },
  "required": [
    "name",
    "age"
  ]
}
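
As a rough illustration, the same schema can be built as a Python dictionary and serialized to JSON. The helper `missing_required` is a hypothetical name used only for this sketch and is not part of the API. It checks only that a record carries the keys listed under `required`; it is not a full JSON Schema validator.

```python
import json

# The schema from this step, written as a Python dictionary.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
    },
    "required": ["name", "age"],
}


def missing_required(record, schema):
    """Return the required keys absent from record (hypothetical helper)."""
    return [key for key in schema.get("required", []) if key not in record]


# Serialize the dictionary to the JSON text form shown above.
schema_json = json.dumps(schema, indent=2)

print(missing_required({"name": "Jane"}, schema))  # ['age']
```
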
2

Extract Data

Submit the document together with the schema. The API returns a job ID, which you use to retrieve the result of the extraction.

{
    "jobId":"job-XXX",
    "status":"PROCESSING"
}
3

Retrieve the Result

Use the job ID to get the result from the API. The response contains the extracted data.

{
    "jobId":"job-XXXX",
    "structuredDataset": [{"customer_name": "John Doe", "address": "123 Main St, Anytown, USA", "total_amount_due": "$1000"}]
}
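
Because the job starts in the PROCESSING state, a client typically has to poll until the result is ready. The sketch below shows one way to do that. It assumes a caller-supplied `fetch_job` function (hypothetical, standing in for whatever HTTP call retrieves the job) that returns the decoded JSON response as a dictionary. It also assumes that any response whose `status` is not `PROCESSING` is final; the real API may report other states, such as failures, that need separate handling.

```python
import time


def wait_for_result(fetch_job, job_id, interval_seconds=2.0, max_attempts=30):
    """Poll until the job is no longer PROCESSING, then return the response.

    fetch_job is a caller-supplied function (an assumption of this sketch)
    that takes a job ID and returns the decoded JSON response as a dict.
    """
    for attempt in range(max_attempts):
        response = fetch_job(job_id)
        if response.get("status") != "PROCESSING":
            return response
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"job {job_id} still processing after {max_attempts} attempts")


# Usage with canned responses in place of real HTTP calls.
responses = iter([
    {"jobId": "job-XXX", "status": "PROCESSING"},
    {"jobId": "job-XXX", "structuredDataset": [{"customer_name": "John Doe"}]},
])
result = wait_for_result(lambda job_id: next(responses), "job-XXX", interval_seconds=0)
print(result["structuredDataset"][0]["customer_name"])  # John Doe
```

Passing the fetch function in as an argument keeps the polling logic independent of any particular HTTP client or endpoint.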

JSON Schema

The schema you provide determines which fields are extracted from the document. It is written as JSON Schema.
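
For example, a schema for an invoice might look like the following sketch. The field names are taken from the sample result in the Quick Start; the `string` types and the choice of required fields are assumptions made for illustration.

```python
import json

# Field names come from the sample result in the Quick Start; the types
# and the "required" list are assumptions of this sketch.
invoice_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "address": {"type": "string"},
        "total_amount_due": {"type": "string"},
    },
    "required": ["customer_name", "total_amount_due"],
}

# Print the JSON text form of the schema.
print(json.dumps(invoice_schema, indent=2))
```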

Prompt

We use a proprietary prompt to extract data from the document. You can override the default by providing your own prompt.

Model Provider

The model provider determines which model is used to extract data from the document. If you specify the tensorlake provider, our proprietary model performs the extraction, and no data is sent to any third-party LLM provider.

You can use OpenAI or Anthropic’s models as well.
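
The three settings described above (schema, prompt, and model provider) can be thought of as one set of extraction options. The sketch below groups them in a small Python dataclass. The class name, the field names `json_schema`, `prompt`, and `model_provider`, and the payload shape are all hypothetical and chosen only for illustration; check the API reference for the actual parameter names. Defaulting the provider to `tensorlake` is likewise an assumption of this sketch.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ExtractionOptions:
    """Hypothetical container; field names are illustrative, not the API's."""
    json_schema: dict
    prompt: Optional[str] = None        # None means: keep the default prompt
    model_provider: str = "tensorlake"  # assumed default for this sketch

    def to_payload(self) -> dict:
        # Drop unset values so the default prompt is not overridden.
        return {k: v for k, v in asdict(self).items() if v is not None}


options = ExtractionOptions(
    json_schema={"type": "object", "properties": {"name": {"type": "string"}}}
)
print(json.dumps(options.to_payload(), indent=2))
```

Leaving `prompt` unset keeps it out of the payload entirely, which mirrors the idea that a custom prompt is only needed to override the default.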