POST /documents/v2/datasets
cURL
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/datasets \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "parsing_options": {
    "table_output_mode": "html",
    "table_parsing_format": "tsr",
    "chunking_strategy": "none",
    "signature_detection": false,
    "remove_strikethrough_lines": false,
    "skew_detection": false,
    "disable_layout_detection": false,
    "ignore_sections": [],
    "cross_page_header_detection": false,
    "ocr_model": "model01"
  },
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<any>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ],
      "provide_citations": true
    }
  ],
  "page_classifications": [
    {
      "name": "<string>",
      "description": "<string>"
    }
  ],
  "enrichment_options": {
    "table_summarization": false,
    "table_summarization_prompt": null,
    "figure_summarization": false,
    "figure_summarization_prompt": null,
    "include_full_page_image": false
  },
  "name": "invoices dataset",
  "description": "This dataset contains all invoices from 2023."
}'
Example response
{
  "name": "invoices dataset",
  "dataset_id": "dataset_12345",
  "created_at": "2023-10-01T12:00:00Z"
}
Create an ingestion workflow for structured extraction or document parsing. A dataset is a collection of settings that helps organize documents from the same domain and enables focused document intelligence. The dataset's name must be unique. Your data is not sent to a third-party service (OpenAI, Anthropic, etc.); we use our own models to parse the document. To read more about the configuration options, see the Parse Documents endpoint.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
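
As a minimal sketch of the header format described above, the helper below builds the two headers used in the cURL example. The function name `auth_headers` is hypothetical and not part of any Tensorlake SDK; the whitespace check is an extra safeguard assumed here, not a documented API rule.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization and Content-Type headers from the cURL example."""
    # Assumption: reject empty tokens and tokens with stray whitespace,
    # since either would produce a malformed "Bearer <token>" value.
    if not token or token.strip() != token:
        raise ValueError("token must be non-empty with no surrounding whitespace")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed to whichever HTTP client you use.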

Body

application/json

This object defines the request body for creating a new dataset.

A Dataset is a collection of parsed results from files.

It can be used to store and manage related data, such as invoices, receipts, or any other documents that need to be parsed and analyzed.

Once a dataset is created, you can use it to parse related files using the same configuration and options, allowing for consistent and efficient data extraction.

name
string
required

The name of the dataset.

The name can only contain alphanumeric characters, hyphens, and underscores.

The name must be unique within the organization and project context.

Example:

"invoices dataset"
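
The character rule above can be checked on the client before sending a request. The sketch below is one reading of that rule and assumes "alphanumeric" means ASCII letters and digits; `is_valid_dataset_name` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
import re

# Assumption: ASCII letters, digits, hyphens, and underscores only.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the whole name matches the stated character rule."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Under this reading, `invoices_2023` passes, while an empty name or a name containing a space does not. Uniqueness within the organization and project can only be confirmed by the API itself.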

parsing_options
object

The properties of this object define the configuration for the document parsing process.

Tensorlake provides sane defaults that work well for most documents, so this object is not required. However, every document is different, and you may want to customize the parsing process to better suit your needs.

structured_extraction_options
object[] | null

Each object in this array defines a configuration for structured data extraction.

If this array is present, the API will perform structured data extraction on the document.

page_classifications
object[] | null

The objects in this array define the configuration for page classification.

If this array is present, the API will perform page classification on the document.
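
The request example pairs `page_classifications` entries (each with a `name` and `description`) with a `page_classes` list inside each structured extraction option. Assuming, and this is not confirmed by this page, that `page_classes` refers to classification names, a quick pre-flight check might look like the sketch below. The helper `undefined_page_classes` is hypothetical.

```python
def undefined_page_classes(payload: dict) -> set[str]:
    """Return page_classes values that match no page_classifications name."""
    # Both fields are nullable arrays, so treat None like an empty list.
    defined = {c["name"] for c in payload.get("page_classifications") or []}
    referenced: set[str] = set()
    for option in payload.get("structured_extraction_options") or []:
        referenced.update(option.get("page_classes") or [])
    return referenced - defined
```

An empty result means every referenced class has a matching classification entry in the same request body.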

enrichment_options
object

The properties of this object help to extend the output of the document parsing process with additional information.

This includes summarization of tables and figures, which can help to provide a more comprehensive understanding of the document.

This object is not required, and the API will use default settings if it is not present.

description
string | null

A description of the dataset.

This field is optional and can be used to provide additional context about the dataset.

Example:

"This dataset contains all invoices from 2023."
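
Since only `name` is required, a request can omit every other field and rely on the defaults. The following is a minimal sketch using Python's standard library; it builds the request without sending it. The function name `build_create_dataset_request` is hypothetical, and the URL and headers are copied from the cURL example.

```python
import json
import urllib.request

API_URL = "https://api.tensorlake.ai/documents/v2/datasets"


def build_create_dataset_request(token, name, description=None):
    """Build (but do not send) a POST request with a minimal body."""
    payload = {"name": name}
    # description is optional, so include it only when provided.
    if description is not None:
        payload["description"] = description
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the JSON body of the response. Optional objects such as `parsing_options` or `enrichment_options` can be added to `payload` in the same way when the defaults do not fit.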

Response

Dataset created successfully

name
string
required

The human-readable name of the dataset provided during creation.

Example:

"invoices dataset"

dataset_id
string
required

The unique identifier for the dataset.

This identifier is used to refer to the dataset in API endpoints and operations.

This value is automatically generated and is unique within the organization and project context.

Example:

"dataset_12345"

created_at
string
required

The date and time when the dataset was created.

The date is in RFC 3339 format (e.g., "2023-10-01T12:00:00Z").

Example:

"2023-10-01T12:00:00Z"
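
The three response fields above can be read into a typed value, with `created_at` converted to a timezone-aware datetime. This is a sketch only: `DatasetCreated` and `parse_create_dataset_response` are hypothetical names, and `datetime.fromisoformat` covers the timestamp shape shown in the example rather than every form RFC 3339 permits.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetCreated:
    name: str
    dataset_id: str
    created_at: datetime


def parse_create_dataset_response(body: str) -> DatasetCreated:
    """Parse the JSON response body into a DatasetCreated value."""
    data = json.loads(body)
    missing = [k for k in ("name", "dataset_id", "created_at") if k not in data]
    if missing:
        raise ValueError(f"response is missing required fields: {missing}")
    raw = data["created_at"]
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    return DatasetCreated(data["name"], data["dataset_id"], datetime.fromisoformat(raw))
```

Keep the returned `dataset_id`, since it is the value used to refer to the dataset in other endpoints.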
