Create Dataset

cURL

curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/datasets \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "parsing_options": {
    "table_output_mode": "html",
    "table_parsing_format": "tsr",
    "chunking_strategy": "none",
    "signature_detection": false,
    "remove_strikethrough_lines": false,
    "skew_detection": false,
    "disable_layout_detection": false,
    "ignore_sections": [],
    "cross_page_header_detection": false
  },
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<any>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ]
    }
  ],
  "page_classifications": [
    {
      "name": "<string>",
      "description": "<string>"
    }
  ],
  "enrichment_options": {
    "table_summarization": false,
    "table_summarization_prompt": null,
    "figure_summarization": false,
    "figure_summarization_prompt": null
  },
  "name": "invoices dataset",
  "description": "This dataset contains all invoices from 2023."
}'

{
  "name": "invoices dataset",
  "dataset_id": "dataset_12345",
  "created_at": "2023-10-01T12:00:00Z"
}

POST /documents/v2/datasets

Create an ingestion workflow for structured extraction or document parsing. A dataset is a collection of settings that helps organize documents from the same domain and enables focused document intelligence. The dataset's name must be unique. Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.); documents are parsed with our own models. To read more about the configuration options, see the Parse Documents endpoint.

Authorizations

Authorization (string, header, required)

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
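
As an illustration of the header format described above, here is a minimal Python sketch that builds the two request headers used in the cURL sample. The helper names and the `TENSORLAKE_API_KEY` environment variable are assumptions made for this example, not names defined by the API.

```python
import os


def bearer_headers(token: str) -> dict[str, str]:
    """Return the headers for an authenticated JSON request."""
    token = token.strip()
    if not token:
        raise ValueError("auth token must not be empty")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


def headers_from_env(var: str = "TENSORLAKE_API_KEY") -> dict[str, str]:
    """Read the token from an environment variable.

    The variable name is hypothetical; use whatever your deployment defines.
    """
    return bearer_headers(os.environ.get(var, ""))
```

Reading the token from the environment keeps it out of source control; an empty or missing value raises `ValueError` before any request is attempted.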

Body

application/json

This object defines the request body for creating a new dataset.

A Dataset is a collection of parsed results from files.

It can be used to store and manage related data, such as invoices, receipts, or any other documents that need to be parsed and analyzed.

Once a dataset is created, you can use it to parse related files using the same configuration and options, allowing for consistent and efficient data extraction.
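
The request from the cURL sample could be assembled in Python roughly as follows, using only the standard library. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes that fields other than `name` and `description` can be left out of the body, which this page does not confirm.

```python
import json
import urllib.request

# Endpoint URL taken from the cURL sample on this page.
DATASETS_URL = "https://api.tensorlake.ai/documents/v2/datasets"


def build_create_dataset_request(token, name, description, parsing_options=None):
    """Build (but do not send) the POST request that creates a dataset."""
    payload = {"name": name, "description": description}
    # Assumption: optional sections such as parsing_options may be omitted.
    if parsing_options is not None:
        payload["parsing_options"] = parsing_options
    return urllib.request.Request(
        DATASETS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_dataset(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made, for example `build_create_dataset_request(token, "invoices dataset", "This dataset contains all invoices from 2023.")`.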

Response

200

application/json

Dataset created successfully

The response is of type object.
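
A client might turn the response object into a typed value. The sketch below assumes the response carries exactly the three fields shown in the sample response (`name`, `dataset_id`, `created_at`); the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CreatedDataset:
    name: str
    dataset_id: str
    created_at: datetime


def parse_create_dataset_response(raw: str) -> CreatedDataset:
    """Parse the JSON body of a successful create-dataset response."""
    data = json.loads(raw)
    # Replace a trailing "Z" with an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    created = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
    return CreatedDataset(
        name=data["name"],
        dataset_id=data["dataset_id"],
        created_at=created,
    )
```

The returned `dataset_id` is the value to keep for later calls that reference this dataset.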
