Create Dataset

Create an ingestion workflow for structured extraction or document parsing. A dataset is a collection of settings that helps organize documents from the same domain and enables focused document intelligence. The dataset’s name must be unique. Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.); documents are parsed with our own models. To read more about the configuration options, see the Parse Documents endpoint.

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
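As a minimal sketch of the header format described above, the helper below builds the request headers from a token. The function name `auth_headers` and the placeholder token are illustrative assumptions, not part of the API.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build headers for a JSON request authenticated with a bearer token."""
    # Reject empty tokens or tokens with stray whitespace before they reach the server.
    if not token or token != token.strip():
        raise ValueError("token must be a non-empty string without surrounding whitespace")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


# "YOUR_API_KEY" is a placeholder, not a real credential.
headers = auth_headers("YOUR_API_KEY")
```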

Body

application/json

This object defines the request body for creating a new dataset.

A Dataset is a collection of parsed results from files.

It can be used to store and manage related data, such as invoices, receipts, or any other documents that need to be parsed and analyzed.

Once a dataset is created, you can use it to parse related files using the same configuration and options, allowing for consistent and efficient data extraction.
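The sketch below shows one way to assemble the request body and attach the authorization header using only the Python standard library. The URL is a placeholder and the POST method is an assumption, since neither is stated on this page; the helper name `build_create_dataset_request` is hypothetical. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder URL: substitute the real endpoint for your account.
DATASETS_URL = "https://api.example.com/datasets"


def build_create_dataset_request(token, name, description=None, **sections):
    """Build (but do not send) a request that creates a dataset."""
    body = {"name": name}
    if description is not None:
        body["description"] = description
    # Optional sections such as parsing_options are included only when given.
    body.update({key: value for key, value in sections.items() if value is not None})
    return urllib.request.Request(
        DATASETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",  # assumed method for a create operation
    )


request = build_create_dataset_request(
    "YOUR_API_KEY",  # placeholder token
    name="invoices-2023",
    description="This dataset contains all invoices from 2023.",
    parsing_options={"chunking_strategy": "page"},
)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.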

name

string

required

The name of the dataset.

The name can only contain alphanumeric characters, hyphens, and underscores.

The name must be unique within the organization and project context.
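The character rule above can be checked on the client before sending a request. This is a sketch under the assumption that "alphanumeric" means ASCII letters and digits; the helper name is hypothetical, and uniqueness can only be verified by the server.

```python
import re

# Letters, digits, hyphens, and underscores only (ASCII-only is an assumption).
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True when the name uses only the documented characters."""
    return _NAME_PATTERN.fullmatch(name) is not None
```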

Example:

"invoices_dataset"

parsing_options

object

The properties of this object define the configuration for the document parsing process.

Tensorlake provides sane defaults that work well for most documents, so this object is not required. However, every document is different, and you may want to customize the parsing process to better suit your needs.

parsing_options.table_output_mode

enum<string>

default:html

The format for the tables extracted from the document.

html - tables are represented as HTML strings.
markdown - tables are represented as Markdown strings.

The default is html.

Available options:

html,

markdown

parsing_options.table_parsing_format

enum<string>

default:tsr

Determines which model the system uses to identify and extract tables from the document.

tsr - identifies the structure of the table first, and then the cells of the table. Better suited for dense, long, or grid-like tables.
vlm - uses a vision language model (VLM) to identify and extract the cells of the table. Better suited for tables with merged cells or irregular structures.

The default is tsr.

Available options:

tsr,

vlm

parsing_options.chunking_strategy

enum<string>

default:none

Determines how the document is chunked into smaller pieces.

none - no chunking is applied.
page - chunks the document into pages.
section - chunks the document into sections.
fragment - chunks the document by objects detected in the document. Every text block, image, table, etc. is considered a fragment.

The default is none.

Available options:

none,

page,

section,

fragment

parsing_options.signature_detection

boolean

default:false

Flag to enable the detection of signatures in the document.

This flag incurs additional billing costs.

The default is false.

parsing_options.remove_strikethrough_lines

boolean

default:false

Flag to enable the detection, and removal, of strikethrough text in the document.

This flag incurs additional billing costs.

The default is false.

parsing_options.skew_detection

boolean

default:false

Boolean flag to detect and correct skewed or rotated pages in the document.

Setting this to true will increase the processing time of the document.

The default is false.

parsing_options.disable_layout_detection

boolean

default:false

Disable bounding box detection for the document. Leads to faster document parsing.

The default is false.

parsing_options.ignore_sections

enum<string>[]

A set of page fragment types to ignore during parsing.

This can be used to skip certain types of content that are not relevant for the parsing process, such as headers, footers, or other non-essential elements.

The default is an empty set.

Available options:

section_header,

title,

text,

table,

figure,

formula,

form,

key_value_region,

document_index,

list_item,

table_caption,

figure_caption,

formula_caption,

page_footer,

page_header,

page_number,

signature,

strikethrough,

tracked_changes,

comments

parsing_options.cross_page_header_detection

boolean

default:false

Enable header-hierarchy detection across pages.

When set to true, the parser will consider headers from different pages when determining the hierarchy of headers within a single page.

The default is false.

parsing_options.ocr_model

enum<string>

default:model01

The model to use for OCR (Optical Character Recognition).

model01 (default) - fast, but may be less accurate on complex tables. A good fit for legal documents with footnotes.
model02 - slower, but may be more accurate on complex tables. A good fit for financial documents with merged cells.
model03 - a compact model that we deliver to on-premise users. It takes about 2 minutes to start up on Tensorlake's Cloud because it is intended for testing by users who will eventually deploy it on dedicated hardware in their own datacenter.

Available options:

model01,

model02,

model03,

gemini3,

model06
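To make the parsing_options fields above concrete, here is a minimal sketch that validates overrides against the documented values before they go into a request body. The helper `build_parsing_options` and its error handling are assumptions for illustration; only the field names and allowed values come from this page (the ignore_sections values are not re-validated here to keep the sketch short).

```python
ENUM_FIELDS = {
    "table_output_mode": {"html", "markdown"},
    "table_parsing_format": {"tsr", "vlm"},
    "chunking_strategy": {"none", "page", "section", "fragment"},
    "ocr_model": {"model01", "model02", "model03", "gemini3", "model06"},
}
BOOLEAN_FIELDS = {
    "signature_detection",
    "remove_strikethrough_lines",
    "skew_detection",
    "disable_layout_detection",
    "cross_page_header_detection",
}


def build_parsing_options(**overrides) -> dict:
    """Return only the overridden fields, checked against the documented values."""
    options = {}
    for key, value in overrides.items():
        if key in ENUM_FIELDS:
            if value not in ENUM_FIELDS[key]:
                raise ValueError(f"{key} must be one of {sorted(ENUM_FIELDS[key])}")
        elif key in BOOLEAN_FIELDS:
            if not isinstance(value, bool):
                raise TypeError(f"{key} must be a boolean")
        elif key == "ignore_sections":
            # The field is described as a set, so drop duplicates and sort for stable output.
            value = sorted(set(value))
        else:
            raise KeyError(f"unknown parsing option: {key}")
        options[key] = value
    return options


parsing_options = build_parsing_options(
    table_output_mode="markdown",
    table_parsing_format="vlm",
    chunking_strategy="page",
    skew_detection=True,
    ignore_sections=["page_header", "page_footer", "page_number"],
)
```

Fields that are not overridden are simply omitted, which leaves the documented defaults in effect.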

structured_extraction_options

object[] | null

The properties of this object define the configuration for structured data extraction.

If this object is present, the API will perform structured data extraction on the document.

structured_extraction_options.schema_name

string

required

The name of the schema. This is used to tag the structured data output with a name in the response.

structured_extraction_options.json_schema

any

required

The JSON schema to guide structured data extraction from the file.

This schema should be a valid JSON schema that defines the structure of the data to be extracted.

The API supports a subset of the JSON schema specification.

This value must be provided if structured_extraction_options is present in the request.

structured_extraction_options.skip_ocr

boolean

Boolean flag to skip converting the document blob to OCR text before structured data extraction.

If set to true, the API will skip the OCR step and directly extract structured data from the document.

The default is false.

structured_extraction_options.prompt

string | null

The prompt to use for structured data extraction.

If not provided, the default prompt will be used.

structured_extraction_options.model_provider

enum<string>

The model provider to use for structured data extraction.

The default is tensorlake, which uses our private model running on our servers.

Available options:

tensorlake,

gemini,

sonnet,

gpt4o_mini

structured_extraction_options.partition_strategy

Strategy to partition the document before structured data extraction. The API will return one structured data object per partition. This is useful when you want to extract certain fields from every page.

Options:

None (default) - no partitioning is applied.
Page - partition the document into pages.
Section - partition the document into sections. A section is defined as a group of text blocks that are visually separated from other text blocks by whitespace or other visual elements.
Fragment - partition the document by fragments. A fragment is defined as a group of text blocks, images, tables, etc. that are visually grouped together.
Patterns - partition the document by custom patterns. This requires providing start_patterns and end_patterns to define the custom patterns. Patterns are defined as strings specific to the document content. The start_patterns and end_patterns are used to identify the beginning and end of each partition.

Available options:

none

structured_extraction_options.page_classes

string[] | null

Filter the pages of the document to be used for structured data extraction by providing a list of page classes.

The default is None, which means all pages will be used.

structured_extraction_options.provide_citations

boolean | null

Flag to enable visual citations in the structured data output. When enabled, the output includes the bounding boxes of the document regions from which the structured data was extracted.

The default is false.
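A structured_extraction_options value is a list of entries, each with a required schema_name and json_schema. The sketch below builds one entry; the helper `extraction_entry`, the invoice schema, and the page class name `invoice_page` are hypothetical, and the schema sticks to basic keywords because the supported JSON schema subset is not listed here.

```python
import json

# A hypothetical schema using only basic JSON schema keywords.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}


def extraction_entry(schema_name: str, json_schema: dict, **optional) -> dict:
    """Build one entry with the two required fields plus any optional ones."""
    if not schema_name:
        raise ValueError("schema_name is required")
    entry = {"schema_name": schema_name, "json_schema": json_schema}
    # Leave out optional fields set to None so the server-side defaults apply.
    entry.update({key: value for key, value in optional.items() if value is not None})
    return entry


structured_extraction_options = [
    extraction_entry(
        "invoice",
        invoice_schema,
        provide_citations=True,
        page_classes=["invoice_page"],
        prompt=None,  # omitted from the entry, so the default prompt is used
    ),
]
print(json.dumps(structured_extraction_options, indent=2))
```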

page_classifications

object[] | null

The properties of this object define the configuration for page classification.

If this object is present, the API will perform page classification on the document.

page_classifications.name

string

required

The name of the page class.

page_classifications.description

string

required

The description of the page class to guide the model to classify the pages. Describe what the model should look for in the page to classify it.
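As an illustration of the two required fields, the sketch below defines two page classes and checks that each has a non-empty name and description. The class names, descriptions, and the helper `validate_page_classifications` are hypothetical.

```python
def validate_page_classifications(classes: list[dict]) -> list[str]:
    """Check the two required fields on every entry and return the class names."""
    names = []
    for index, entry in enumerate(classes):
        for field in ("name", "description"):
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"page_classifications[{index}].{field} is required")
        names.append(entry["name"])
    return names


page_classifications = [
    {
        "name": "invoice_page",
        "description": "Pages that show an invoice number, line items, and a total amount due.",
    },
    {
        "name": "terms_page",
        "description": "Pages that contain only payment terms or legal conditions.",
    },
]
class_names = validate_page_classifications(page_classifications)
```

Descriptions that state what is visible on the page give the model something concrete to look for.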

enrichment_options

object

The properties of this object help to extend the output of the document parsing process with additional information.

This includes summarization of tables and figures, which can help to provide a more comprehensive understanding of the document.

This object is not required, and the API will use default settings if it is not present.

enrichment_options.table_summarization

boolean

default:false

Generate a summary for parsed tables.

The default is false.

enrichment_options.table_summarization_prompt

string | null

The prompt to guide the table summarization. Ignored if table_summarization is false. Default prompt - "Summarize the table in a concise manner."

enrichment_options.figure_summarization

boolean

default:false

Generate a summary for parsed figures.

The default is false.

enrichment_options.figure_summarization_prompt

string | null

The prompt to guide the figure summarization. Ignored if figure_summarization is false. Default prompt - "Summarize the figure in a concise manner."

enrichment_options.include_full_page_image

boolean

default:false

Use the full page image in addition to the cropped table and figure images. This gives the language models context about the table or figure they are summarizing, and could improve the summarization quality.

The default is false.

description

string | null

A description of the dataset.

This field is optional and can be used to provide additional context about the dataset.

Example:

"This dataset contains all invoices from 2023."

Response

Dataset created successfully
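A successful response carries three required fields: name, dataset_id, and created_at. The sketch below parses such a payload into a typed object, assuming the response body is a JSON object with exactly those keys. The `CreatedDataset` class and the sample payload are illustrative; the timestamp handling covers the `Z`-suffixed form shown in this page's example rather than every RFC 3339 variant.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CreatedDataset:
    name: str
    dataset_id: str
    created_at: datetime


def parse_created_dataset(payload: str) -> CreatedDataset:
    """Parse the JSON response body into a CreatedDataset."""
    data = json.loads(payload)
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 onward,
    # so rewrite it as an explicit UTC offset first.
    created = datetime.fromisoformat(data["created_at"].replace("Z", "+00:00"))
    return CreatedDataset(
        name=data["name"],
        dataset_id=data["dataset_id"],
        created_at=created.astimezone(timezone.utc),
    )


# Hypothetical response body built from the example values on this page.
sample = (
    '{"name": "invoices-2023", "dataset_id": "dataset_12345", '
    '"created_at": "2023-10-01T12:00:00Z"}'
)
dataset = parse_created_dataset(sample)
```

Keep the dataset_id from the response, since it is the identifier used to refer to the dataset in later API calls.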

name

string

required

The human-readable name of the dataset provided during creation.

Example:

"invoices_dataset"

dataset_id

string

required

The unique identifier for the dataset.

This identifier is used to refer to the dataset in API endpoints and operations.

This value is automatically generated and is unique within the organization and project context.

Example:

"dataset_12345"

created_at

string

required

The date and time when the dataset was created.

The date is in RFC 3339 format (e.g., "2023-10-01T12:00:00Z").

Example:

"2023-10-01T12:00:00Z"
