cURL
curl --request POST \
--url https://api.tensorlake.ai/documents/v1/datasets \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"description": "<string>",
"extractSettings": null,
"name": "<string>",
"parseSettings": null,
"settings": null
}'
Create an ingestion workflow for structured extraction or document parsing.
A dataset is a collection of settings that help with organizing documents from the same domain and enable focused
document intelligence.
The dataset’s name must be unique.
Your data is NOT sent to a third-party service (OpenAI, Anthropic, etc.); our own models parse the document.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
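The same request can be assembled with Python's standard library. This is a minimal sketch, not an official client: the helper name `build_create_dataset_request` and the sample values are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request

DATASETS_URL = "https://api.tensorlake.ai/documents/v1/datasets"


def build_create_dataset_request(
    token: str, name: str, description: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a dataset.

    Hypothetical helper for illustration; the field names mirror the
    request body shown in the cURL example.
    """
    body = {
        "name": name,  # dataset names must be unique
        "description": description,
        "extractSettings": None,
        "parseSettings": None,
        "settings": None,
    }
    return urllib.request.Request(
        DATASETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sample values are placeholders, not real credentials or datasets.
request = build_create_dataset_request("my-token", "invoices", "Supplier invoices")
```

Passing `request` to `urllib.request.urlopen` would send it. The response format is not described on this page, so the sketch stops before that step.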
extractSettings.jsonSchema
The JSON schema to guide structured data extraction from the file.
Encode the JSON schema as a string.
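Encoding the schema as a string means the schema is serialized on its own before it is placed in the request body. A minimal sketch, assuming a made-up invoice schema (the schema contents and the dataset name are illustrative only):

```python
import json

# Hypothetical schema for illustration; any valid JSON Schema could go here.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number"],
}

# jsonSchema carries the schema as a JSON-encoded string, not a nested object.
extract_settings = {"jsonSchema": json.dumps(invoice_schema)}

# The whole request body is serialized once more when it is sent.
request_body = json.dumps({"name": "invoices", "extractSettings": extract_settings})
```

After this step the schema appears in `request_body` as an escaped string value rather than as nested JSON.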
extractSettings.formDetectionMode
Available options: vlm, object_detection
extractSettings.modelProvider
The model provider to use for structured data extraction.
Specifying tensorlake will use a private model, which runs on our servers.
Available options: tensorlake, claude-3-5-sonnet-latest, gpt-4o-mini
Override the prompt to customize structured extractions. Use this if you
want to extract data from a file using a different prompt than the one
we use by default.
extractSettings.tableParsingStrategy
Table Structure Understanding is the default mode. It's great for structured tables.
VLM is great for unstructured or semi-structured tables.
Available options: tsr, vlm
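The enumerated extractSettings fields can be checked on the client before a request is made. The sketch below is one possible approach, not part of the API: the helper `validate_extract_settings` is hypothetical, and the allowed values are copied from the option lists above.

```python
# Allowed values copied from the option lists for extractSettings.
EXTRACT_OPTIONS = {
    "formDetectionMode": ("vlm", "object_detection"),
    "modelProvider": ("tensorlake", "claude-3-5-sonnet-latest", "gpt-4o-mini"),
    "tableParsingStrategy": ("tsr", "vlm"),
}


def validate_extract_settings(settings: dict) -> list:
    """Return a list of problems found in enumerated fields (empty if none).

    Hypothetical helper: fields that are absent or None are skipped.
    """
    problems = []
    for field, allowed in EXTRACT_OPTIONS.items():
        value = settings.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {list(allowed)}")
    return problems


ok = validate_extract_settings({"modelProvider": "tensorlake", "tableParsingStrategy": "tsr"})
bad = validate_extract_settings({"formDetectionMode": "ocr"})
```

Here `ok` is an empty list, while `bad` holds one message naming the rejected `formDetectionMode` value.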
parseSettings.chunkStrategy
The chunking strategy determines how the document is chunked into smaller pieces.
This is only supported in markdown mode.
Available options: page, section, fragment
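Because chunking is only supported in markdown mode, a client can reject incompatible combinations early. The sketch below is an assumption-laden illustration: the helper `build_parse_settings` is hypothetical, and the key name `outputMode` is assumed, since the field name for the output mode is not shown on this page.

```python
from typing import Optional

CHUNK_STRATEGIES = ("page", "section", "fragment")
OUTPUT_MODES = ("json", "markdown")


def build_parse_settings(
    chunk_strategy: Optional[str] = None, output_mode: str = "markdown"
) -> dict:
    """Build a parseSettings dict, enforcing the markdown-only chunking rule.

    Hypothetical helper; "outputMode" is an assumed key name.
    """
    if output_mode not in OUTPUT_MODES:
        raise ValueError(f"unknown output mode: {output_mode!r}")
    settings = {"outputMode": output_mode}  # assumed key name
    if chunk_strategy is not None:
        if chunk_strategy not in CHUNK_STRATEGIES:
            raise ValueError(f"unknown chunk strategy: {chunk_strategy!r}")
        if output_mode != "markdown":
            raise ValueError("chunkStrategy is only supported in markdown mode")
        settings["chunkStrategy"] = chunk_strategy
    return settings


parse_settings = build_parse_settings("section")
```

Calling `build_parse_settings("page", output_mode="json")` raises `ValueError` instead of producing a combination the description rules out.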
parseSettings.figureSummarization
Whether to summarize the contents of the figures.
parseSettings.figureSummarizationPrompt
The prompt to use for figure summarization.
parseSettings.formDetectionMode
Available options: vlm, object_detection
The output mode determines the format of the output.
json mode has the individual page elements and their bounding boxes. JSON mode also includes any images in the document, encoded as base64.
markdown mode converts the document into a markdown format. Images are summarized as text. The original images are not included in the output.
Available options: json, markdown
parseSettings.tableOutputMode
The mode to use for table output: JSON, Markdown, or HTML.
JSON mode is great for structured tables.
Markdown mode is great for tables without merged cells.
HTML mode is great for tables with merged cells.
Available options: json, markdown, html
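The guidance for tableOutputMode can be written as a small decision helper. This is only a sketch of one reading of that guidance: the function `pick_table_output_mode` and its two flags are hypothetical, and checking the structured-data preference first is an assumption rather than documented behavior.

```python
def pick_table_output_mode(
    has_merged_cells: bool, wants_structured_data: bool = False
) -> str:
    """Suggest a tableOutputMode value from simple table characteristics.

    Hypothetical heuristic based on the descriptions above:
    - json for structured tables consumed as data,
    - html for tables with merged cells,
    - markdown for tables without merged cells.
    """
    if wants_structured_data:
        return "json"  # assumption: the data-oriented preference wins
    return "html" if has_merged_cells else "markdown"


mode_for_report = pick_table_output_mode(has_merged_cells=True)
mode_for_simple = pick_table_output_mode(has_merged_cells=False)
```

The returned string can be used directly as the value of `tableOutputMode` in the parse settings.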
parseSettings.tableParsingStrategy
The mode to use for table parsing: Table Structure Understanding or VLM.
Table Structure Understanding is the default mode. It's great for structured tables.
VLM is great for unstructured or semi-structured tables.
Available options: tsr, vlm
parseSettings.tableSummarization
Whether to summarize the contents of the tables.
parseSettings.tableSummarizationPrompt
The prompt to use for table summarization.
The chunking strategy determines how the document is chunked into smaller pieces.
This is only supported in markdown mode.
Available options: page, section, fragment
settings.figureSummarizationPrompt
The prompt to use for figure summarization.
settings.formDetectionMode
Whether to summarize the contents of the tables.
Available options: vlm, object_detection
The JSON schema to guide structured data extraction from the file.
Encode the JSON schema as a string.
If provided, the pages argument will be ignored.
The model provider to use for structured data extraction.
Specifying tensorlake will use a private model, which runs on our servers.
Available options: tensorlake, claude-3-5-sonnet-latest, gpt-4o-mini
settings.structuredExtractionPrompt
Override the prompt to customize structured extractions. Use this if you
want to extract data from a file using a different prompt than the one
we use by default.
The mode to use for table output: JSON, Markdown, or HTML.
JSON mode is great for structured tables.
Markdown mode is great for tables without merged cells.
HTML mode is great for tables with merged cells.
Available options: json, markdown, html
settings.tableParsingMode
The mode to use for table parsing: Table Structure Understanding or VLM.
Table Structure Understanding is the default mode. It's great for structured tables.
VLM is great for unstructured or semi-structured tables.
Available options: tsr, vlm
settings.tableSummarizationPrompt
The prompt to use for table summarization.
Create a new dataset. Reference the dataset by name to automatically insert structured data that the Structured Extraction API extracts from documents.