Datasets
API for creating datasets from a corpus of documents
Datasets make it easy to build a knowledge base from a corpus of documents and keep it continuously updated as documents are added, updated, or removed.
At the moment, datasets can apply Document Ingestion actions, such as document parsing, chunking, or structured extraction, to new documents.
Quick Start
Create a dataset
Ingest files
A dataset can be extended with files from your local file system, publicly accessible documents via HTTP, or existing Tensorlake files (which start with tensorlake-).
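The three kinds of sources listed above can be told apart with a simple check. The following is a minimal sketch: `classify_source` is a hypothetical helper, not part of the Tensorlake SDK, and it assumes only that existing Tensorlake files are identified by the `tensorlake-` prefix and that remote documents use `http://` or `https://` URLs.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return 'tensorlake_file', 'http', or 'local' for an ingest source.

    Hypothetical helper for illustration; not part of the Tensorlake SDK.
    """
    # Existing Tensorlake files are recognized by the "tensorlake-" prefix.
    if source.startswith("tensorlake-"):
        return "tensorlake_file"
    # Publicly accessible documents are addressed by an HTTP(S) URL.
    if urlparse(source).scheme in ("http", "https"):
        return "http"
    # Anything else is treated as a path on the local file system.
    return "local"
```

Note that a local file whose name happens to begin with `tensorlake-` would be misclassified by this sketch, so a real implementation may need an explicit source type.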
Retrieve outputs
Every API in the datasets SDK supports asynchronous operations.
When creating a dataset, you need to provide either an extract_settings or a parse_settings object in the creation request payload.
For creating a document parsing dataset, consult the Parse API reference.
For creating a structured extraction dataset, consult the Extract API reference.
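The either/or requirement can be checked before a creation request is sent. The sketch below uses a hypothetical helper, `dataset_kind`, which is not part of the Tensorlake SDK. It assumes that supplying both objects at once is invalid; the text above only states that one of them is required, so relax that check if the API accepts both.

```python
from typing import Any, Optional


def dataset_kind(
    extract_settings: Optional[dict[str, Any]] = None,
    parse_settings: Optional[dict[str, Any]] = None,
) -> str:
    """Return which kind of dataset the given settings would create.

    Hypothetical helper for illustration; not part of the Tensorlake SDK.
    """
    if extract_settings is None and parse_settings is None:
        raise ValueError("provide either extract_settings or parse_settings")
    # Assumption: the two settings objects are mutually exclusive.
    if extract_settings is not None and parse_settings is not None:
        raise ValueError("provide only one of extract_settings or parse_settings")
    if extract_settings is not None:
        return "structured_extraction"
    return "document_parsing"
```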
Datasets API reference
URL: https://api.tensorlake.ai/documents/v1/datasets
Name
Attribute: name.
The name of the dataset. It must be unique.
Description
Attribute: description.
A description of the dataset. This attribute is optional.
Extraction Options
Attribute: extractSettings.
The extraction settings for the dataset. This is a JSON object.
The schema you provide is used to extract structured data from the document. The schema is defined as JSON Schema.
For a list of all the attributes in the extraction settings, consult the Extract API reference.
Parse Options
Attribute: parseSettings.
The parse settings for the dataset. This is a JSON object.
The parse settings are used to parse the document into markdown or JSON.
For a list of all the attributes in the parse settings, consult the Parse API reference.
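To show how the attributes above fit together, here is a minimal sketch that builds a creation request with Python's standard library, without sending it. The URL and attribute names come from this reference. The `POST` method, the `Authorization: Bearer` header, and the JSON content type are assumptions, and `build_create_dataset_request` is a hypothetical helper rather than an SDK function. The settings object is left empty because its attributes are documented in the Parse and Extract API references.

```python
import json
import urllib.request
from typing import Any, Optional

DATASETS_URL = "https://api.tensorlake.ai/documents/v1/datasets"


def build_create_dataset_request(
    api_key: str,
    name: str,
    description: Optional[str] = None,
    extract_settings: Optional[dict[str, Any]] = None,
    parse_settings: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build, but do not send, a dataset creation request (illustrative only)."""
    body: dict[str, Any] = {"name": name}  # name must be unique
    if description is not None:  # description is optional
        body["description"] = description
    if extract_settings is not None:
        body["extractSettings"] = extract_settings
    if parse_settings is not None:
        body["parseSettings"] = parse_settings
    return urllib.request.Request(
        DATASETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: bearer-token authentication with a JSON body.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # Assumption: datasets are created with POST.
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. Keeping construction separate from sending makes the payload easy to inspect first.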