Datasets make it easy to build a knowledge base from a corpus of documents and keep it up to date as documents are added, updated, or removed.

At the moment, datasets can apply Document Ingestion actions, such as document parsing, chunking, or structured extraction, to new documents.

All the code examples on this page use the official Tensorlake SDK for Python. For other languages, please consult our API Reference.

Quick Start

1

Create a dataset

import json
from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI, DatasetOptions, ExtractionOptions

class LoanSchema(BaseModel):
  account_number: str = Field(description="Account number of the customer")
  customer_name: str = Field(description="Name of the customer")
  amount_due: str = Field(description="Total amount due in the current statement")
  due_date: str = Field(description="Due Date")

doc_ai = DocumentAI(api_key="xxxx")

options = DatasetOptions(
  name="DatasetName",
  extraction_options=ExtractionOptions(
    json_schema=json.dumps(LoanSchema.model_json_schema())
  )
)

dataset = doc_ai.create_dataset('DatasetName', options)

2

Ingest files

from tensorlake.documentai import IngestArgs

job = dataset.ingest(IngestArgs(file_path="/path/to/file"))

A dataset can be extended with files from your local file system, publicly accessible documents via HTTP, or existing Tensorlake files (which start with tensorlake-).
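
As a rough illustration of the three kinds of sources described above, the sketch below classifies a source string before it is handed to the SDK. The helper name `classify_source` and its return labels are hypothetical and not part of the Tensorlake SDK; only the `tensorlake-` prefix comes from this page.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Classify a source string (hypothetical helper, not part of the SDK).

    Returns one of: "tensorlake_file", "remote_url", "local_path".
    """
    # Existing Tensorlake files are identified by the "tensorlake-" prefix.
    if source.startswith("tensorlake-"):
        return "tensorlake_file"
    # Publicly accessible documents are fetched over HTTP(S).
    if urlparse(source).scheme.lower() in ("http", "https"):
        return "remote_url"
    # Everything else is treated as a path on the local file system.
    return "local_path"
```

Note that a local file whose relative path happens to start with `tensorlake-` would be misclassified by this sketch; passing an absolute path avoids the ambiguity.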

3

Retrieve outputs

# Collect every item in the dataset, keyed by its key info.
items = {}

# Fetch the first page of results.
items_page = dataset.items()
for key_info, data in items_page.items.items():
    items[key_info] = data.model_dump_json()

# Keep following the cursor until there are no more pages.
cursor = items_page.cursor
while cursor is not None:
    items_page = dataset.items(cursor=cursor)
    for key_info, data in items_page.items.items():
        items[key_info] = data.model_dump_json()
    cursor = items_page.cursor
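
The cursor loop above can also be wrapped in a small generator so that callers do not have to repeat the paging logic. This is a minimal sketch: `iter_dataset_items` is a hypothetical helper, and it assumes only what the snippet above shows, namely that `dataset.items()` accepts an optional `cursor` argument and returns a page with `.items` (a dict) and `.cursor` (`None` on the last page).

```python
from typing import Any, Iterator, Tuple


def iter_dataset_items(dataset: Any) -> Iterator[Tuple[Any, Any]]:
    """Yield (key_info, data) pairs across all pages (hypothetical helper)."""
    page = dataset.items()  # first page, no cursor
    while True:
        # Each page exposes its results as a dict under `.items`.
        yield from page.items.items()
        if page.cursor is None:
            break  # last page reached
        page = dataset.items(cursor=page.cursor)
```

With this helper, the retrieval step could be written as `items = {key: data.model_dump_json() for key, data in iter_dataset_items(dataset)}`.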

Every API in the datasets SDK supports asynchronous operations.

When creating a dataset, you need to provide either an extract_settings or a parse_settings object in the creation request payload.

For creating a document parsing dataset, consult the Parse API reference.

For creating a structured extraction dataset, consult the Extract API reference.

Datasets API reference

URL: https://api.tensorlake.ai/documents/v1/datasets

Name

Attribute: name.

The name of the dataset. It must be unique.

Description

Attribute: description.

A description of the dataset. This attribute is optional.

Extraction Options

Attribute: extractSettings.

The extraction settings for the dataset. This is a JSON object.

The schema you provide is used to extract structured data from the document. The schema is defined as JSON Schema.

For a list of all the attributes in the extraction settings, consult the Extract API reference.
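
For readers not using Pydantic, a JSON Schema for the loan example in the Quick Start can also be written by hand. The sketch below shows roughly what such a schema might look like. It is an illustration rather than the exact output of `LoanSchema.model_json_schema()`, and marking every field as required is an assumption.

```python
import json

# Hand-written JSON Schema mirroring the LoanSchema fields from the Quick Start.
loan_schema = {
    "title": "LoanSchema",
    "type": "object",
    "properties": {
        "account_number": {
            "type": "string",
            "description": "Account number of the customer",
        },
        "customer_name": {
            "type": "string",
            "description": "Name of the customer",
        },
        "amount_due": {
            "type": "string",
            "description": "Total amount due in the current statement",
        },
        "due_date": {"type": "string", "description": "Due Date"},
    },
    # Assumption: all four fields are required.
    "required": ["account_number", "customer_name", "amount_due", "due_date"],
}

# Serialize to a string, as the Quick Start does with json.dumps(...).
schema_json = json.dumps(loan_schema)
```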

Parse Options

Attribute: parseSettings.

The parse settings for the dataset. This is a JSON object.

The parse settings are used to parse the document into markdown or JSON.

For a list of all the attributes in the parse settings, consult the Parse API reference.
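
To tie the attributes above together, the following sketch assembles a creation request body as a plain dictionary. `build_dataset_payload` is a hypothetical helper, not part of the SDK. It uses the attribute names listed in this reference (`name`, `description`, `extractSettings`, `parseSettings`) and enforces that at least one settings object is present. Whether both may be supplied together is not stated on this page, so the sketch does not forbid it. The contents of each settings object are left to the caller; see the Extract and Parse API references.

```python
from typing import Any, Dict, Optional


def build_dataset_payload(
    name: str,
    description: Optional[str] = None,
    extract_settings: Optional[Dict[str, Any]] = None,
    parse_settings: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a dataset creation payload (hypothetical helper)."""
    if not name:
        raise ValueError("name is required and must be unique")
    if extract_settings is None and parse_settings is None:
        raise ValueError("provide extract_settings or parse_settings")

    payload: Dict[str, Any] = {"name": name}
    if description is not None:
        payload["description"] = description  # optional attribute
    if extract_settings is not None:
        payload["extractSettings"] = extract_settings
    if parse_settings is not None:
        payload["parseSettings"] = parse_settings
    return payload
```

The resulting dictionary could then be serialized with `json.dumps` and sent to the datasets URL given above using an HTTP client of your choice.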