Retrieve Dataset Data

A Tensorlake Dataset is a collection of parse results for documents that were parsed with the options defined by the Dataset. You can retrieve the parse results stored in a Dataset from the /datasets/{dataset_id}/data endpoint.

from tensorlake.documentai.client import (
  DocumentAI,
  ParseStatus,
)

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

# If you don't know your dataset ID, you can go to https://cloud.tensorlake.ai
# and find it in the Datasets section.
dataset = doc_ai.get_dataset("your_dataset_id")

dataset_data = doc_ai.get_dataset_data(dataset)

for parsed_result in dataset_data.items:
    print(f"Parse ID: {parsed_result.parse_id}")

    if parsed_result.status == ParseStatus.SUCCESSFUL:
        print("Parsed Document:")
        print(parsed_result.document)
    else:
        print(f"Parse Status: {parsed_result.status}")

Dataset data is returned as a paginated list of results from parse jobs initiated via the Dataset. Each item in the list has the same structure as the result of a regular parse job. Both the API and the Python SDK use cursor-based pagination to retrieve Dataset data: each response includes a next_cursor field that you can use to retrieve the next page of results.
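The cursor loop described above can be sketched as a small generator. This is a minimal illustration, not the SDK's implementation: iter_dataset_items and demo_fetch_page are hypothetical names, the page is modeled as a plain dict with the items and next_cursor fields, and the assumption that a missing or empty next_cursor marks the last page is not confirmed by this page. In real use, fetch_page would wrap whatever call returns one page of Dataset data.

```python
from typing import Callable, Iterator, Optional

# Assumed page shape for this sketch: {"items": [...], "next_cursor": str or None}
Page = dict


def iter_dataset_items(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every item across all pages by following next_cursor.

    fetch_page is a hypothetical callable: it receives the cursor for the
    page to load (None for the first page) and returns one page.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        # Assumption: no cursor (missing, None, or empty) means no more pages.
        if not cursor:
            break


def demo_fetch_page(cursor: Optional[str]) -> Page:
    """Stand-in for a real request, serving two fixed pages."""
    pages = {
        None: {
            "items": [{"parse_id": "parse_1"}, {"parse_id": "parse_2"}],
            "next_cursor": "c1",
        },
        "c1": {"items": [{"parse_id": "parse_3"}], "next_cursor": None},
    }
    return pages[cursor]


parse_ids = [item["parse_id"] for item in iter_dataset_items(demo_fetch_page)]
print(parse_ids)
```

Keeping the loop separate from the request function means the same generator works whether a page comes from the HTTP endpoint or from the SDK.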

Filtering Dataset Data

The /datasets/{dataset_id}/data endpoint supports filtering the Dataset data by various parameters. You can filter by:

status: Filter by the status of the parse job (e.g., Pending, Processing, Successful, Failure).
file_name: Filter by the name of the file that was parsed. This may not be available if the document was not uploaded to Tensorlake as a file (e.g., if you used a file_url or raw_text).
created_after: Return only parse jobs created on or after this date. The date should be in RFC 3339 format (e.g., 2023-10-01T00:00:00Z).
created_before: Return only parse jobs created on or before this date. The date should be in RFC 3339 format (e.g., 2023-10-01T00:00:00Z).
finished_after: Return only parse jobs that finished on or after this date. The date should be in RFC 3339 format (e.g., 2023-10-01T00:00:00Z).
finished_before: Return only parse jobs that finished on or before this date. The date should be in RFC 3339 format (e.g., 2023-10-01T00:00:00Z).
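As a rough illustration of how these filters could be combined, the sketch below builds the path and query string for a filtered request. The helper names (to_rfc3339, build_data_path) are hypothetical and not part of the Tensorlake SDK. The sketch assumes the filters are sent as URL query parameters under the names listed above, and that status values are spelled as in the examples above; check the API reference for the exact accepted values.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Filter names taken from the list above.
ALLOWED_FILTERS = {
    "status",
    "file_name",
    "created_after",
    "created_before",
    "finished_after",
    "finished_before",
}


def to_rfc3339(dt: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_data_path(dataset_id: str, **filters) -> str:
    """Return the data endpoint path with the given filters as a query string.

    Hypothetical helper: assumes filters are passed as query parameters.
    Filters set to None are skipped; datetimes are converted to RFC 3339.
    """
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {}
    for name, value in filters.items():
        if value is None:
            continue
        params[name] = to_rfc3339(value) if isinstance(value, datetime) else str(value)
    path = f"/datasets/{dataset_id}/data"
    return f"{path}?{urlencode(params)}" if params else path


filtered_path = build_data_path(
    "your_dataset_id",
    status="Successful",
    created_after=datetime(2023, 10, 1, tzinfo=timezone.utc),
)
print(filtered_path)
```

Converting datetimes in one place keeps every date filter in the same RFC 3339 form, and rejecting unknown filter names catches typos before a request is sent.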
