Having all of your documents in a single Tensorlake Dataset makes it easy to build a knowledge base from a corpus of documents and keep it continuously updated as documents are added, updated, or removed. At the moment, datasets can apply Document Ingestion actions, such as document parsing, chunking, or structured extraction, to new documents.
1. Create a dataset

```python
from tensorlake.documentai.client import DocumentAI

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

# Create a dataset. The only required argument for a dataset is its name.
# A dataset name may only contain alphanumeric characters, hyphens or underscores.
#
# Not specifying parsing or extraction options will create a dataset used for parsing
# documents with our recommended defaults.
dataset = doc_ai.create_dataset(
    name="your_dataset_name"
)
```
2. Parse a file
```python
# Use a publicly accessible URL or upload a file to Tensorlake and use the file ID.
file_url = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf"

parse_id = doc_ai.parse_dataset_file(
    dataset=dataset,
    file=file_url
)
```
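The comment above notes that you can upload a file to Tensorlake and use its file ID instead of a public URL. Here is a minimal sketch of that path, assuming the SDK exposes an `upload` helper that returns a file ID (an assumption; verify the exact method against your SDK version):

```python
# Upload a local file and parse it by its Tensorlake file ID.
# Reuses doc_ai and dataset from the steps above. upload() returning a
# file ID is an assumption here; check your SDK version.
file_id = doc_ai.upload(path="/path/to/real-estate-purchase-all-signed.pdf")

parse_id = doc_ai.parse_dataset_file(
    dataset=dataset,
    file=file_id
)
```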
3. Retrieve outputs
```python
# Retrieve the outputs of the parsing job.
result = doc_ai.wait_for_completion(parse_id)

# The result contains the parsed document and any extracted data.
print(result)
```
The Python SDK's `wait_for_completion` method blocks until the parsing job is complete and then returns the result.
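Because `wait_for_completion` blocks on a single parse ID, a simple way to ingest several files is to submit them all first and then wait on each job in turn. A minimal sketch using only the calls shown above (the URLs are placeholders):

```python
# Submit several files to the same dataset, then collect each result.
# Reuses doc_ai and dataset from the steps above; the URLs are placeholders.
file_urls = [
    "https://example.com/contract-1.pdf",
    "https://example.com/contract-2.pdf",
]

# Submitting everything first lets all jobs run while we wait on each in turn.
parse_ids = [doc_ai.parse_dataset_file(dataset=dataset, file=url) for url in file_urls]

results = [doc_ai.wait_for_completion(parse_id) for parse_id in parse_ids]
```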
With datasets, you can ingest as many files as you want, and the same parsing configuration is applied to every one of them. You can also create a dataset with structured extraction options, which lets you extract structured data from related documents; a sketch of what that might look like follows.
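As a sketch of a structured-extraction dataset: the `StructuredExtractionOptions` class, its import path, and the `structured_extraction_options` parameter below are assumptions based on the pattern described in this guide, so check the SDK reference for the exact names and signatures.

```python
# A sketch of a dataset configured for structured extraction.
# StructuredExtractionOptions, its import path, and the
# structured_extraction_options parameter are assumptions; verify them
# against the SDK reference.
from tensorlake.documentai.client import DocumentAI
from tensorlake.documentai.models import StructuredExtractionOptions  # assumed import path

doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

# JSON Schema describing the fields to extract from each ingested document.
purchase_schema = {
    "type": "object",
    "properties": {
        "buyer_name": {"type": "string"},
        "seller_name": {"type": "string"},
        "purchase_price": {"type": "number"},
    },
}

dataset = doc_ai.create_dataset(
    name="real_estate_contracts",
    structured_extraction_options=[
        StructuredExtractionOptions(
            schema_name="purchase_terms",
            json_schema=purchase_schema,
        )
    ],
)
```

Every file subsequently ingested into this dataset would then be parsed and have the schema's fields extracted with the same configuration.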