Document Ingestion Overview
The core workflow of Document Ingestion is:
- Upload a file to the Tensorlake Cloud.
- Parse the file using the Parse Endpoint.
- Retrieve the results of the parse job.
- Delete the files and job results once you have retrieved the results.
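The four steps above can be sketched as a single function. This is a minimal sketch, not the Tensorlake SDK: the `IngestionClient` interface, its method names, the `status` field, and the `InMemoryClient` stand-in are all hypothetical and exist only to show the order of operations and the cleanup step.

```python
import time
from typing import Callable, Dict, Protocol


class IngestionClient(Protocol):
    """Hypothetical client interface; the real API may look different."""

    def upload(self, path: str) -> str: ...
    def parse(self, file_id: str) -> str: ...
    def get_result(self, job_id: str) -> dict: ...
    def delete_file(self, file_id: str) -> None: ...
    def delete_job(self, job_id: str) -> None: ...


def ingest(
    client: IngestionClient,
    path: str,
    poll_seconds: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Upload, parse, retrieve, then delete the file and the job results."""
    file_id = client.upload(path)                  # step 1: upload
    try:
        job_id = client.parse(file_id)             # step 2: parse
        try:
            for _ in range(max_polls):             # step 3: retrieve
                result = client.get_result(job_id)
                if result["status"] != "pending":  # assumed status field
                    return result
                sleep(poll_seconds)
            raise TimeoutError(f"job {job_id} did not finish in time")
        finally:
            client.delete_job(job_id)              # step 4: delete job results
    finally:
        client.delete_file(file_id)                # step 4: delete the file


class InMemoryClient:
    """Stand-in used only to exercise the sketch; makes no network calls."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self.files: Dict[str, str] = {}
        self.jobs: Dict[str, dict] = {}
        self._polls_until_done = polls_until_done

    def upload(self, path: str) -> str:
        file_id = f"file-{len(self.files) + 1}"
        self.files[file_id] = path
        return file_id

    def parse(self, file_id: str) -> str:
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = {"file_id": file_id, "polls": 0}
        return job_id

    def get_result(self, job_id: str) -> dict:
        job = self.jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_done:
            return {"status": "pending"}
        return {"status": "successful", "file_id": job["file_id"]}

    def delete_file(self, file_id: str) -> None:
        del self.files[file_id]

    def delete_job(self, job_id: str) -> None:
        del self.jobs[job_id]
```

The `finally` blocks are a design choice in this sketch: they delete the file and job results even when polling times out, so nothing is left behind after a failed run. Polling could also be replaced by a webhook notification when the parse job completes.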
The APIs to support this workflow are:
Files
File Management endpoints to upload, list, and delete files.
Parse
Endpoint to parse uploaded documents or any remote file.
Webhooks
Webhooks to receive notifications when a parse job is completed.
Key Features
- Structured Data Extraction to pull fields out of a document. Specify the schema using either JSON Schema or Pydantic models.
- Document chunking so that agents can read documents and index them for RAG and knowledge graph applications.
- Get Bounding Boxes of every element in the document for citations and highlighting.
- Summarize tables, charts and figures in documents.
- Detect and get coordinates of signatures in documents.
- Supports an unlimited number of pages per document and unlimited file sizes.
- Supports an unlimited number of fields per document, unlike other structured extraction APIs that limit schemas to around 100 fields.
- All of the above features can be used individually or combined in a single API call, so you do not need to build custom multi-stage document parsing pipelines.
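To make the structured extraction feature concrete, here is a minimal sketch of an extraction schema written as JSON Schema in a Python dictionary. The invoice fields and the `missing_required` helper are hypothetical and chosen only for illustration; how the schema is passed to the Parse endpoint is not shown here.

```python
import json
from typing import List

# Hypothetical invoice schema; the field names are illustrative only.
invoice_schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}


def missing_required(schema: dict, record: dict) -> List[str]:
    """Return the top-level required fields that are absent from a record."""
    return [name for name in schema.get("required", []) if name not in record]


# Serialize the schema so it can be sent in a request body.
schema_json = json.dumps(invoice_schema, indent=2)
```

A quick check such as `missing_required(invoice_schema, extracted)` helps catch incomplete extractions before they reach downstream code. If you prefer Pydantic, a model's `model_json_schema()` method (Pydantic v2) produces a comparable dictionary.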
Supported File Types
Tensorlake supports the following file types:
- Images (PNG and JPG)
- Text files (TXT, XML, CSV)
- Office Documents (DOCX, XLSX, PPTX, Keynote)
- Web Pages (HTML)
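A client-side check against the list above can reject unsupported files before uploading. The sketch below is an assumption-laden illustration: the extension-to-category mapping is derived from the list, while `.jpeg`, `.htm`, and `.key` (for Keynote) are assumed spellings that the list does not state explicitly.

```python
from pathlib import Path
from typing import Optional

# Extension mapping derived from the list above.
# ".jpeg", ".htm", and ".key" are assumptions, not confirmed by the list.
SUPPORTED_EXTENSIONS = {
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".txt": "text",
    ".xml": "text",
    ".csv": "text",
    ".docx": "office",
    ".xlsx": "office",
    ".pptx": "office",
    ".key": "office",
    ".html": "web",
    ".htm": "web",
}


def file_category(name: str) -> Optional[str]:
    """Return the category for a file name, or None if it is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(name).suffix.lower())
```

The check is based on the file extension only, so it is a convenience filter rather than a guarantee that the file content is valid.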