The core workflow of Document Ingestion is:

  1. Upload a file to the Tensorlake Cloud.
  2. Parse the file using the Parse Endpoint.
  3. Retrieve the results of the parse job.
  4. Delete the files and job results after you have retrieved the results.
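
The four steps above can be sketched as a single function. This is only an illustration of the control flow, not Tensorlake's actual SDK: the `IngestionClient` interface, its method names, the `"pending"`/`"done"` status strings, and the in-memory `FakeClient` are all hypothetical stand-ins so that the example runs without network access.

```python
import time
from typing import Protocol


class IngestionClient(Protocol):
    """Hypothetical interface; these method names are NOT Tensorlake's API."""

    def upload(self, path: str) -> str: ...
    def parse(self, file_id: str) -> str: ...
    def get_result(self, job_id: str) -> dict: ...
    def delete_file(self, file_id: str) -> None: ...
    def delete_job(self, job_id: str) -> None: ...


def ingest(client: IngestionClient, path: str,
           poll_seconds: float = 1.0, max_polls: int = 60) -> dict:
    """Run upload -> parse -> retrieve -> delete and return the parse result."""
    file_id = client.upload(path)        # 1. upload the file
    job_id = client.parse(file_id)       # 2. start a parse job
    for _ in range(max_polls):           # 3. poll until the job finishes
        result = client.get_result(job_id)
        if result["status"] != "pending":  # status values are assumed
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"parse job {job_id} did not finish")
    client.delete_job(job_id)            # 4. clean up only after retrieval
    client.delete_file(file_id)
    return result


class FakeClient:
    """In-memory stand-in that finishes each job on the second poll."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self.jobs: dict[str, dict] = {}

    def upload(self, path: str) -> str:
        file_id = f"file-{len(self.files) + 1}"
        self.files[file_id] = path
        return file_id

    def parse(self, file_id: str) -> str:
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = {"file_id": file_id, "polls": 0}
        return job_id

    def get_result(self, job_id: str) -> dict:
        job = self.jobs[job_id]
        job["polls"] += 1
        if job["polls"] < 2:
            return {"status": "pending"}
        return {"status": "done", "text": f"parsed {self.files[job['file_id']]}"}

    def delete_file(self, file_id: str) -> None:
        del self.files[file_id]

    def delete_job(self, job_id: str) -> None:
        del self.jobs[job_id]
```

In this sketch, cleanup happens only once a finished result is in hand, mirroring step 4. On a timeout the file and job are left in place so the result can still be fetched later.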

The APIs to support this workflow are:

Key Features

  • Structured Data Extraction to pull out fields from a document. Specify the schema using either JSON Schema or Pydantic models.
  • Document Chunking so that agents can read documents and index them to build RAG and Knowledge Graph applications.
  • Get Bounding Boxes of every element in the document for citations and highlighting.
  • Summarize tables, charts and figures in documents.
  • Detect and get coordinates of signatures in documents.
  • Supports an unlimited number of pages per document and unlimited file sizes.
  • Supports an unlimited number of fields per document, unlike other structured extraction APIs that limit extraction to around 100 fields.
  • All of the above features can be used individually or combined in a single API call, so you do not need to build custom multi-stage document parsing pipelines.
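
As an illustration of the Structured Data Extraction feature, a schema for an invoice might look like the following when written as JSON Schema. The field names (`invoice_number`, `line_items`, and so on) are made up for this example, and how the schema is attached to a parse request is not shown here because it depends on the API.

```python
import json

# Hypothetical invoice schema expressed as standard JSON Schema.
invoice_schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

# Serialize the schema so it can be sent in a request body or stored in a file.
schema_json = json.dumps(invoice_schema, indent=2)
```

The same structure could instead be declared as a Pydantic model with matching fields. JSON Schema is used here only because it needs nothing beyond the standard library.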

Supported File Types

Tensorlake supports the following file types:

  • PDF
  • Images (PNG and JPG)
  • Text files (TXT, XML, CSV)
  • Office Documents (DOCX, XLSX, PPTX, Keynote)
  • Web Pages (HTML)
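
A client-side check against this list can catch unsupported files before uploading. The sketch below is an assumption-laden convenience, not part of Tensorlake: the extension set is derived from the list above, and treating `.jpeg` as JPG and `.key` as Keynote is a guess about how those types map to file extensions.

```python
from pathlib import Path

# Extensions inferred from the supported file types listed above.
# ".jpeg" and ".key" are assumed spellings for JPG and Keynote files.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".png", ".jpg", ".jpeg",
    ".txt", ".xml", ".csv",
    ".docx", ".xlsx", ".pptx", ".key",
    ".html",
}


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check looks only at the file name, so it cannot tell whether the contents actually match the extension.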