Document Ingestion Overview
Document Ingestion with Tensorlake is as simple as:
Upload your file
Either upload your file to the Tensorlake Cloud or specify a URL to a publicly accessible file.
Specify your parsing settings and parse
Define your parsing settings, including which pages to parse, what chunking strategy to use, what structured data you want extracted, and how to handle complex document fragments like tables, figures, signatures, or strikethrough text.
Get the results
Retrieve the results of the parse job, including markdown chunks, a complete document layout, and structured data if a schema was provided.
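As a rough illustration of this workflow, the sketch below uploads a file, starts a parse job with a few example settings, and polls for the result over HTTP. The endpoint paths, request fields, and response keys (`file_id`, `parse_id`, `status`, `chunks`) are assumptions made for illustration; consult the Files and Parse API references for the exact request and response shapes.

```python
import time
import requests

API_BASE = "https://api.tensorlake.ai"  # assumed base URL; check the API reference
HEADERS = {"Authorization": "Bearer your-api-key"}

# 1. Upload the file (endpoint path and response fields are assumptions).
with open("statement.pdf", "rb") as f:
    upload = requests.post(f"{API_BASE}/files", headers=HEADERS, files={"file": f})
upload.raise_for_status()
file_id = upload.json()["file_id"]

# 2. Start a parse job with example settings: a page range and a chunking strategy.
parse = requests.post(
    f"{API_BASE}/parse",
    headers=HEADERS,
    json={
        "file_id": file_id,
        "page_range": "1-10",
        "chunking_strategy": "page",
    },
)
parse.raise_for_status()
parse_id = parse.json()["parse_id"]

# 3. Poll until the job finishes (status values here are assumptions), then read
#    the markdown chunks, document layout, and any structured data.
while True:
    result = requests.get(f"{API_BASE}/parse/{parse_id}", headers=HEADERS).json()
    if result["status"] in ("successful", "failed"):
        break
    time.sleep(5)

print(result.get("chunks"))
print(result.get("document_layout"))
print(result.get("structured_data"))
```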
The APIs to support this workflow are:
Files
File Management endpoints to upload, list, and delete files.
Parse
Parse endpoints to parse uploaded documents or any remote file.
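The Files endpoints cover basic housekeeping around uploads. The sketch below lists previously uploaded files and deletes one; the endpoint paths, pagination field, and file ID are assumptions for illustration, so check the Files API reference for the real shapes.

```python
import requests

API_BASE = "https://api.tensorlake.ai"  # assumed base URL
HEADERS = {"Authorization": "Bearer your-api-key"}

# List previously uploaded files (the "items" pagination field is an assumption).
files = requests.get(f"{API_BASE}/files", headers=HEADERS).json()
for item in files.get("items", []):
    print(item["file_id"], item.get("name"))

# Delete a file that is no longer needed (the file ID is a placeholder).
requests.delete(f"{API_BASE}/files/file_abc123", headers=HEADERS)
```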
Core API Functionality
While the Tensorlake API is extensive, the core capabilities that set it apart from other Document Ingestion APIs are:
| Core Function | Description |
|---|---|
| Structured Data Extraction | Pull fields out of a document, specifying the schema as either JSON Schema or a Pydantic model (see the sketch after this table). |
| Page Classification | Automatically identify and label different sections or types of pages (e.g., cover, table of contents, appendix) within a document. |
| Document Chunking | Enable agents to read documents, or index chunks for building RAG and knowledge graph applications. |
| Bounding Boxes | Reference every element in the document precisely, for citations and highlighting. |
| Summarization | Summarize tables, charts, and figures in documents. |
| Unlimited Pages and File Size | Parse any number of large documents; you pay only for what you use. |
| Unlimited Fields Per Document | Capture every detail in even the most complex documents, without the ~100-field limits imposed by other APIs. |
| Flexible Usage | All of the above features can be used individually or combined in a single API call, so you don't need to build custom multi-stage document parsing pipelines. |
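For structured data extraction, the schema can be expressed directly as JSON Schema or as a Pydantic model. The sketch below defines a hypothetical loan-document schema as a Pydantic model and converts it to JSON Schema with Pydantic v2's `model_json_schema()`; the field names are made up, and the exact request parameter that carries the schema is an assumption, so refer to the Parse API reference when wiring it in.

```python
from pydantic import BaseModel, Field


class LoanSummary(BaseModel):
    """Hypothetical fields to extract from a loan document."""

    borrower_name: str = Field(description="Full legal name of the borrower")
    loan_amount: float = Field(description="Principal amount of the loan")
    interest_rate: float = Field(description="Annual interest rate as a percentage")
    signed: bool = Field(description="Whether the agreement has been signed")


# Either the model's JSON Schema or the model itself can describe the fields
# you want extracted; the parse request parameter name is an assumption.
json_schema = LoanSummary.model_json_schema()
print(json_schema)
```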