Key Concepts
Document Ingestion
Tensorlake helps you turn unstructured documents into structured, actionable data. This guide covers the essential concepts you’ll need to understand when parsing documents and extracting data with Tensorlake.
Document AI Client
What it is: The main entry point for interacting with Tensorlake. It provides methods for uploading documents, creating parsing jobs, and retrieving results.
Why it matters: This is where you configure your parsing options, upload files, and manage the parsing workflow.
Document Upload
What it is: The first step in any ingestion workflow. Tensorlake accepts PDFs, images, raw text, presentations, and more.
Once your document (or data) is uploaded, it is considered a file. Each file is assigned a file_id, which is used in parsing jobs.
Why it matters: Uploading documents enables asynchronous processing and orchestration.
Parsing Jobs
What it is: A parsing job is the process Tensorlake uses to analyze a document and return structured output. It uses the configured ParsingOptions to determine how the document should be processed.
Why it matters: This is where you define behaviors like schema extraction, signature detection, table parsing, and more.
Parsing Options
What it is: Controls how Tensorlake parses the document. This includes chunking, table strategies, signature detection, OCR preferences, and more.
Why it matters: You can fine-tune performance and accuracy by customizing your parsing strategy.
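The relationship between files, file_id values, parsing jobs, and parsing options can be sketched with a toy in-memory model. This is not the Tensorlake client: ToyDocumentStore, its methods, the ID formats, and the detect_signatures option key are all hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDocumentStore:
    """In-memory stand-in for illustration only; NOT the Tensorlake client."""

    files: dict = field(default_factory=dict)
    jobs: dict = field(default_factory=dict)

    def upload(self, name: str, content: bytes) -> str:
        # Every uploaded document becomes a file with its own file_id.
        file_id = f"file_{uuid.uuid4().hex[:8]}"
        self.files[file_id] = (name, content)
        return file_id

    def create_parse_job(self, file_id: str, options: dict) -> str:
        # A parsing job refers to a file by file_id and carries its options.
        if file_id not in self.files:
            raise KeyError(f"unknown file_id: {file_id}")
        job_id = f"job_{uuid.uuid4().hex[:8]}"
        self.jobs[job_id] = {
            "file_id": file_id,
            "options": options,
            "status": "pending",  # jobs run asynchronously after creation
        }
        return job_id


store = ToyDocumentStore()
file_id = store.upload("policy.pdf", b"%PDF-1.7 ...")
# "detect_signatures" is a hypothetical option key, not a documented one.
job_id = store.create_parse_job(file_id, {"detect_signatures": True})
job = store.jobs[job_id]
```

The point of the sketch is the separation of concerns: the upload step yields an identifier, and the job is created later from that identifier plus a set of options.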
Schemas
What it is: Schemas define what structured data you want extracted. They can include keys like buyer_name, coverage_type, or signature_status, and can be supplied as JSON or an inline string.
Why it matters: Schemas make Tensorlake deterministic. No fuzzy guesses, just structured fields mapped to your business logic.
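As a rough illustration, a schema with the keys named above might be written as a JSON Schema style dictionary and serialized to a string. The field names come from this section; the layout (type, properties, required), the PolicyFields title, and the missing_fields helper are assumptions, so check the schema reference for the exact format Tensorlake expects.

```python
import json

# Hypothetical schema: the keys are the ones mentioned in the text, but the
# JSON Schema layout and the choice of required fields are assumptions.
policy_schema = {
    "title": "PolicyFields",
    "type": "object",
    "properties": {
        "buyer_name": {"type": "string"},
        "coverage_type": {"type": "string"},
        "signature_status": {"type": "string"},
    },
    "required": ["buyer_name"],
}

# The same schema as an inline JSON string.
schema_as_string = json.dumps(policy_schema, indent=2)


def missing_fields(record: dict, schema: dict) -> list:
    """Return the required schema keys that are absent from a record."""
    return [key for key in schema.get("required", []) if key not in record]
```

A helper like missing_fields is one way to check extracted records against the schema before passing them downstream.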
Structured Output
What it is: The output returned by Tensorlake after parsing. Output includes a structured JSON representation of your document, including bounding boxes, page numbers, and fragment types. If you provided a schema, the output will also include structured data that matches your schema.
Why it matters: This output is machine-readable, auditable, and easy to plug into downstream systems like LangGraph, Slack, or CRMs.
For example, here is a snippet based on this document, specifying the schema example above.
Visual Layout & Bounding Boxes
What it is: Each field extracted includes optional layout metadata — such as its position on the page, size, and surrounding context.
Why it matters: Useful for visual validation, audit trails, redlining, and debugging extraction behavior.
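A minimal sketch of how layout metadata might be used for validation, assuming each fragment carries a page number and a rectangular bounding box. The sample data is invented, and the key names (fragment_type, page_number, bbox, x1/y1/x2/y2) and the corner-based coordinate convention are assumptions rather than the documented output format.

```python
# Invented sample fragments; key names and coordinates are assumptions.
fragments = [
    {"fragment_type": "text", "page_number": 1,
     "bbox": {"x1": 50.0, "y1": 100.0, "x2": 250.0, "y2": 140.0}},
    {"fragment_type": "table", "page_number": 2,
     "bbox": {"x1": 40.0, "y1": 300.0, "x2": 560.0, "y2": 500.0}},
]


def bbox_area(bbox: dict) -> float:
    """Area of a box given by its top-left and bottom-right corners."""
    return (bbox["x2"] - bbox["x1"]) * (bbox["y2"] - bbox["y1"])


def fragments_on_page(items: list, page_number: int) -> list:
    """Select the fragments located on one page."""
    return [f for f in items if f["page_number"] == page_number]


page_two = fragments_on_page(fragments, 2)
areas = [bbox_area(f["bbox"]) for f in fragments]
```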
See the bounding boxes in the Playground:
And see the location of the bounding boxes for each fragment in the structured output:
Workflows
Tensorlake Workflows are a powerful way to automate and orchestrate complex tasks. They allow you to define a series of functions that can be executed in parallel or sequentially, depending on your needs.
Graphs
Workflows are created by connecting multiple functions together in a Graph.
A Graph contains:
- Node: Represents a function that operates on data.
- Start Node: The first function executed when the graph is invoked.
- Edge: Represents data flow between functions.
- Conditional Edge: Evaluates the output of the previous function and decides which edges to take, much like an if-else statement in programming.
Graphs are workflows whose functions can be executed in parallel, while Pipelines are linear workflows that execute functions serially.
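The node, start node, and edge vocabulary can be sketched with a small plain-Python graph. ToyGraph and its methods are hypothetical and are not the Tensorlake Graph API. The sketch assumes an acyclic graph and walks the branches one after another, whereas a real runtime could execute independent branches in parallel.

```python
from collections import deque
from typing import Callable, Dict, List


class ToyGraph:
    """Illustrative graph of functions; NOT the Tensorlake Graph class."""

    def __init__(self, start: str):
        self.start = start                        # name of the start node
        self.nodes: Dict[str, Callable] = {}      # node name -> function
        self.edges: Dict[str, List[str]] = {}     # node name -> downstream names

    def add_node(self, name: str, fn: Callable) -> None:
        self.nodes[name] = fn
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def run(self, value):
        """Walk the graph from the start node; return leaf outputs by name."""
        outputs = {}
        queue = deque([(self.start, value)])
        while queue:
            name, data = queue.popleft()
            result = self.nodes[name](data)
            if not self.edges[name]:
                outputs[name] = result            # no outgoing edges: graph output
            for downstream in self.edges[name]:
                queue.append((downstream, result))
        return outputs


graph = ToyGraph(start="split")
graph.add_node("split", lambda text: text.split())
graph.add_node("count", len)
graph.add_node("upper", lambda words: [w.upper() for w in words])
graph.add_edge("split", "count")
graph.add_edge("split", "upper")
result = graph.run("hello graph world")
```

Removing one of the two edges out of split turns this graph into a pipeline, which is the distinction drawn above.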
Functions
Functions are regular Python functions decorated with @tensorlake_function().
Functions can be executed in a distributed manner. Each function's output is stored, so if a downstream function fails, the workflow can resume from that stored output instead of recomputing it.
The decorator accepts other parameters that configure retry behavior, placement constraints, and more.
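To make the idea of decorator-configured retries concrete, here is a plain-Python sketch. The with_retries decorator and its max_retries parameter are hypothetical stand-ins, not parameters of @tensorlake_function(); consult the decorator reference for the real names.

```python
import functools


def with_retries(max_retries: int = 2):
    """Hypothetical decorator: call the function again when it raises."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            # One initial attempt plus up to max_retries further attempts.
            for _attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_error = exc
            raise last_error

        return wrapper

    return decorator


calls = {"count": 0}


@with_retries(max_retries=2)
def flaky_parse(text: str) -> int:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return len(text)


length = flaky_parse("abc")
```

The function body stays ordinary Python; the retry policy lives entirely in the decorator arguments.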
Programming Model
Pipeline
A pipeline passes the graph input through a chain of nodes: each node transforms the output of the previous node until the end node is reached.
Use Cases: Transforming a video into text by first extracting the audio, and then doing Automatic Speech Recognition (ASR) on the extracted audio.
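The pipeline pattern amounts to function composition. The sketch below uses stub functions in place of real audio extraction and ASR; extract_audio, transcribe, and run_pipeline are hypothetical names and do not come from the Tensorlake SDK.

```python
from functools import reduce


def extract_audio(video: dict) -> dict:
    """Stub: pretend to pull the audio track out of a video."""
    return {"audio_of": video["name"], "seconds": video["seconds"]}


def transcribe(audio: dict) -> str:
    """Stub: pretend to run speech recognition on the audio."""
    return f"transcript of {audio['audio_of']} ({audio['seconds']}s)"


def run_pipeline(steps, value):
    """Feed each step the output of the previous one, in order."""
    return reduce(lambda acc, step: step(acc), steps, value)


text = run_pipeline([extract_audio, transcribe],
                    {"name": "talk.mp4", "seconds": 90})
```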
Parallel Branching
Generating more than one graph output for the same graph input in parallel.
Use Cases: Extracting embeddings and structured data from the same unstructured data.
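Parallel branching can be approximated locally by submitting several functions with the same input to a thread pool. The embed and extract_fields functions are toy stand-ins, and run_branches is a hypothetical helper; threads on one machine are used here in place of a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def embed(text: str) -> list:
    """Toy 'embedding': one number per word (its length)."""
    return [float(len(word)) for word in text.split()]


def extract_fields(text: str) -> dict:
    """Toy structured extraction: count the words."""
    return {"word_count": len(text.split())}


def run_branches(branches: dict, value):
    """Run every branch on the same input concurrently; collect by name."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, value) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


outputs = run_branches(
    {"embedding": embed, "fields": extract_fields},
    "parse this document",
)
```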
Map
Automatically parallelize functions across multiple machines when a function returns a sequence and the downstream function accepts only a single element of that sequence.
Use Cases: Generating an embedding for every chunk of a document.
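The map behavior described above can be sketched as follows: the upstream function returns a list, and the downstream function, which takes a single element, is applied to each item concurrently. The names chunk, embed_chunk, and run_map are hypothetical, and a local thread pool stands in for fan-out across machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunk(document: str) -> List[str]:
    """Upstream function: returns a sequence of chunks."""
    return [part for part in document.split("\n\n") if part]


def embed_chunk(text: str) -> int:
    """Downstream function: accepts ONE chunk (toy embedding = its length)."""
    return len(text)


def run_map(producer, consumer, value) -> list:
    """Fan the producer's sequence out over the single-element consumer."""
    items = producer(value)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the input order of the items.
        return list(pool.map(consumer, items))


sizes = run_map(chunk, embed_chunk, "first chunk\n\nsecond\n\nthird one")
```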
Map Reduce - Reducing/Accumulating from Sequences
Reduce functions in Tensorlake Serverless aggregate outputs from one or more functions that return sequences. They operate with the following characteristics:
- Lazy Evaluation: Reduce functions are invoked incrementally as elements become available for aggregation. This allows for efficient processing of large datasets or streams of data.
- Stateful Aggregation: The aggregated value is persisted between invocations. Each time the Reduce function is called, it receives the current aggregated state along with the new element to be processed.
Use Cases: Aggregating a summary from hundreds of web pages.
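The two characteristics above, incremental invocation and state carried between calls, can be illustrated with a generator feeding a reducer one element at a time. Summary, fetch_pages, accumulate, and run_reduce are hypothetical names, and no real web pages are fetched.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Summary:
    """Aggregated state passed into every reducer call."""
    pages: int = 0
    words: int = 0


def fetch_pages(urls: Iterable[str]) -> Iterator[str]:
    """Stub producer: yields page text lazily, one page at a time."""
    for url in urls:
        yield f"content of {url}"


def accumulate(state: Summary, page: str) -> Summary:
    """Reducer: receives the current state plus one new element."""
    return Summary(pages=state.pages + 1,
                   words=state.words + len(page.split()))


def run_reduce(reducer, items, initial):
    """Invoke the reducer as each element becomes available."""
    state = initial
    for item in items:
        state = reducer(state, item)
    return state


summary = run_reduce(accumulate,
                     fetch_pages(["a.example", "b.example"]),
                     Summary())
```

Because the reducer only ever sees the current state and one element, the full sequence never has to be held in memory at once.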
Dynamic Routing
Functions can route data to different nodes based on custom logic, enabling dynamic branching.
Use Cases: Processing outputs differently based on classification results.
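Dynamic routing can be sketched as a classifier whose result selects the next function. The classify rule, the handlers, and the ROUTES table are all hypothetical; how routing is declared in Tensorlake itself is not shown here.

```python
def classify(document: str) -> str:
    """Toy classifier: anything mentioning 'total due' is an invoice."""
    return "invoice" if "total due" in document.lower() else "letter"


def handle_invoice(document: str) -> str:
    return f"invoice:{len(document)}"


def handle_letter(document: str) -> str:
    return f"letter:{len(document)}"


# Classification label -> downstream function.
ROUTES = {"invoice": handle_invoice, "letter": handle_letter}


def route(document: str) -> str:
    """Pick the downstream function from the classification result."""
    label = classify(document)
    return ROUTES[label](document)


invoice_result = route("Total due: $40")
letter_result = route("Dear Sam")
```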