Tensorlake helps you turn unstructured documents into structured, actionable data through our Document Ingestion API. Continue reading to learn about the essential concepts and functionality that enable you to parse documents and extract data with Tensorlake.

Document AI Client

What it is: The main entry point for interacting with Tensorlake. It provides methods for uploading documents, creating parsing jobs, and retrieving results.

Why it matters: This is where you configure your parsing options, upload files, and manage the parsing workflow.
from tensorlake.documentai import DocumentAI

API_KEY="tl__apiKey_xxxx"
doc_ai = DocumentAI(api_key=API_KEY)
Learn how to get your API key from Tensorlake Cloud.
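
Hard-coding the key, as in the snippet above, works for a quick test, but a common alternative is to read it from the environment. The following is a minimal sketch of that idea. The variable name `TENSORLAKE_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something this page specifies.

```python
import os


def load_api_key(env=None, var_name="TENSORLAKE_API_KEY"):
    """Return the API key stored in an environment-style mapping.

    `var_name` is an assumed variable name chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the client")
    return key
```

The returned value can then be passed to the client, for example `DocumentAI(api_key=load_api_key())`.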

Document Upload

What it is: The first step in any ingestion workflow. Tensorlake accepts PDFs, images, raw text, presentations, and more. Once your document (or data) is uploaded, it is considered a file. Each file is assigned a file_id, which is used in parsing jobs.

Why it matters: Uploading documents enables asynchronous processing and orchestration.
file_id = doc_ai.upload(path="/path/to/file.pdf")

Parsing Jobs

What it is: A parsing job is the process Tensorlake uses to analyze a document and return structured output. It uses the configured ParsingOptions to determine how the document should be processed.

Why it matters: This is where you define behaviors like schema extraction, signature detection, table parsing, and more.
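from tensorlake.documentai import ParsingOptions

# Start a parsing job for the uploaded file using the default options.
job_id = doc_ai.parse(file_id, ParsingOptions())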
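
Because parsing runs asynchronously, a caller usually has to wait for the job to finish before reading its output. The sketch below shows one generic way to poll. It deliberately does not call the SDK: `fetch_job` is a hypothetical callable that you would wrap around whatever result-retrieval method your SDK version provides, and it is assumed to return a dict with a `status` key. Only the `"successful"` status appears on this page; the `"failure"` value is an assumption.

```python
import time


def wait_for_job(fetch_job, job_id, done=("successful",), failed=("failure",),
                 interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job is a hypothetical stand-in for the SDK's result lookup.
    The status names in `failed` are assumptions; adjust them to your API.
    """
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status in done:
            return job
        if status in failed:
            raise RuntimeError(f"Job {job_id} ended with status {status!r}")
        if waited >= timeout:
            raise TimeoutError(f"Job {job_id} still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the helper easy to test, since a no-op can replace the real delay.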

Parsing Options

What it is: Controls how Tensorlake parses the document. This includes chunking, table strategies, signature detection, OCR preferences, and more.

Why it matters: You can fine-tune performance and accuracy by customizing your parsing strategy.
options = ParsingOptions(
    page_range='1',
    chunk_strategy=ChunkingStrategy.NONE,
    table_parsing_strategy=TableParsingStrategy.TSR,
    table_output_mode=TableOutputMode.MARKDOWN,
    form_detection_mode=FormDetectionMode.VLM,
    table_summarization=True,
    extraction_option=ExtractionOptions(
        skip_ocr=True,
    )
)
Learn more about Parsing Options, including Signature Detection, Strikethrough Detection, and Table Parsing.

Schemas

What it is: Schemas define what structured data you want extracted. They can include keys like buyer_name, coverage_type, or signature_status, and can be supplied as JSON or an inline string.

Why it matters: Schemas constrain extraction to the fields you define, so the output has a predictable shape that maps directly onto your business logic.
signature_status_schema.json
{
  "buyer": {
    "buyer_name": "string",
    "buyer_signed": {
      "description": "Determine if the buyer signed the agreement",
      "type": "boolean"
    }
  },
  "seller": {
    "seller_name": "string",
    "seller_signed": {
      "description": "Determine if the seller signed the agreement",
      "type": "boolean"
    }
  }
}
Learn how to define schemas here.
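
To check what a schema file asks for before sending it, it can help to list its fields. The sketch below walks the example schema above and prints each field with its type. The reading of the two field styles (a bare type string, or an object with `description` and `type`) is inferred from that example rather than taken from a schema specification, and `list_fields` is a helper written for this illustration.

```python
import json

SCHEMA_JSON = """
{
  "buyer": {
    "buyer_name": "string",
    "buyer_signed": {
      "description": "Determine if the buyer signed the agreement",
      "type": "boolean"
    }
  },
  "seller": {
    "seller_name": "string",
    "seller_signed": {
      "description": "Determine if the seller signed the agreement",
      "type": "boolean"
    }
  }
}
"""


def list_fields(schema, prefix=""):
    """Yield (dotted_path, type_name) for every leaf field in the schema."""
    for name, spec in schema.items():
        path = f"{prefix}{name}"
        if isinstance(spec, str):
            # Short form: "field": "string"
            yield path, spec
        elif isinstance(spec, dict) and isinstance(spec.get("type"), str):
            # Long form (assumed): an object carrying "type" and "description"
            yield path, spec["type"]
        elif isinstance(spec, dict):
            # Anything else is treated as a nested group of fields
            yield from list_fields(spec, path + ".")
        else:
            raise ValueError(f"Unsupported field definition at {path!r}")


fields = dict(list_fields(json.loads(SCHEMA_JSON)))
for path, type_name in fields.items():
    print(f"{path}: {type_name}")
```

One limitation of this sketch: a group that contains a short-form field literally named `type` would be mistaken for a long-form field.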

Structured Output

What it is: The output returned by Tensorlake after parsing. Output includes a structured JSON representation of your document, including bounding boxes, page numbers, and fragment types. If you provided a schema, the output also includes structured data that matches your schema.

Why it matters: This output is machine-readable, auditable, and easy to plug into downstream systems like LangGraph, Slack, or CRMs. For example, here is a snippet of the output for a sample document, parsed with the example schema above.
{
  "id": "job-***",
  "status": "successful",
  "file_name": "file_name.pdf",
  "file_id": "tensorlake-***",
  "trace_id": "***",
  "createdAt": null,
  "updatedAt": null,
  "outputs": {
    "chunks": [
      {
        "page_number": 0,
        "content": "Full text of the document as markdown, broken down by page"
      }
    ],
    "document": {
      "pages": [
        {
          "page_number": 1,
          "page_fragments": [
            {...},
            {
              "fragment_type": "text",
              "content": {
                "content": "XXIV. GOVERNING LAW. This Agreement shall be interpreted in accordance with the laws in the state of California (\"Governing Law\")."
              },
              "reading_order": null,
              "page_number": null,
              "bbox": {
                "x1": 71.0,
                "x2": 527.0,
                "y1": 238.0,
                "y2": 264.0
              }
            },
            {...}
          ],
          "layout": {}
        }
      ]
    },
    "num_pages": 10,
    "structured_data": {
      "pages": [
        {
          "page_number": 1,
          "data": {
            "buyer": {
              "buyer_name": "Nova Ellison",
              "buyer_signature_date": "September 10, 2025",
              "buyer_signed": true
            },
            "seller": {
              "seller_name": "Juno Vega",
              "seller_signature_date": "September 10, 2025",
              "seller_signed": true
            }
          }
        }
      ]
    },
    "error_message": ""
  }
}
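
Downstream code typically reads the `structured_data` section of this output. The sketch below assumes the result has already been loaded into a Python dict with the shape shown above, and that each party's signature flag is named `<party>_signed`, as in the example schema. The helper `unsigned_parties` is hypothetical and exists only for this illustration.

```python
# Trimmed copy of the example output above, keeping only the fields used here.
sample_result = {
    "status": "successful",
    "outputs": {
        "structured_data": {
            "pages": [
                {
                    "page_number": 1,
                    "data": {
                        "buyer": {"buyer_name": "Nova Ellison", "buyer_signed": True},
                        "seller": {"seller_name": "Juno Vega", "seller_signed": True},
                    },
                }
            ]
        }
    },
}


def unsigned_parties(result):
    """Return (page_number, party) pairs whose '<party>_signed' flag is not True."""
    missing = []
    pages = result.get("outputs", {}).get("structured_data", {}).get("pages", [])
    for page in pages:
        for party, fields in page.get("data", {}).items():
            # A missing flag is treated the same as an explicit False.
            if fields.get(f"{party}_signed") is not True:
                missing.append((page.get("page_number"), party))
    return missing


print(unsigned_parties(sample_result))  # [] because both parties signed
```

Treating a missing flag as unsigned is a conservative choice for audit-style checks; relax it if your schema makes the field optional.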

Visual Layout & Bounding Boxes

What it is: Each extracted field includes optional layout metadata, such as its position on the page, size, and surrounding context.

Why it matters: Useful for visual validation, audit trails, redlining, and debugging extraction behavior. You can see the bounding boxes on the document in the Playground, and the location of the bounding box for each fragment in the structured output:
{
    "fragment_type": "text",
    "content": {
        "content": "XXIV. GOVERNING LAW."
    },
    "reading_order": null,
    "page_number": null,
    "bbox": {
        "x1": 71.0,
        "x2": 527.0,
        "y1": 238.0,
        "y2": 264.0
    }
}
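
For visual validation, the `bbox` values can be used to compute a fragment's size or to test whether a point falls inside it. The sketch below assumes `(x1, y1)` and `(x2, y2)` are opposite corners in the same page coordinate system, with `x1 <= x2` and `y1 <= y2`. This page does not state the units or origin, so treat those as assumptions; `bbox_size` and `contains` are helper names introduced here.

```python
fragment = {
    "fragment_type": "text",
    "content": {"content": "XXIV. GOVERNING LAW."},
    "bbox": {"x1": 71.0, "x2": 527.0, "y1": 238.0, "y2": 264.0},
}


def bbox_size(bbox):
    """Return (width, height) of a bounding box."""
    return bbox["x2"] - bbox["x1"], bbox["y2"] - bbox["y1"]


def contains(bbox, x, y):
    """Return True if the point (x, y) lies inside the box, edges included."""
    return bbox["x1"] <= x <= bbox["x2"] and bbox["y1"] <= y <= bbox["y2"]


print(bbox_size(fragment["bbox"]))  # (456.0, 26.0)
```

A hit test like `contains` is enough to map a click on a rendered page back to the fragment under the cursor, provided the click is first converted into the same coordinate system.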