The most basic use cases of the Document Ingestion API are:

  • Convert a document to Markdown for feeding into an LLM.
  • Extract structured data from a document according to a JSON schema.

In this quickstart you will convert a real estate purchase agreement into Markdown chunks and extract structured data from it using a schema.

Prerequisites

  • Python 3.10+
  • A Tensorlake API key

1: Install the SDK

pip install tensorlake

2: Set Your API Key

Export the variable:

export TENSORLAKE_API_KEY=your-api-key-here
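
If the variable is missing, the script in the next step would pass None as the API key. A small guard like the one below can surface that early. This is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper name and is not part of the Tensorlake SDK.

```python
import os


def require_api_key(env=None, name="TENSORLAKE_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `require_api_key` is a hypothetical helper, not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key
```

The result could then be passed as `api_key=require_api_key()` when constructing the client.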

3: Parse a Document

quickstart.py
import os
import json
from typing import Optional

from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.parse import ParsingOptions, ExtractionOptions, ChunkingStrategy

# Create the client using the API key exported in step 2.
doc_ai = DocumentAI(api_key=os.getenv("TENSORLAKE_API_KEY"))

# URL of the sample PDF to parse.
file_id = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf"

class AgreementDetails(BaseModel):
    buyer_name: Optional[str] = Field(description="The name of the buyer")
    buyer_signature_date: Optional[str] = Field(description="Date and time that the buyer signed.")
    seller_name: Optional[str] = Field(description="The name of the seller")
    seller_signature_date: Optional[str] = Field(description="Date and time that the seller signed.")

options = ParsingOptions(
    detect_signature=True,
    chunking_strategy=ChunkingStrategy.PAGE,
    extraction_options=ExtractionOptions(
        schema=AgreementDetails
    ),
)

job_id = doc_ai.parse(file_id, options)
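
For readers less familiar with Pydantic, it may help to see the kind of JSON schema that the AgreementDetails model describes. The dictionary below is a hand-written approximation built with the standard library only. The exact schema the SDK serializes and sends is an assumption here and may differ, for example in how optional fields are encoded.

```python
import json

# Field names and descriptions copied from the AgreementDetails model.
FIELDS = {
    "buyer_name": "The name of the buyer",
    "buyer_signature_date": "Date and time that the buyer signed.",
    "seller_name": "The name of the seller",
    "seller_signature_date": "Date and time that the seller signed.",
}


def build_schema(fields):
    """Build an approximate JSON schema: every field is a nullable string."""
    return {
        "title": "AgreementDetails",
        "type": "object",
        "properties": {
            name: {"type": ["string", "null"], "description": description}
            for name, description in fields.items()
        },
    }


# Approximation only; not necessarily what the SDK generates.
agreement_schema = build_schema(FIELDS)
print(json.dumps(agreement_schema, indent=2))
```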

4: Wait for the job to complete

quickstart.py
result = doc_ai.wait_for_completion(job_id)
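
The call above blocks until the parse job finishes. How the SDK waits internally is not covered here; a common pattern for this kind of call is polling with a timeout. The sketch below illustrates that general pattern only. The `fetch_status` callable and the status strings are hypothetical and do not describe the SDK's implementation.

```python
import time


def poll_until_done(fetch_status, interval=1.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns a terminal state.

    fetch_status and the status names below are hypothetical placeholders.
    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    terminal_states = ("successful", "failure")
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.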

5: Use the results

quickstart.py
structured_data = result.outputs.structured_data
markdown_chunks = result.outputs.chunks


with open("structured_data.json", "w") as f:
    json.dump(structured_data.model_dump(), f, indent=2)

with open("markdown_chunks.md", "w") as f:
    for chunk_number, chunk in enumerate(markdown_chunks):
        f.write(f"## CHUNK NUMBER {chunk_number}\n\n")
        f.write(f"## Page {chunk.page_number}\n\n{chunk.content}\n\n")

What You’ll See

Running the script produces:

  • Two files, structured_data.json and markdown_chunks.md, containing the structured data and the Markdown chunks.
structured_data.json
{
  "pages": [
    {
      "page_number": 0,
      "data": {
        "buyer": {
          "buyer_name": "Nova Ellison",
          "buyer_signature_date": "September 10, 2025"
        },
        "seller": {
          "seller_name": "Juno Vega",
          "seller_signature_date": "September 10, 2025"
        }
      }
    }
  ]
}
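
Downstream code usually needs the extracted values rather than the nested file. The sketch below reads a trimmed copy of the sample output and merges the per-party objects into one flat dictionary. It assumes the layout shown in the sample (a `pages` list whose `data` values are objects); `flatten_parties` is a hypothetical helper name.

```python
import json

# Trimmed copy of the sample output, embedded so the example is self-contained.
SAMPLE = """
{
  "pages": [
    {
      "page_number": 0,
      "data": {
        "buyer": {
          "buyer_name": "Nova Ellison",
          "buyer_signature_date": "September 10, 2025"
        },
        "seller": {
          "seller_name": "Juno Vega",
          "seller_signature_date": "September 10, 2025"
        }
      }
    }
  ]
}
"""


def flatten_parties(document):
    """Merge the per-party objects found on every page into one flat dict."""
    flat = {}
    for page in document.get("pages", []):
        for party_fields in page.get("data", {}).values():
            flat.update(party_fields)
    return flat


details = flatten_parties(json.loads(SAMPLE))
print(details["buyer_name"], "signed on", details["buyer_signature_date"])
```

In the quickstart itself, the same function could be applied to the contents of structured_data.json loaded with `json.load`.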

Next Steps