Build Smart Document Understanding Agents with TensorLake and OpenAI Agent SDK

Agentic applications rely on accurate, structured inputs to make decisions. Tensorlake enables this by extracting structured document data that agents can reason over—no guesswork or prompting required. This turns your documents into action-ready inputs for any AI Agent framework.

Parse research papers and ask natural language questions about them with OpenAI and Tensorlake using this notebook:

Here is an example of how to use the Tensorlake Python SDK to parse a research paper and extract key information. With this information, build AI Agents that have more detailed and accurate information ready for natural language query engagement.

Prerequisites

Get your Tensorlake API key
Install the Tensorlake SDK with pip install tensorlake

Import packages, setup client, and define file path

import json
from pydantic import BaseModel

from tensorlake.documentai import ( 
        DocumentAI, 
        ParseStatus, 
        ParsingOptions,
        StructuredExtractionOptions
      )

# Initialize the client and add your API Key
doc_ai = DocumentAI()

# Upload the document
file_path = "https://tlake.link/docs/research-paper"

Define the schema

Define a Pydantic model to extract relevant information from the research paper.

class ResearchPaperSchema(BaseModel):
    """Schema focusing on the most critical information from the research papers"""

    title: str = Field(description="Title of the research paper")
    authors: List[str] = Field(description="List of author names")
    abstract: str = Field(description="Abstract of the paper")

    research_problem: str = Field(description="What problem does this paper solve?")
    main_approach: str = Field(description="What is the main approach or method used?")
    key_contributions: List[str] = Field(description="What are the 3-5 most important contributions?")

    methodology_summary: str = Field(description="Brief summary of the research methodology")
    datasets_used: Optional[List[str]] = Field(description="Datasets mentioned in the paper", default=None)
    evaluation_metrics: Optional[List[str]] = Field(description="How do they measure success?", default=None)

    related_work_summary: Optional[str] = Field(description="Brief summary of how this relates to existing work", default=None)
    limitations: Optional[List[str]] = Field(description="What limitations do the authors acknowledge?", default=None)

Parse the document with the Python SDK

# Configure parsing options for academic papers
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.PAGE
)

# Configure structured extraction
structured_extraction_options = StructuredExtractionOptions(
    schema_name="Research Paper Analysis",
    json_schema=ResearchPaperSchema
)

# Parse the document with the specified extraction options
parse_id = doc_ai.parse(file_path, parsing_options=parsing_options, structured_extraction_options=[structured_extraction_options])

print(f"Parse job submitted with ID: {parse_id}")

# Wait for completion
result = doc_ai.wait_for_completion(parse_id)

Review the output

The result will include the extracted data, all of the markdown chunks, and the entire document layout.

# Print the structured data extracted
print(json.dumps(result.structured_data[0].data, indent=2))

# Get all of the markdown chunks (by page)
for index, chunk in enumerate(result.chunks):
    print(f"Chunk {index}:")
    print(chunk.content)

The output will be:

Structured Data

{
    "abstract": "A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have numerous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. Meanwhile, we utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No.3 position on the MTEB leaderboard (as of December 24, 2024), achieving an average 71.54 score across 56 datasets. We have released the model and data on the Hugging Face Hub, and the training codes are available in this project repository.",
    "authors": [
        "Dun Zhang",
        "Jiacheng Li",
        "Ziyang Zeng",
        "Fulong Wang"
    ],
    "datasets_used": [
        "sentence-transformers/embedding-training-data",
        "BAAI/Infinity-MM"
    ],
        "evaluation_metrics": [
        "average score on MTEB leaderboard across 56 datasets"
    ],
    "key_contributions": [
        "Propose a novel multi-stage distillation framework for reducing model size without significantly losing performance.",
        "Develop the Jasper model with 2 billion parameters that perform comparably to models with 7 billion parameters.",
        "Use Matryoshka Representation Learning (MRL) to reduce vector dimensionality efficiently.",
        "Publication of three tailored loss functions to enhance distillation learning.",
        "Release of model and data on Hugging Face Hub."
    ],
    "limitations": [
        "The paper does not conduct experiments to evaluate the proposed approach for self-distillation in detail.",
        "Stage 4 only achieves preliminary alignment between text and image modalities, indicating room for improvement."
    ],
    "main_approach": "The main approach is a multi-stage distillation framework that involves distilling information from larger teacher models to a smaller student model using three specific loss functions, combined with Matryoshka Representation Learning (MRL) for dimensionality reduction.",
    "methodology_summary": "The methodology involves a four-stage distillation process where a smaller student model distills information from larger teacher models using specifically designed loss functions to learn effective text representations while employing MRL for dimensionality reduction. Subsequent stages focus on enhanced dimension reduction and unlocking multimodal potential through incorporating vision encodings.",
    "related_work_summary": "The paper builds on existing dense retrieval and knowledge distillation methodologies, emphasizing enhanced retrieval training efficiency and effectiveness, with references to prior works on knowledge distillation and representation learning.",
    "research_problem": "The paper addresses the challenge of deploying high-performing dense retrieval models with large parameters and vector dimensions in practical applications by proposing a distillation framework to reduce model size while maintaining performance.",
    "title": "Jasper and Stella: distillation of SOTA embedding models"
}

With Tensorlake parse output you have accurate, detailed, and precise data that is much more useful for AI Agents compared to only providing the agent with a PDF.

Cookbooks

Tutorials

Build Smart Document Understanding Agents with TensorLake and OpenAI Agent SDK