1. Prerequisites
- Get your Tensorlake API key
- Install the dependencies:
pip install tensorlake "chonkie[model2vec]"
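If you prefer not to hard-code the API key, export it as an environment variable and read it at runtime. A minimal sketch, assuming you exported the key as TENSORLAKE_API_KEY:

```python
import os

# Read the Tensorlake API key from the environment (e.g. after `export TENSORLAKE_API_KEY=...`)
your_tensorlake_api_key = os.environ.get("TENSORLAKE_API_KEY")
if not your_tensorlake_api_key:
    raise RuntimeError("TENSORLAKE_API_KEY is not set")
```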
2. Import packages and set up the client
import json

from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    TableParsingFormat,
    TableOutputMode,
    PageFragmentType
)

# Initialize the client
doc_ai = DocumentAI(api_key=your_tensorlake_api_key)  # or set the TENSORLAKE_API_KEY environment variable
3. Upload the document
For this example, we’ll use a research paper:
file_path = "https://tlake.link/docs/sota-research-paper"
4. Define the schema
Build a small JSON schema to extract what you need from each research paper.
research_paper_schema = {
    "title": "ResearchPaper",
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {"type": "array", "items": {"type": "string"}, "description": "Author names"},
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {"type": "array", "items": {"type": "string"}, "description": "Key terms"},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string", "description": "Section heading"},
                    "level": {"type": "integer", "description": "Heading level (1-6)"}
                }
            }
        }
    }
}
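Before sending the schema to Tensorlake, it can be useful to sanity-check it against a hand-written sample record. A minimal sketch using the third-party jsonschema package (an extra dependency not listed in the prerequisites; install it with pip install jsonschema):

```python
from jsonschema import ValidationError, validate

# A toy record shaped like the structured output we expect back (illustrative values only)
sample_paper = {
    "title": "Example Paper",
    "authors": ["A. Author", "B. Author"],
    "abstract": "A short abstract.",
    "keywords": ["example"],
    "sections": [{"heading": "1 Introduction", "level": 2}],
}

try:
    validate(instance=sample_paper, schema=research_paper_schema)
    print("Schema accepts the sample record")
except ValidationError as err:
    print(f"Schema problem: {err.message}")
```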
5. Parse with Tensorlake
Configure parsing and structured extraction.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.NONE,  # Parse the document as markdown and let Chonkie handle the chunking strategy
    cross_page_header_detection=True
)

structured_extraction = StructuredExtractionOptions(
    schema_name="Research Paper Analysis",
    json_schema=research_paper_schema
)

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize this figure/chart in the context of this research paper, focusing on key findings and data insights.",
    table_summarization=True,
    table_summarization_prompt="Summarize this table's data and its significance to the research findings."
)

parse_id = doc_ai.parse(
    file_path,
    parsing_options=parsing_options,
    structured_extraction_options=[structured_extraction],
    enrichment_options=enrichment_options
)
print(f"Parse job submitted with ID: {parse_id}")

result = doc_ai.wait_for_completion(parse_id)
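Note that structured_extraction_options accepts a list, so a single parse job can extract more than one schema. A hypothetical sketch that adds a second, references-only schema alongside the one above (references_schema and its fields are illustrative, not part of this tutorial):

```python
# Hypothetical second schema: pull out just the cited works
references_schema = {
    "title": "References",
    "type": "object",
    "properties": {
        "references": {"type": "array", "items": {"type": "string"}, "description": "Cited works"}
    }
}

references_extraction = StructuredExtractionOptions(
    schema_name="Reference List",
    json_schema=references_schema
)

# Both schemas are extracted in the same parse job; wait for completion as shown above
parse_id = doc_ai.parse(
    file_path,
    parsing_options=parsing_options,
    structured_extraction_options=[structured_extraction, references_extraction],
    enrichment_options=enrichment_options
)
```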
6. Review the Tensorlake output
You’ll receive both structured data and a single chunk covering the entire document:
paper_metadata = result.structured_data[0].data if result.structured_data else {}
print(paper_metadata)

# With ChunkingStrategy.NONE, the parse returns the full document markdown as a single chunk
full_markdown = "\n\n".join(chunk.content for chunk in result.chunks)
print(full_markdown)
print(f"\nTensorlake returned {len(result.chunks)} markdown chunk(s) for the research paper")

# Extract the tables and print out their summaries
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.TABLE:
            print(f"Table {i} in page {page.page_number}: {fragment.content.summary}\n")
            print("--------------\n")

# Extract the figures and print out their summaries
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.FIGURE:
            print(f"figure {i} in page {page.page_number}: {fragment.content.summary}\n")
Example output:
- Structured Data:
{
'abstract': 'A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have numerous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. Meanwhile, we utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No.3 position on the MTEB leaderboard (as of December 24, 2024), achieving average 71.54 score across 56 datasets. We have released the model and data on the Hugging Face Hub, and the training codes are available in this project repository.',
'authors': ['Dun Zhang', 'Jiacheng Li', 'Ziyang Zeng', 'Fulong Wang'],
'keywords': ['Dense Retrieval',
'Embedding Models',
'Knowledge Distillation',
'Matryoshka Representation Learning',
'Multi-Modal Learning'],
'sections': [{'heading': 'Abstract', 'level': 2},
{'heading': '1 Introduction', 'level': 2},
{'heading': '2 Methods', 'level': 2},
{'heading': '2.1 Definitions', 'level': 3},
{'heading': '2.2 Model Architecture', 'level': 3},
{'heading': '2.3 Stage 1&2: Distillation from Multiple Teachers', 'level': 3},
{'heading': '2.4 Stage 3: Dimension Reduction', 'level': 3},
{'heading': '2.5 Stage 4: Unlock Multimodal Potential', 'level': 3},
{'heading': '3 Experiments', 'level': 2},
{'heading': '3.1 Implementation details', 'level': 3},
{'heading': '3.2 Datasets', 'level': 3},
{'heading': '3.3 Results', 'level': 3},
{'heading': '4 Discussion', 'level': 2},
{'heading': '4.1 Instruction Robustness', 'level': 3},
{'heading': '4.2 Possible Improvements for Vision Encoding', 'level': 3},
{'heading': '5 Conclusion', 'level': 2},
{'heading': 'References', 'level': 2}],
'title': 'Jasper and Stella: distillation of SOTA embedding models'
}
- Chunks:
**Page 3: Figure 1 — Jasper Model Architecture**
The Jasper model combines text and image encoders into a unified embedding space.
- Image input → Siglip Vision Encoder → AvgPool2d
- Text input → Stella Embedding → Encoder
- Outputs: multiple FC layers (12288, 1024, 512, 256 dims) for distillation
- Uses cosine, similarity, and relative similarity loss
---
**Page 5: Table 1 — Embedding Model Comparison**
Shows performance across tasks (classification, clustering, retrieval, STS, summarization).
Highlights:
- NV-Embed-v2 (7851M) leads with avg. score 72.31
- Jasper (2B params) achieves 71.54 — competitive with 7B+ models
- Demonstrates smaller models can rival larger ones with efficient distillation
- Jasper excels in summarization and STS tasks
---
**Page 6: Table 2 & 3 — Instruction Variations**
Evaluates robustness to prompt changes:
- “Classify sentiment” vs. “Determine sentiment” → nearly identical scores
- Average score improves from 0.686 → 0.687
- Confirms Jasper generalizes across different instruction phrasings
- Tables:
Table 1 in page 5: This table presents the performance of various embedding models across different NLP tasks, including Classification, Clustering, PairClassification, Reranking, Retrieval, STS (Semantic Textual Similarity), and Summarization. The "Average (56 datasets)" column provides an overall performance score across a broader set of datasets.
**Summary of the Data:**
* **Model Performance:** The models are ranked by their average performance.
* **NV-Embed-v2** leads with an average score of 72.31, showing strong performance across most tasks, particularly Classification (90.37) and STS (84.31).
* **Jasper (our model)**, despite being a smaller model (1543M + 400M), achieves a highly competitive average score of 71.54, placing it second overall. It performs very well in Classification (88.49), PairClassification (88.07), and STS (84.67), and notably excels in Summarization (31.42), the highest among all models.
* **Bge-en-icl** and **Stella_en_1.5B_v5** also show strong overall performance (71.67 and 71.19 respectively), with Stella being a much smaller model.
* **SFR-Embedding-2_R** performs reasonably well but slightly lower than the top models.
* **gte-Qwen2-1.5B-instruct** and **voyage-lite-02-instruct** are the lowest performing models in this comparison, particularly in Clustering and Retrieval.
* **Model Size vs. Performance:**
* Larger models like NV-Embed-v2 (7851M) and Bge-en-icl (7111M) generally perform well.
* However, **Jasper (our model)** and **Stella_en_1.5B_v5** demonstrate that competitive or even superior performance can be achieved with significantly smaller model sizes (around 1.5B parameters) compared to the 7B+ parameter models.
**Significance to the Research Findings:**
The table's data is central to demonstrating the effectiveness and efficiency of the "Jasper" model, which is the focus of this research.
1. **Competitive Performance:** The data clearly shows that Jasper achieves an average score of 71.54, placing it second overall among the evaluated models. This is a significant finding as it indicates that Jasper is a highly competitive embedding model, performing on par with or even outperforming many established models.
2. **Efficiency and Resource Optimization:** A key significance is Jasper's performance relative to its size. At 1543M + 400M parameters, Jasper is considerably smaller than the top-performing NV-Embed-v2 (7851M) and Bge-en-icl (7111M). The fact that Jasper can achieve such high performance with a much smaller footprint (approximately 1/5th the size of the largest models) is a crucial finding. This suggests that Jasper is more resource-efficient, potentially leading to faster inference times, lower computational costs, and easier deployment in resource-constrained environments.
3. **Task-Specific Strengths:** The table highlights Jasper's particular strengths in Summarization (highest score) and strong performance in Classification, PairClassification, and STS. This indicates that Jasper is a versatile model, capable of handling a variety of NLP tasks effectively.
4. **Validation of Methodology:** The strong performance of Jasper, especially given its smaller size, validates the research's approach (likely distillation-based training as mentioned in the conclusion) for developing high-quality embedding models without necessarily relying on massive parameter counts.
In summary, the table's data underscores that Jasper is a state-of-the-art embedding model that offers a compelling balance of high performance across diverse NLP tasks and remarkable efficiency due to its smaller model size, which is a key contribution of this research.
--------------
Table 1 in page 6: This table, "Table 2: Original instructions and corresponding synonyms," presents a two-column list. The left column, "Original Instruction," contains a series of natural language instructions for various tasks, such as classifying sentiment, retrieving documents, or identifying main categories. The right column, "Synonym of Original Instruction," provides a rephrased or synonymous version of each instruction from the left column.
**Summary of the Table's Data:**
The table essentially showcases pairs of instructions where the "Synonym" column offers an alternative phrasing for the "Original Instruction." For example:
* "Classify the sentiment expressed in the given movie review text from the IMDB dataset" is rephrased as "Determine the sentiment conveyed in the provided movie review text from the IMDB dataset."
* "Retrieve duplicate questions from StackOverflow forum" becomes "Find duplicate questions on the StackOverflow forum."
* "Given a financial question, retrieve user replies that best answer the question" is rephrased as "Given a financial question, find user replies that best answer it."
The rephrasing often involves substituting verbs (e.g., "classify" to "determine," "retrieve" to "find"), simplifying sentence structure, or using slightly different but semantically equivalent terms.
**Significance to the Research Findings:**
This table is crucial for understanding the methodology and the robustness of the research, particularly in the context of the "MTEB Results on different instructions" table (Table 3) and the surrounding text.
1. **Evaluating Model Robustness to Instruction Variation:** The core significance is that the research likely aims to evaluate how well a model (specifically, the designed three-level embedding model mentioned in the text) performs when given different phrasings of the same task instruction. By providing both "Original Instructions" and their "Synonyms," the researchers can test if the model's performance is consistent regardless of minor variations in the prompt. This is a critical aspect of model robustness and generalizability in natural language processing.
2. **Demonstrating the Need for Flexible Instruction Understanding:** If a model's performance drops significantly when given a synonymous instruction compared to the original, it indicates a lack of true understanding or an over-reliance on specific keywords or sentence structures. This table provides the concrete examples of the instruction variations used to probe this aspect.
3. **Context for MTEB Results (Table 3):** The "MTEB Results on different instructions" table (Table 3) directly uses these instructions. The "Original Score" in Table 3 likely corresponds to the model's performance using the "Original Instruction," while the "Score with MTEB Instruction" might correspond to performance using the "Synonym of Original Instruction" or another standardized MTEB-specific phrasing derived from these. The comparison of these scores (e.g., 0.697 vs. 0.687 for "Average Score") directly reflects how the model handles these instruction variations. The fact that the scores are generally close suggests good robustness.
In essence, Table 2 lays the groundwork for demonstrating that the research's proposed model is not just good at following one specific command, but can generalize its understanding to various ways of expressing the same command, which is a highly desirable trait for practical NLP applications.
--------------
Table 3 in page 6: The table "MTEB Results on different instructions" compares the performance of various NLP tasks (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization) under two conditions: "Original Score" (using original instructions) and "Score with Modified Instructions" (using instructions modified for better performance, as detailed in Table 2).
**Summary of the Data:**
* **Overall Improvement:** The "Average Score" across all tasks shows a notable improvement from 0.686 (Original Score) to 0.687 (Score with Modified Instructions). While seemingly small, this indicates a general positive impact of instruction modification.
* **Task-Specific Variations:**
* **Classification:** Most classification tasks show slight improvements or remain stable. For instance, "MTOPDomainClassification" goes from 0.992 to 0.992, while "AmazonCounterfactualClassification" improves from 0.958 to 0.959. "TweetSentimentExtractionClassification" sees a more significant jump from 0.773 to 0.776.
* **Clustering:** Similar to classification, clustering tasks generally show minor improvements. "StackExchangeClusteringP2P" improves from 0.494 to 0.495, and "TwentyNewsgroupsClustering" from 0.630 to 0.630.
* **Pair Classification:** Both "SprintDuplicateQuestions" and "TwitterSemEval2015" show improvements.
* **Reranking:** "AskUbuntuDupQuestions" improves from 0.674 to 0.676.
* **Retrieval:** Many retrieval tasks show improvements, some more substantial than others. For example, "CQADupstackMathematicsRetrieval" improves from 0.369 to 0.370, and "TRECOVID" from 0.865 to 0.866. Some tasks like "CQADupstackEnglishRetrieval" show no change.
* **STS (Semantic Textual Similarity):** Most STS tasks show improvements, with "STS12" going from 0.897 to 0.898 and "STSBenchmark" from 0.888 to 0.886 (a slight decrease here).
* **Summarization:** "SummEval" shows a slight improvement from 0.313 to 0.314.
**Significance to the Research Findings:**
The table's data is crucial as it directly supports the research's central hypothesis: **that carefully crafted instructions can significantly impact the performance of large language models (LLMs) on various downstream NLP tasks.**
* **Validation of Instruction Engineering:** The consistent, albeit sometimes small, improvements across a wide range of tasks demonstrate the effectiveness of "instruction engineering" or "prompt engineering." This highlights that how a task is presented to an LLM (via its instructions) is not trivial and can indeed unlock better performance.
* **Implications for Model Deployment and Fine-tuning:** The findings suggest that instead of solely relying on model architecture improvements or extensive fine-tuning, optimizing the input instructions can be a cost-effective and efficient way to boost performance. This is particularly relevant for deploying LLMs in real-world applications where fine-tuning might be resource-intensive.
* **Understanding LLM Sensitivity:** The varying degrees of improvement across tasks indicate that LLMs are differentially sensitive to instruction modifications depending on the task type and specific dataset. This provides insights into the internal workings and sensitivities of these models.
* **Contribution to MTEB (Massive Text Embedding Benchmark):** By showing that instruction modifications can alter benchmark scores, the research contributes to a deeper understanding of how to properly evaluate and compare text embeddings and LLM capabilities. It suggests that standardized instructions are critical for fair comparisons.
* **Future Research Direction:** The results encourage further research into automated or more sophisticated methods for generating optimal instructions, moving beyond manual trial-and-error. The slight decrease in "STSBenchmark" also points to the complexity that not all modifications are universally beneficial and require careful tuning.
In essence, the table provides empirical evidence that **"how you ask" can be as important as "what you ask"** when interacting with large language models, underscoring the growing importance of prompt engineering in the field of natural language processing.
--------------
- Figures:
figure 1 in page 3: This figure illustrates the architecture of the "Jasper" model, designed for multi-modal (image and text) representation learning, specifically in the context of "distillation from multiple teachers."
**Key Findings and Data Insights from the Figure and Context:**
1. **Multi-Modal Input Processing:** The model accepts both image and text inputs.
* **Image Path:** Images are first processed by a "Siglip Vision Encoder" followed by an "AvgPool2d" layer to extract visual features.
* **Text Path:** Text inputs are processed by a "Stella Input Embedding" layer to generate textual features.
2. **Unified Encoder and Representation:** Both visual and textual features are fed into a "Stella Encoder," suggesting a mechanism to learn a unified, multi-modal representation.
3. **Multi-Output Architecture for Teacher Distillation:**
* After the Stella Encoder and "Mean Polling," the model branches into **four distinct Fully Connected (FC) layers (FC1, FC2, FC3, FC4)**, producing output vectors of varying dimensions: 12288, 1024, 512, and 256.
* **Crucial Insight (from text):** The 12288-dimensional output (from FC1) is specifically designed to align with the combined vector dimensions of **two primary teacher models** (M_Ented_en_2^6 and Stella_en_1.5B_v5^6), which have dimensions of 4096 and 8192 respectively (4096 + 8192 = 12288). This highlights a core mechanism for the student model to learn from the combined knowledge of these teachers.
* The presence of multiple output heads (FC1-FC4) suggests the model can learn representations at different granularities or for alignment with various teacher models or tasks, although the text primarily focuses on the 12288-dim output for the initial distillation stages.
4. **Objective of Distillation:** The primary goal of this multi-output structure is to enable the "student model" (Jasper) to effectively learn robust text representations by aligning its output vectors with the corresponding vectors from multiple teacher models. This alignment is achieved through a combination of three carefully designed loss functions (cosine loss, similarity loss, and relative similarity distillation loss), which are applied to these output vectors to minimize the difference between student and teacher representations.
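The sections list in the structured data above is immediately useful on its own. For example, you can print a quick table of contents straight from paper_metadata:

```python
# Print an indented table of contents from the extracted section headings
for section in paper_metadata.get("sections", []):
    indent = "  " * (section["level"] - 1)
    print(f"{indent}{section['heading']}")
```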
7. Semantic Chunking using Chonkie
Now that we have Tensorlake’s rich parsed output, we can leverage Chonkie (one of the fastest-growing chunking libraries in the AI devtools space) to decide how to break the text into chunks for RAG. This ensures our chunks are the right size for embedding models while preserving semantic boundaries.
For research papers and other dense technical documents, we recommend semantic chunking. Unlike recursive chunking, which slices text by token limits, semantic chunking uses embeddings to detect boundaries where topics naturally shift. This produces higher-quality chunks that are easier to retrieve and align closely with the author’s intent.
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.5,
    chunk_size=1024,
    min_sentences=2,
    mode="window"
)

# Chunk the full document markdown that Tensorlake produced in the previous step
semantic_chunks = []
for ch in chunker.chunk(full_markdown):
    if ch.text.strip():
        semantic_chunks.append({
            "text": ch.text,
            "token_count": ch.token_count
        })

print(f"SemanticChunker produced {len(semantic_chunks)} chunks")
SemanticChunker produced 17 chunks
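Before moving on, it is worth checking that the chunks fit comfortably within your embedding model’s context window. A minimal sketch that only inspects the semantic_chunks list built above:

```python
# Quick look at the chunk size distribution
token_counts = [c["token_count"] for c in semantic_chunks]
print(f"chunks: {len(token_counts)}")
print(f"min/avg/max tokens: {min(token_counts)} / {sum(token_counts) / len(token_counts):.0f} / {max(token_counts)}")
```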
8. Review Chonkie’s Semantic Chunks
Now we can review the chunks and see the quality of Chonkie’s semantic chunking. Printing semantic_chunks[7]["text"], for example, we would get one complete, self-contained passage of the paper, which shows that:
- Chunks are semantically meaningful and complete.
- Embeddings formed from these chunks will be of higher quality, since each chunk represents a cohesive unit of meaning.
You can now embed those chunks and store them in a vector DB to build a smart RAG pipeline (see the sketch below); check out our Qdrant tutorial to learn more.
To learn more about the different chunking strategies in Chonkie, check out our blog post “Fix Broken Context in RAG with Tensorlake + Chonkie”.
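As a next step toward that RAG pipeline, here is a hedged sketch of turning the chunks into vector-DB-ready records. It assumes the model2vec package (pulled in by the chonkie[model2vec] extra) exposes StaticModel.from_pretrained and encode as shown; swap in whichever embedding model and vector database client you actually use:

```python
from model2vec import StaticModel

# Embed each chunk with the same small static embedding model used for semantic chunking (assumption)
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
embeddings = model.encode([c["text"] for c in semantic_chunks])

# Plain dict records (id, vector, payload), ready to upsert into the vector DB of your choice
records = [
    {
        "id": i,
        "vector": embedding.tolist(),
        "payload": {"text": chunk["text"], "token_count": chunk["token_count"]},
    }
    for i, (chunk, embedding) in enumerate(zip(semantic_chunks, embeddings))
]
print(f"Prepared {len(records)} records for upsert")
```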