Chonkie is a fast, lightweight chunking library that uses embeddings to detect natural topic boundaries. When combined with Tensorlake’s document parsing, you get intelligent chunking that respects semantic structure. Perfect for research papers, technical documentation, and dense content. Combining Chonkie and Tensorlake eliminates broken context in RAG systems where naive chunking splits thoughts mid-sentence or separates tables from explanations.
Run this end-to-end in Colab:

Why Use Tensorlake + Chonkie?

The Problem:
  • Fixed-size chunking breaks sentences mid-thought and splits tables from context
  • Token-based splitting ignores document structure (sections, subsections, figures)
  • Chunks lose hierarchical meaning, leading to poor retrieval in RAG
  • Dense technical documents need semantic boundaries, not arbitrary character limits
The Solution: Tensorlake preserves document structure during parsing. Chonkie uses embeddings to detect where topics naturally shift. Together, they produce context-preserving chunks that align with the author’s intent.
Key Benefits:
  • Semantic boundaries - Chunks align with natural topic transitions, not arbitrary limits
  • Structure preservation - Respect sections, subsections, and hierarchical organization
  • Better retrieval - Embeddings capture complete thoughts instead of fragmented text
  • Production-ready - Handle research papers, contracts, and technical docs with confidence

Installation

pip install tensorlake "chonkie[model2vec]"

Quick Start

Step 1: Parse Documents with Tensorlake

Tensorlake extracts structured data, tables, and figures while preserving reading order:
from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    PageFragmentType
)

# Initialize client
doc_ai = DocumentAI()

# Define schema for structured extraction
research_paper_schema = {
    "title": "ResearchPaper",
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "level": {"type": "integer", "description": "Heading level (1-6)"}
                }
            }
        }
    }
}

# Configure parsing
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.NONE,  # Let Chonkie handle chunking
    cross_page_header_detection=True
)

structured_extraction = StructuredExtractionOptions(
    schema_name="Research Paper Analysis",
    json_schema=research_paper_schema
)

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize this figure in the context of the research paper.",
    table_summarization=True,
    table_summarization_prompt="Summarize this table's data and significance."
)

# Parse document
file_path = "https://tlake.link/docs/sota-research-paper"
parse_id = doc_ai.parse(
    file_path,
    parsing_options=parsing_options,
    structured_extraction_options=[structured_extraction],
    enrichment_options=enrichment_options
)

result = doc_ai.wait_for_completion(parse_id)

Step 2: (Optional) Understand Structured Data and Content

Review metadata, tables, and figures extracted by Tensorlake:
# Extract metadata
paper_metadata = result.structured_data[0].data if result.structured_data else {}
print(f"Title: {paper_metadata.get('title')}")
print(f"Authors: {', '.join(paper_metadata.get('authors', []))}")

# Get full markdown with preserved structure
full_markdown = result.chunks[0].content if result.chunks else ""

# Extract table summaries
print("\nTable Summaries:")
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.TABLE:
            print(f"Table {i} (Page {page.page_number}): {fragment.content.summary}")

# Extract figure summaries
print("\nFigure Summaries:")
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.FIGURE:
            print(f"Figure {i} (Page {page.page_number}): {fragment.content.summary}")
Output Example:
Title: State of the Art Research Paper
Authors: John Doe, Jane Smith

Table Summaries:
Table 0 (Page 3): Comparison of model performance metrics across three datasets
Table 1 (Page 5): Hyperparameter configurations used in experiments

Figure Summaries:
Figure 0 (Page 2): Training loss curve showing convergence after 50 epochs
Figure 1 (Page 4): Confusion matrix demonstrating high classification accuracy

Step 3: Semantic Chunking with Chonkie

Use Chonkie’s semantic chunker to create context-preserving chunks:
from chonkie import SemanticChunker

# Initialize semantic chunker
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.5,        # Similarity threshold for detecting topic boundaries
    chunk_size=1024,      # Target chunk size in tokens
    min_sentences=2,      # Minimum sentences per chunk
    mode="window"         # Use sliding window for boundary detection
)

# Chunk the markdown
semantic_chunks = []
for chunk in chunker.chunk(full_markdown):
    if chunk.text.strip():
        semantic_chunks.append({
            "text": chunk.text,
            "token_count": chunk.token_count
        })

print(f"Created {len(semantic_chunks)} semantic chunks")
Output:
Created 17 semantic chunks

Step 4: Review Chunk Quality

Inspect chunks to verify they preserve semantic meaning:
# Example chunk
print(semantic_chunks[7]["text"])
Example Chunk:
## 4. Experimental Results

We evaluated our approach on three benchmark datasets: ImageNet, COCO, and 
Pascal VOC. Table 1 shows the performance metrics across all datasets. Our 
method achieves state-of-the-art results, outperforming previous approaches 
by 3-5% on average.

The key insight is that combining attention mechanisms with residual connections 
enables the model to focus on relevant features while maintaining gradient flow. 
Figure 2 illustrates the attention maps learned by our model, showing clear 
focus on discriminative regions.
Notice how the chunk:
  • Includes complete thoughts from start to finish
  • Keeps tables and figures with their explanations
  • Respects section boundaries
  • Contains enough context for standalone understanding

How Semantic Chunking Works

Traditional chunking uses fixed token limits or recursive splitting, which breaks semantic units arbitrarily. Semantic chunking takes a different approach:
  1. During parsing: Tensorlake extracts the full document with preserved structure
  2. Embedding: Chonkie embeds sentences using a lightweight model (model2vec)
  3. Boundary detection: Compares embeddings in a sliding window to find where topics shift
  4. Threshold-based splitting: When similarity drops below threshold, a new chunk begins
  5. Size constraints: Respects minimum and maximum chunk sizes while honoring boundaries
The key insight: Chunk boundaries align with topic transitions, not arbitrary character counts. This produces embeddings that capture complete ideas.
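To make the boundary-detection step concrete, here is a minimal sketch of threshold-based splitting. It assumes a hypothetical embed() function standing in for the model2vec sentence embedder; the real SemanticChunker also compares embeddings over a sliding window and enforces chunk-size and minimum-sentence constraints:
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_topic_shifts(sentences, embed, threshold=0.5):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append(" ".join(current))  # topic shift detected: close the chunk
            current = []
        current.append(curr)
    chunks.append(" ".join(current))
    return chunks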

Use Cases

Research Paper Analysis

Parse academic papers with complex sections, tables, and figures. Semantic chunks keep methodology descriptions intact and don’t split results from their interpretation.

Technical Documentation

Process API docs, manuals, and specifications where hierarchical structure matters. Chunks respect code examples, parameter descriptions, and related content.

Legal Documents

Handle contracts and legal briefs where clauses must stay together. Semantic boundaries prevent splitting provisions mid-thought.

Financial Reports

Parse earnings reports and regulatory filings with dense tables and analysis sections. Keep financial data with its explanatory context.

Medical Literature

Process clinical studies where methods, results, and conclusions need separate chunks but internal coherence matters.

Best Practices

1. Choose the Right Threshold

Lower thresholds (0.3-0.5) create more chunks with tighter topic focus. Higher thresholds (0.6-0.8) create longer chunks spanning related topics.
# Tight topic focus - more chunks
chunker = SemanticChunker(threshold=0.4)

# Broader context - fewer chunks
chunker = SemanticChunker(threshold=0.7)

2. Balance Chunk Size and Semantics

Set chunk_size to match your embedding model’s context window. Use min_sentences to prevent tiny fragments.
chunker = SemanticChunker(
    chunk_size=512,      # For models like text-embedding-3-small
    min_sentences=3,     # Prevent single-sentence chunks
)

3. Leverage Tensorlake’s Structure

Use Tensorlake’s section detection and table summaries to enrich chunks:
# Add table summaries to chunks
table_summaries = {}
for page in result.pages:
    for frag in page.page_fragments:
        if frag.fragment_type == PageFragmentType.TABLE:
            table_summaries[page.page_number] = frag.content.summary

# Flag chunks that likely reference a table (simple page-marker heuristic)
for chunk in semantic_chunks:
    chunk["metadata"] = {
        "has_table": any(f"Page {p}" in chunk["text"] for p in table_summaries)
    }

4. Validate Chunk Quality

Check that chunks are semantically complete:
def validate_chunk_quality(chunks):
    """Ensure chunks have complete sentences and reasonable length."""
    for i, chunk in enumerate(chunks):
        text = chunk["text"]
        
        # Check for incomplete sentences
        if not text.strip().endswith((".", "!", "?", '"')):
            print(f"Warning: Chunk {i} may be incomplete")
        
        # Check token count range
        if chunk["token_count"] < 50:
            print(f"Warning: Chunk {i} is very short ({chunk['token_count']} tokens)")

validate_chunk_quality(semantic_chunks)

5. Store Chunks with Metadata

Include source information for retrieval:
enriched_chunks = []
for i, chunk in enumerate(semantic_chunks):
    enriched_chunks.append({
        "id": f"chunk_{i}",
        "text": chunk["text"],
        "token_count": chunk["token_count"],
        "metadata": {
            "source": paper_metadata.get("title"),
            "authors": paper_metadata.get("authors"),
            "chunk_index": i
        }
    })
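
The enriched chunks can then be embedded for retrieval. A minimal sketch, assuming the model2vec package (installed via the chonkie[model2vec] extra) and the same potion-base-8M model used by the chunker:
from model2vec import StaticModel

# Load the same lightweight embedding model used during chunking
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Embed each enriched chunk; the vectors can be upserted into any vector store
texts = [chunk["text"] for chunk in enriched_chunks]
embeddings = model.encode(texts)

for chunk, vector in zip(enriched_chunks, embeddings):
    chunk["embedding"] = vector.tolist()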

Complete Example

Try the full working example with research paper analysis:

Semantic Chunking Pipeline Notebook

Complete code walkthrough with quality validation and embedding examples

What’s Next?

  • Use these chunks in a RAG system
  • Learn more about chunking strategies

Need Help?

Join our community to discuss semantic chunking strategies: