Why Use Tensorlake + Chonkie?
The Problem:
- Fixed-size chunking breaks sentences mid-thought and splits tables from context
- Token-based splitting ignores document structure (sections, subsections, figures)
- Chunks lose hierarchical meaning, leading to poor retrieval in RAG
- Dense technical documents need semantic boundaries, not arbitrary character limits
The Solution:
- Semantic boundaries - Chunks align with natural topic transitions, not arbitrary limits
- Structure preservation - Respect sections, subsections, and hierarchical organization
- Better retrieval - Embeddings capture complete thoughts instead of fragmented text
- Production-ready - Handle research papers, contracts, and technical docs with confidence
Installation
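Install both SDKs. Chonkie's semantic extra pulls in the embedding dependencies; the extras name follows Chonkie's README, so verify it against your installed version:

```bash
pip install tensorlake "chonkie[semantic]"
```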
Quick Start
Step 1: Parse Documents with Tensorlake
Tensorlake extracts structured data, tables, and figures while preserving reading order:
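A minimal sketch, assuming a `TENSORLAKE_API_KEY` environment variable and a hypothetical `research_paper.pdf`; the client and method names follow Tensorlake's published SDK examples, so check them against the current docs:

```python
import os

from tensorlake.documentai import DocumentAI

# Client and method names below follow Tensorlake's SDK examples
# (assumptions; verify against your installed version).
doc_ai = DocumentAI(api_key=os.environ["TENSORLAKE_API_KEY"])

# Upload the source document and start a parse job.
file_id = doc_ai.upload(path="research_paper.pdf")  # hypothetical file
parse_id = doc_ai.parse(file_id)

# Block until parsing finishes, then collect the structured result.
result = doc_ai.wait_for_completion(parse_id)
```

Step 2: (Optional) Understand Structured Data and Content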
Review metadata, tables, and figures extracted by Tensorlake:
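Continuing the sketch above; the attribute names on the parse result are assumptions, so print `result` or consult the SDK schema before relying on them:

```python
# Walk the page-level layout Tensorlake extracted. Attribute names
# (pages, page_fragments, fragment_type) are assumptions.
for page in result.pages:
    print(f"Page {page.page_number}:")
    for fragment in page.page_fragments:
        # fragment_type distinguishes text, section headers, tables, figures
        print(f"  - {fragment.fragment_type}")
```

Step 3: Semantic Chunking with Chonkie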
Use Chonkie's semantic chunker to create context-preserving chunks:
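A sketch of the chunking step; the `SemanticChunker` parameters follow Chonkie's documented API, while the reassembly of `document_text` reuses the assumed Tensorlake schema from Step 2:

```python
from chonkie import SemanticChunker

# Reassemble the parsed text in reading order (attribute names on the
# Tensorlake result are assumptions; adjust to the actual schema).
document_text = "\n\n".join(
    str(fragment.content)
    for page in result.pages
    for fragment in page.page_fragments
)

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # lightweight model2vec model
    threshold=0.5,     # similarity below this starts a new chunk
    chunk_size=512,    # maximum tokens per chunk
    min_sentences=2,   # avoid one-sentence fragments
)

chunks = chunker.chunk(document_text)
print(f"Created {len(chunks)} semantic chunks")
```

Step 4: Review Chunk Quality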
Inspect chunks to verify they preserve semantic meaning:
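For example, spot-check the first few chunks and their sizes (Chonkie chunk objects expose `text` and `token_count`):

```python
# Print a preview of each chunk to eyeball boundary quality.
for i, chunk in enumerate(chunks[:5]):
    print(f"--- Chunk {i} ({chunk.token_count} tokens) ---")
    print(chunk.text[:300])
    print()
```

A well-formed chunk:
- Includes complete thoughts from start to finish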
- Keeps tables and figures with their explanations
- Respects section boundaries
- Contains enough context for standalone understanding
How Semantic Chunking Works
Traditional chunking uses fixed token limits or recursive splitting, which breaks semantic units arbitrarily. Semantic chunking changes the approach (a simplified sketch follows this list):
- During parsing: Tensorlake extracts the full document with preserved structure
- Embedding: Chonkie embeds sentences using a lightweight model (model2vec)
- Boundary detection: Compares embeddings in a sliding window to find where topics shift
- Threshold-based splitting: When similarity drops below threshold, a new chunk begins
- Size constraints: Respects minimum and maximum chunk sizes while honoring boundaries
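A simplified illustration of steps 2-4, not Chonkie's actual implementation: embed sentences with model2vec and start a new chunk wherever the cosine similarity of adjacent sentences drops below the threshold.

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")

def naive_semantic_split(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Group sentences into chunks, splitting where adjacent similarity drops."""
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for prev, nxt, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        similarity = float(np.dot(prev, nxt) / (np.linalg.norm(prev) * np.linalg.norm(nxt)))
        if similarity < threshold:  # topic shift detected
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```

Chonkie layers a sliding comparison window, minimum sentence counts, and chunk-size limits on top of this basic idea.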
Use Cases
Research Paper Analysis
Parse academic papers with complex sections, tables, and figures. Semantic chunks keep methodology descriptions intact and don't split results from their interpretation.

Technical Documentation
Process API docs, manuals, and specifications where hierarchical structure matters. Chunks respect code examples, parameter descriptions, and related content.

Legal Document Processing
Handle contracts and legal briefs where clauses must stay together. Semantic boundaries prevent splitting provisions mid-thought.

Financial Reports
Parse earnings reports and regulatory filings with dense tables and analysis sections. Keep financial data with its explanatory context.

Medical Literature
Process clinical studies where methods, results, and conclusions need separate chunks but internal coherence matters.

Best Practices
1. Choose the Right Threshold
Lower thresholds (0.3-0.5) create more chunks with tighter topic focus. Higher thresholds (0.6-0.8) create longer chunks spanning related topics.
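For example (illustrative values, using the same `SemanticChunker` as in Step 3):

```python
# More, tighter chunks: splits at smaller topic shifts.
focused_chunker = SemanticChunker(threshold=0.4)

# Fewer, longer chunks: only strong topic shifts start a new chunk.
broad_chunker = SemanticChunker(threshold=0.7)
```

2. Balance Chunk Size and Semantics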
Set `chunk_size` to match your embedding model's context window. Use `min_sentences` to prevent tiny fragments.
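For instance, for a 512-token embedding model (values are illustrative):

```python
# chunk_size bounds each chunk to the embedding model's context window;
# min_sentences suppresses one-sentence fragments.
chunker = SemanticChunker(chunk_size=512, min_sentences=2)
```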
3. Leverage Tensorlake’s Structure
Use Tensorlake's section detection and table summaries to enrich chunks:
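One hypothetical sketch of the idea: map each chunk's character offset back to the section titles Tensorlake detected, then prefix the chunk with its section. The `section_offsets` data is made up for illustration; in practice you would derive it from the parse result.

```python
from bisect import bisect_right

# Hypothetical (offset, title) pairs derived from Tensorlake's section
# detection, sorted by character offset in the document text.
section_offsets = [(0, "Abstract"), (1450, "Methods"), (5230, "Results")]

def section_for(offset: int) -> str:
    """Return the title of the section containing a character offset."""
    starts = [start for start, _ in section_offsets]
    return section_offsets[bisect_right(starts, offset) - 1][1]

# Prefix each chunk with its section so embeddings carry the hierarchy.
enriched = [f"[Section: {section_for(c.start_index)}]\n{c.text}" for c in chunks]
```

4. Validate Chunk Quality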
Check that chunks are semantically complete:
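One possible heuristic (illustrative, not part of either library):

```python
def looks_complete(chunk) -> bool:
    """Heuristic: a complete chunk ends at sentence punctuation and
    carries enough tokens to stand alone."""
    text = chunk.text.strip()
    return text.endswith((".", "!", "?")) and chunk.token_count >= 50

suspects = [i for i, c in enumerate(chunks) if not looks_complete(c)]
print(f"{len(suspects)} chunks may be incomplete: {suspects[:10]}")
```

5. Store Chunks with Metadata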
Include source information for retrieval:
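For example (an illustrative record layout; adapt the fields to your vector store's schema):

```python
# research_paper.pdf is the hypothetical source file from Step 1.
records = [
    {
        "text": chunk.text,
        "source": "research_paper.pdf",
        "chunk_index": i,
        "token_count": chunk.token_count,
        "char_span": (chunk.start_index, chunk.end_index),
    }
    for i, chunk in enumerate(chunks)
]
```

Complete Example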
Try the full working example with research paper analysis:

Semantic Chunking Pipeline Notebook
Complete code walkthrough with quality validation and embedding examples
What’s Next?
Use these chunks in a RAG system:
- Qdrant Integration - Store semantic chunks with metadata
- ChromaDB Integration - Add citation tracking to chunks
- Blog: Fix Broken Context in RAG - Deep dive into semantic chunking