Chonkie is a fast, lightweight chunking library that uses embeddings to detect natural topic boundaries. Combined with Tensorlake’s document parsing, it produces chunks that respect the document's semantic structure. This works well for research papers, technical documentation, and other dense content.
Combining Chonkie and Tensorlake reduces the broken context that appears in RAG systems when naive chunking splits thoughts mid-sentence or separates tables from their explanations.
Chunks lose hierarchical meaning, leading to poor retrieval in RAG
Dense technical documents need semantic boundaries, not arbitrary character limits
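To make the problem concrete, here is a minimal sketch of naive fixed-size chunking. The `naive_chunks` helper and the sample text are illustrative assumptions, not part of Chonkie or Tensorlake; the point is only that a fixed character limit cuts wherever the counter lands.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, used only for illustration.
text = (
    "Table 1 reports accuracy on three datasets. "
    "Our method improves on the baseline by 4 points."
)

chunks = naive_chunks(text, 40)
for chunk in chunks:
    # repr() makes the mid-word cut points visible
    print(repr(chunk))
```

The first chunk stops in the middle of the word "datasets", so the reference to Table 1 and the sentence that interprets it end up in different chunks.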
The Solution:
Tensorlake preserves document structure during parsing. Chonkie uses embeddings to detect where topics naturally shift. Together, they produce context-preserving chunks that align with the author’s intent.
Key Benefits:
Semantic boundaries - Chunks align with natural topic transitions, not arbitrary limits
Structure preservation - Respect sections, subsections, and hierarchical organization
Better retrieval - Embeddings capture complete thoughts instead of fragmented text
Production-ready - Handle research papers, contracts, and technical docs with confidence
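The idea of detecting topic shifts with embeddings can be sketched in a few lines. This is not Chonkie's implementation: the `embed` function below is a toy word-count stand-in for a real embedding model, and the `threshold` value is an arbitrary assumption chosen for the sample sentences. It only illustrates the general technique of starting a new chunk when adjacent sentences stop resembling each other.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_on_topic_shift(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when adjacent sentences are less similar than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return chunks


sentences = [
    "The model uses attention layers.",
    "Attention layers help the model focus.",
    "Payment is due within thirty days.",
    "Late payment is subject to a fee.",
]

topic_chunks = split_on_topic_shift(sentences)
for group in topic_chunks:
    print(" ".join(group))
```

With these sample sentences, the two sentences about attention stay together and the two about payment form a second chunk. A real embedding model captures meaning rather than word overlap, so it can find boundaries even when adjacent sentences share no vocabulary.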
Step 2: (Optional) Understand Structured Data and Content
Review metadata, tables, and figures extracted by Tensorlake:
# Extract metadata
paper_metadata = result.structured_data[0].data if result.structured_data else {}
print(f"Title: {paper_metadata.get('title')}")
print(f"Authors: {', '.join(paper_metadata.get('authors', []))}")

# Get full markdown with preserved structure
full_markdown = result.chunks[0].content if result.chunks else ""

# Extract table summaries
print("\nTable Summaries:")
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.TABLE:
            print(f"Table {i} (Page {page.page_number}): {fragment.content.summary}")

# Extract figure summaries
print("\nFigure Summaries:")
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.FIGURE:
            print(f"Figure {i} (Page {page.page_number}): {fragment.content.summary}")
Output Example:
Title: State of the Art Research Paper
Authors: John Doe, Jane Smith

Table Summaries:
Table 0 (Page 3): Comparison of model performance metrics across three datasets
Table 1 (Page 5): Hyperparameter configurations used in experiments

Figure Summaries:
Figure 0 (Page 2): Training loss curve showing convergence after 50 epochs
Figure 1 (Page 4): Confusion matrix demonstrating high classification accuracy
Inspect chunks to verify they preserve semantic meaning:
# Example chunk
print(semantic_chunks[7]["text"])
Example Chunk:
## 4. Experimental Results

We evaluated our approach on three benchmark datasets: ImageNet, COCO, and Pascal VOC. Table 1 shows the performance metrics across all datasets. Our method achieves state-of-the-art results, outperforming previous approaches by 3-5% on average.

The key insight is that combining attention mechanisms with residual connections enables the model to focus on relevant features while maintaining gradient flow. Figure 2 illustrates the attention maps learned by our model, showing clear focus on discriminative regions.
Notice how the chunk:
Includes complete thoughts from start to finish
Keeps tables and figures with their explanations
Respects section boundaries
Contains enough context for standalone understanding
Parse academic papers with complex sections, tables, and figures. Semantic chunks keep methodology descriptions intact and don’t split results from their interpretation.
Process API docs, manuals, and specifications where hierarchical structure matters. Chunks respect code examples, parameter descriptions, and related content.
Use Tensorlake’s section detection and table summaries to enrich chunks:
# Add table summaries to chunks
table_summaries = {}
for page in result.pages:
    for frag in page.page_fragments:
        if frag.fragment_type == PageFragmentType.TABLE:
            table_summaries[page.page_number] = frag.content.summary

# Include summaries when embedding
for chunk in semantic_chunks:
    chunk["metadata"] = {
        "has_table": any(f"Page {p}" in chunk["text"] for p in table_summaries)
    }
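One possible way to carry a table summary into the text that gets embedded is to append it to any chunk that mentions the table by label. The sketch below is an assumption-laden illustration: `build_embedding_text` is a hypothetical helper, and it assumes summaries are keyed by labels such as "Table 1", whereas the snippet above keys them by page number. It uses plain strings and dicts so it runs without Tensorlake or Chonkie.

```python
import re


def build_embedding_text(chunk_text: str, summaries_by_label: dict[str, str]) -> str:
    """Append the summary of every table the chunk mentions by label (hypothetical helper)."""
    mentioned = []
    for label in re.findall(r"Table \d+", chunk_text):
        # Keep each known label once, in order of first mention
        if label in summaries_by_label and label not in mentioned:
            mentioned.append(label)
    if not mentioned:
        return chunk_text
    notes = "\n".join(
        f"[{label} summary] {summaries_by_label[label]}" for label in mentioned
    )
    return f"{chunk_text}\n\n{notes}"


# Assumed mapping from table label to summary, for illustration only.
summaries = {"Table 1": "Comparison of model performance metrics across three datasets"}

enriched = build_embedding_text(
    "Table 1 shows the performance metrics. See Table 1 for details.", summaries
)
unchanged = build_embedding_text("This paragraph mentions no tables.", summaries)
print(enriched)
```

Chunks that do not reference a known table are returned unchanged, so the enrichment only affects the passages that depend on tabular content.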
def validate_chunk_quality(chunks):
    """Ensure chunks have complete sentences and reasonable length."""
    for i, chunk in enumerate(chunks):
        text = chunk["text"]

        # Check for incomplete sentences
        if not text.strip().endswith((".", "!", "?", '"')):
            print(f"Warning: Chunk {i} may be incomplete")

        # Check token count range
        if chunk["token_count"] < 50:
            print(f"Warning: Chunk {i} is very short ({chunk['token_count']} tokens)")

validate_chunk_quality(semantic_chunks)