1. Prerequisites
- Get your Tensorlake API key
- Install the dependencies:
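Something along these lines should work; the package names assume the Tensorlake Python SDK and Chonkie's `semantic` extra, which pulls in the embedding dependencies used later:

```bash
pip install tensorlake "chonkie[semantic]"
```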
2. Import packages and set up the client
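A minimal setup sketch; the `DocumentAI` import path and `api_key` parameter follow Tensorlake's Python SDK, but double-check them against the version you installed:

```python
import os

from tensorlake.documentai import DocumentAI

# Read the API key from the environment rather than hard-coding it
doc_ai = DocumentAI(api_key=os.environ["TENSORLAKE_API_KEY"])
```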
3. Upload the document
For this example, we’ll use a research paper:
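Assuming the paper is saved locally (the filename here is just a placeholder), the upload call returns a file ID that the parse call uses in a later step:

```python
# upload() registers a local file with Tensorlake;
# the returned ID is passed to parse() below
file_id = doc_ai.upload(path="research_paper.pdf")
```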
4. Define the schema
Build a small JSON schema to extract what you need from each research paper.
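Here is a hypothetical schema for a research paper; the field names are illustrative, so adapt them to whatever you actually need to extract:

```python
# JSON Schema describing the structured data we want from each paper
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "abstract": {"type": "string"},
        "key_findings": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "authors"],
}
```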
5. Parse with Tensorlake
Configure parsing and structured extraction.
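A sketch of the parse call; the option class names (`ParsingOptions`, `StructuredExtractionOptions`) and the `wait_for_completion` helper are assumptions based on Tensorlake's SDK docs:

```python
from tensorlake.documentai import ParsingOptions, StructuredExtractionOptions

# Kick off parsing plus structured extraction against our schema
parse_id = doc_ai.parse(
    file_id,
    parsing_options=ParsingOptions(),
    structured_extraction_options=[
        StructuredExtractionOptions(
            schema_name="ResearchPaper",
            json_schema=paper_schema,
        )
    ],
)

# Block until the parse job finishes and fetch the result
result = doc_ai.wait_for_completion(parse_id)
```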
6. Review the Tensorlake output
You’ll receive both structured data and a single chunk for the entire document. Example output:
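To pull those pieces out of the result object, attribute access along these lines should work; `structured_data` and `chunks` are assumed attribute names from the SDK:

```python
# Structured fields extracted with our schema
print(result.structured_data)

# With no page-level chunking configured, the whole document
# comes back as a single chunk of text
document_text = result.chunks[0].content
print(document_text[:500])
```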
7. Semantic Chunking using Chonkie
Tensorlake has given us rich parsed data; now we’ll leverage Chonkie (one of the fastest-growing chunking libraries in AI devtools) to decide how to break the text into chunks for RAG. This ensures our chunks are the right size for embedding models while preserving semantic boundaries.

For research papers and other dense technical documents, we recommend semantic chunking. Unlike recursive chunking, which slices text by token limits, semantic chunking uses embeddings to detect boundaries where topics naturally shift. This produces higher-quality chunks that are easier to retrieve and align closely with the author’s intent.
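Here is a sketch using Chonkie's SemanticChunker on the document text from the previous step; the model name and parameter values are illustrative defaults, so tune them for your corpus:

```python
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # lightweight default embedding model
    threshold=0.5,   # similarity threshold for detecting topic shifts
    chunk_size=512,  # maximum tokens per chunk
)

# document_text is the single chunk returned by Tensorlake in step 6
semantic_chunks = chunker.chunk(document_text)
```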
8. Review Chonkie's Semantic Chunks
Now we can review the chunks and see the quality of Chonkie’s semantic chunking:
- Chunks are semantically meaningful and complete.
- Embeddings formed from these chunks will be of higher quality, since each chunk represents a cohesive unit of meaning.
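A quick, illustrative way to eyeball each chunk's size and opening lines:

```python
for i, chunk in enumerate(semantic_chunks):
    # text and token_count are standard attributes on Chonkie chunks
    print(f"Chunk {i}: {chunk.token_count} tokens")
    print(chunk.text[:120], "...\n")
```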
For example, if we print semantic_chunks[7].text, we would get:

You can now embed those chunks and store them in a vector DB to build a smart RAG pipeline; check out our Qdrant tutorial to learn more. To learn more about different chunking strategies with Chonkie, check out our blog post “Fix Broken Context in RAG with Tensorlake + Chonkie”.