1. Prerequisites
- Get your Tensorlake API key
- Install the dependencies:
pip install tensorlake "chonkie[model2vec]"
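If you prefer not to hard-code the API key, export it as an environment variable and read it at runtime. A minimal sketch, assuming you exported the key as TENSORLAKE_API_KEY:

```python
import os

# Read the Tensorlake API key from the environment (e.g. after `export TENSORLAKE_API_KEY=...`)
your_tensorlake_api_key = os.environ.get("TENSORLAKE_API_KEY")
if not your_tensorlake_api_key:
    raise RuntimeError("TENSORLAKE_API_KEY is not set")
```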
2. Import packages and set up the client
import json

from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    TableParsingFormat,
    TableOutputMode,
    PageFragmentType
)

# Initialize the client
doc_ai = DocumentAI(api_key=your_tensorlake_api_key)  # or set the TENSORLAKE_API_KEY environment variable
3. Upload the document
For this example, we’ll use a research paper:
file_path = "https://tlake.link/docs/sota-research-paper"
4. Define the schema
Build a small JSON schema to extract what you need from each research paper.
research_paper_schema = {
    "title": "ResearchPaper",
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {"type": "array", "items": {"type": "string"}, "description": "Author names"},
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {"type": "array", "items": {"type": "string"}, "description": "Key terms"},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string", "description": "Section heading"},
                    "level": {"type": "integer", "description": "Heading level (1-6)"}
                }
            }
        }
    }
}
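Before sending the schema to Tensorlake, it can be useful to sanity-check it against a hand-written sample record. A minimal sketch using the third-party jsonschema package (an extra dependency not listed in the prerequisites; install it with pip install jsonschema):

```python
from jsonschema import ValidationError, validate

# A toy record shaped like the structured output we expect back (illustrative values only)
sample_paper = {
    "title": "Example Paper",
    "authors": ["A. Author", "B. Author"],
    "abstract": "A short abstract.",
    "keywords": ["example"],
    "sections": [{"heading": "1 Introduction", "level": 2}],
}

try:
    validate(instance=sample_paper, schema=research_paper_schema)
    print("Schema accepts the sample record")
except ValidationError as err:
    print(f"Schema problem: {err.message}")
```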
5. Parse with Tensorlake
Configure parsing and structured extraction.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.NONE,  # Parse the document as markdown and let Chonkie handle the chunking strategy
    cross_page_header_detection=True
)

structured_extraction = StructuredExtractionOptions(
    schema_name="Research Paper Analysis",
    json_schema=research_paper_schema
)

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize this figure/chart in the context of this research paper, focusing on key findings and data insights.",
    table_summarization=True,
    table_summarization_prompt="Summarize this table's data and its significance to the research findings."
)

parse_id = doc_ai.parse(
    file_path,
    parsing_options=parsing_options,
    structured_extraction_options=[structured_extraction],
    enrichment_options=enrichment_options
)
print(f"Parse job submitted with ID: {parse_id}")

result = doc_ai.wait_for_completion(parse_id)
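Note that structured_extraction_options accepts a list, so a single parse job can extract more than one schema. A hypothetical sketch that adds a second, references-only schema alongside the one above (references_schema and its fields are illustrative, not part of this tutorial):

```python
# Hypothetical second schema: pull out just the cited works
references_schema = {
    "title": "References",
    "type": "object",
    "properties": {
        "references": {"type": "array", "items": {"type": "string"}, "description": "Cited works"}
    }
}

references_extraction = StructuredExtractionOptions(
    schema_name="Reference List",
    json_schema=references_schema
)

# Both schemas are extracted in the same parse job; wait for completion as shown above
parse_id = doc_ai.parse(
    file_path,
    parsing_options=parsing_options,
    structured_extraction_options=[structured_extraction, references_extraction],
    enrichment_options=enrichment_options
)
```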
6. Review the Tensorlake output
You’ll receive both structured data and a single chunk covering the entire document:
paper_metadata = result.structured_data[0].data if result.structured_data else {}
print(paper_metadata)

# With ChunkingStrategy.NONE, the parse returns the full document markdown as a single chunk
full_markdown = "\n\n".join(chunk.content for chunk in result.chunks)
print(full_markdown)
print(f"\nTensorlake returned {len(result.chunks)} markdown chunk(s) for the research paper")

# Extract the tables and print out their summaries
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.TABLE:
            print(f"Table {i} in page {page.page_number}: {fragment.content.summary}\n")
            print("--------------\n")

# Extract the figures and print out their summaries
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.FIGURE:
            print(f"figure {i} in page {page.page_number}: {fragment.content.summary}\n")
Example output:
- Structured Data:
{
'abstract': 'A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have numerous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. Meanwhile, we utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No.3 position on the MTEB leaderboard (as of December 24, 2024), achieving average 71.54 score across 56 datasets. We have released the model and data on the Hugging Face Hub, and the training codes are available in this project repository.',
'authors': ['Dun Zhang', 'Jiacheng Li', 'Ziyang Zeng', 'Fulong Wang'],
'keywords': ['Dense Retrieval',
'Embedding Models',
'Knowledge Distillation',
'Matryoshka Representation Learning',
'Multi-Modal Learning'],
'sections': [{'heading': 'Abstract', 'level': 2},
{'heading': '1 Introduction', 'level': 2},
{'heading': '2 Methods', 'level': 2},
{'heading': '2.1 Definitions', 'level': 3},
{'heading': '2.2 Model Architecture', 'level': 3},
{'heading': '2.3 Stage 1&2: Distillation from Multiple Teachers', 'level': 3},
{'heading': '2.4 Stage 3: Dimension Reduction', 'level': 3},
{'heading': '2.5 Stage 4: Unlock Multimodal Potential', 'level': 3},
{'heading': '3 Experiments', 'level': 2},
{'heading': '3.1 Implementation details', 'level': 3},
{'heading': '3.2 Datasets', 'level': 3},
{'heading': '3.3 Results', 'level': 3},
{'heading': '4 Discussion', 'level': 2},
{'heading': '4.1 Instruction Robustness', 'level': 3},
{'heading': '4.2 Possible Improvements for Vision Encoding', 'level': 3},
{'heading': '5 Conclusion', 'level': 2},
{'heading': 'References', 'level': 2}],
'title': 'Jasper and Stella: distillation of SOTA embedding models'
}
- Chunks:
**Page 3: Figure 1 — Jasper Model Architecture**
The Jasper model combines text and image encoders into a unified embedding space.
- Image input → Siglip Vision Encoder → AvgPool2d
- Text input → Stella Embedding → Encoder
- Outputs: multiple FC layers (12288, 1024, 512, 256 dims) for distillation
- Uses cosine, similarity, and relative similarity loss
---
**Page 5: Table 1 — Embedding Model Comparison**
Shows performance across tasks (classification, clustering, retrieval, STS, summarization).
Highlights:
- NV-Embed-v2 (7851M) leads with avg. score 72.31
- Jasper (2B params) achieves 71.54 — competitive with 7B+ models
- Demonstrates smaller models can rival larger ones with efficient distillation
- Jasper excels in summarization and STS tasks
---
**Page 6: Table 2 & 3 — Instruction Variations**
Evaluates robustness to prompt changes:
- “Classify sentiment” vs. “Determine sentiment” → nearly identical scores
- Average score improves from 0.686 → 0.687
- Confirms Jasper generalizes across different instruction phrasings
- Tables:
Table 1 in page 5: This table presents the performance of various embedding models across different NLP tasks, including Classification, Clustering, PairClassification, Reranking, Retrieval, STS (Semantic Textual Similarity), and Summarization. The "Average (56 datasets)" column provides an overall performance score across a broader set of datasets.
**Summary of the Data:**
* **Model Performance:** The models are ranked by their average performance.
* **NV-Embed-v2** leads with an average score of 72.31, showing strong performance across most tasks, particularly Classification (90.37) and STS (84.31).
* **Jasper (our model)**, despite being a smaller model (1543M + 400M), achieves a highly competitive average score of 71.54, placing it second overall. It performs very well in Classification (88.49), PairClassification (88.07), and STS (84.67), and notably excels in Summarization (31.42), the highest among all models.
* **Bge-en-icl** and **Stella_en_1.5B_v5** also show strong overall performance (71.67 and 71.19 respectively), with Stella being a much smaller model.
* **SFR-Embedding-2_R** performs reasonably well but slightly lower than the top models.
* **gte-Qwen2-1.5B-instruct** and **voyage-lite-02-instruct** are the lowest performing models in this comparison, particularly in Clustering and Retrieval.
* **Model Size vs. Performance:**
* Larger models like NV-Embed-v2 (7851M) and Bge-en-icl (7111M) generally perform well.
* However, **Jasper (our model)** and **Stella_en_1.5B_v5** demonstrate that competitive or even superior performance can be achieved with significantly smaller model sizes (around 1.5B parameters) compared to the 7B+ parameter models.
**Significance to the Research Findings:**
The table's data is central to demonstrating the effectiveness and efficiency of the "Jasper" model, which is the focus of this research.
1. **Competitive Performance:** The data clearly shows that Jasper achieves an average score of 71.54, placing it second overall among the evaluated models. This is a significant finding as it indicates that Jasper is a highly competitive embedding model, performing on par with or even outperforming many established models.
2. **Efficiency and Resource Optimization:** A key significance is Jasper's performance relative to its size. At 1543M + 400M parameters, Jasper is considerably smaller than the top-performing NV-Embed-v2 (7851M) and Bge-en-icl (7111M). The fact that Jasper can achieve such high performance with a much smaller footprint (approximately 1/5th the size of the largest models) is a crucial finding. This suggests that Jasper is more resource-efficient, potentially leading to faster inference times, lower computational costs, and easier deployment in resource-constrained environments.
3. **Task-Specific Strengths:** The table highlights Jasper's particular strengths in Summarization (highest score) and strong performance in Classification, PairClassification, and STS. This indicates that Jasper is a versatile model, capable of handling a variety of NLP tasks effectively.
4. **Validation of Methodology:** The strong performance of Jasper, especially given its smaller size, validates the research's approach (likely distillation-based training as mentioned in the conclusion) for developing high-quality embedding models without necessarily relying on massive parameter counts.
In summary, the table's data underscores that Jasper is a state-of-the-art embedding model that offers a compelling balance of high performance across diverse NLP tasks and remarkable efficiency due to its smaller model size, which is a key contribution of this research.
--------------
Table 1 in page 6: This table, "Table 2: Original instructions and corresponding synonyms," presents a two-column list. The left column, "Original Instruction," contains a series of natural language instructions for various tasks, such as classifying sentiment, retrieving documents, or identifying main categories. The right column, "Synonym of Original Instruction," provides a rephrased or synonymous version of each instruction from the left column.
**Summary of the Table's Data:**
The table essentially showcases pairs of instructions where the "Synonym" column offers an alternative phrasing for the "Original Instruction." For example:
* "Classify the sentiment expressed in the given movie review text from the IMDB dataset" is rephrased as "Determine the sentiment conveyed in the provided movie review text from the IMDB dataset."
* "Retrieve duplicate questions from StackOverflow forum" becomes "Find duplicate questions on the StackOverflow forum."
* "Given a financial question, retrieve user replies that best answer the question" is rephrased as "Given a financial question, find user replies that best answer it."
The rephrasing often involves substituting verbs (e.g., "classify" to "determine," "retrieve" to "find"), simplifying sentence structure, or using slightly different but semantically equivalent terms.
**Significance to the Research Findings:**
This table is crucial for understanding the methodology and the robustness of the research, particularly in the context of the "MTEB Results on different instructions" table (Table 3) and the surrounding text.
1. **Evaluating Model Robustness to Instruction Variation:** The core significance is that the research likely aims to evaluate how well a model (specifically, the designed three-level embedding model mentioned in the text) performs when given different phrasings of the same task instruction. By providing both "Original Instructions" and their "Synonyms," the researchers can test if the model's performance is consistent regardless of minor variations in the prompt. This is a critical aspect of model robustness and generalizability in natural language processing.
2. **Demonstrating the Need for Flexible Instruction Understanding:** If a model's performance drops significantly when given a synonymous instruction compared to the original, it indicates a lack of true understanding or an over-reliance on specific keywords or sentence structures. This table provides the concrete examples of the instruction variations used to probe this aspect.
3. **Context for MTEB Results (Table 3):** The "MTEB Results on different instructions" table (Table 3) directly uses these instructions. The "Original Score" in Table 3 likely corresponds to the model's performance using the "Original Instruction," while the "Score with MTEB Instruction" might correspond to performance using the "Synonym of Original Instruction" or another standardized MTEB-specific phrasing derived from these. The comparison of these scores (e.g., 0.697 vs. 0.687 for "Average Score") directly reflects how the model handles these instruction variations. The fact that the scores are generally close suggests good robustness.
In essence, Table 2 lays the groundwork for demonstrating that the research's proposed model is not just good at following one specific command, but can generalize its understanding to various ways of expressing the same command, which is a highly desirable trait for practical NLP applications.
--------------
Table 3 in page 6: The table "MTEB Results on different instructions" compares the performance of various NLP tasks (Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization) under two conditions: "Original Score" (using original instructions) and "Score with Modified Instructions" (using instructions modified for better performance, as detailed in Table 2).
**Summary of the Data:**
* **Overall Improvement:** The "Average Score" across all tasks shows a notable improvement from 0.686 (Original Score) to 0.687 (Score with Modified Instructions). While seemingly small, this indicates a general positive impact of instruction modification.
* **Task-Specific Variations:**
* **Classification:** Most classification tasks show slight improvements or remain stable. For instance, "MTOPDomainClassification" goes from 0.992 to 0.992, while "AmazonCounterfactualClassification" improves from 0.958 to 0.959. "TweetSentimentExtractionClassification" sees a more significant jump from 0.773 to 0.776.
* **Clustering:** Similar to classification, clustering tasks generally show minor improvements. "StackExchangeClusteringP2P" improves from 0.494 to 0.495, and "TwentyNewsgroupsClustering" from 0.630 to 0.630.
* **Pair Classification:** Both "SprintDuplicateQuestions" and "TwitterSemEval2015" show improvements.
* **Reranking:** "AskUbuntuDupQuestions" improves from 0.674 to 0.676.
* **Retrieval:** Many retrieval tasks show improvements, some more substantial than others. For example, "CQADupstackMathematicsRetrieval" improves from 0.369 to 0.370, and "TRECOVID" from 0.865 to 0.866. Some tasks like "CQADupstackEnglishRetrieval" show no change.
* **STS (Semantic Textual Similarity):** Most STS tasks show improvements, with "STS12" going from 0.897 to 0.898 and "STSBenchmark" from 0.888 to 0.886 (a slight decrease here).
* **Summarization:** "SummEval" shows a slight improvement from 0.313 to 0.314.
**Significance to the Research Findings:**
The table's data is crucial as it directly supports the research's central hypothesis: **that carefully crafted instructions can significantly impact the performance of large language models (LLMs) on various downstream NLP tasks.**
* **Validation of Instruction Engineering:** The consistent, albeit sometimes small, improvements across a wide range of tasks demonstrate the effectiveness of "instruction engineering" or "prompt engineering." This highlights that how a task is presented to an LLM (via its instructions) is not trivial and can indeed unlock better performance.
* **Implications for Model Deployment and Fine-tuning:** The findings suggest that instead of solely relying on model architecture improvements or extensive fine-tuning, optimizing the input instructions can be a cost-effective and efficient way to boost performance. This is particularly relevant for deploying LLMs in real-world applications where fine-tuning might be resource-intensive.
* **Understanding LLM Sensitivity:** The varying degrees of improvement across tasks indicate that LLMs are differentially sensitive to instruction modifications depending on the task type and specific dataset. This provides insights into the internal workings and sensitivities of these models.
* **Contribution to MTEB (Massive Text Embedding Benchmark):** By showing that instruction modifications can alter benchmark scores, the research contributes to a deeper understanding of how to properly evaluate and compare text embeddings and LLM capabilities. It suggests that standardized instructions are critical for fair comparisons.
* **Future Research Direction:** The results encourage further research into automated or more sophisticated methods for generating optimal instructions, moving beyond manual trial-and-error. The slight decrease in "STSBenchmark" also points to the complexity that not all modifications are universally beneficial and require careful tuning.
In essence, the table provides empirical evidence that **"how you ask" can be as important as "what you ask"** when interacting with large language models, underscoring the growing importance of prompt engineering in the field of natural language processing.
--------------
- Figures:
figure 1 in page 3: This figure illustrates the architecture of the "Jasper" model, designed for multi-modal (image and text) representation learning, specifically in the context of "distillation from multiple teachers."
**Key Findings and Data Insights from the Figure and Context:**
1. **Multi-Modal Input Processing:** The model accepts both image and text inputs.
* **Image Path:** Images are first processed by a "Siglip Vision Encoder" followed by an "AvgPool2d" layer to extract visual features.
* **Text Path:** Text inputs are processed by a "Stella Input Embedding" layer to generate textual features.
2. **Unified Encoder and Representation:** Both visual and textual features are fed into a "Stella Encoder," suggesting a mechanism to learn a unified, multi-modal representation.
3. **Multi-Output Architecture for Teacher Distillation:**
* After the Stella Encoder and "Mean Polling," the model branches into **four distinct Fully Connected (FC) layers (FC1, FC2, FC3, FC4)**, producing output vectors of varying dimensions: 12288, 1024, 512, and 256.
* **Crucial Insight (from text):** The 12288-dimensional output (from FC1) is specifically designed to align with the combined vector dimensions of **two primary teacher models** (M_Ented_en_2^6 and Stella_en_1.5B_v5^6), which have dimensions of 4096 and 8192 respectively (4096 + 8192 = 12288). This highlights a core mechanism for the student model to learn from the combined knowledge of these teachers.
* The presence of multiple output heads (FC1-FC4) suggests the model can learn representations at different granularities or for alignment with various teacher models or tasks, although the text primarily focuses on the 12288-dim output for the initial distillation stages.
4. **Objective of Distillation:** The primary goal of this multi-output structure is to enable the "student model" (Jasper) to effectively learn robust text representations by aligning its output vectors with the corresponding vectors from multiple teacher models. This alignment is achieved through a combination of three carefully designed loss functions (cosine loss, similarity loss, and relative similarity distillation loss), which are applied to these output vectors to minimize the difference between student and teacher representations.
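The sections list in the structured data above is immediately useful on its own. For example, you can print a quick table of contents straight from paper_metadata:

```python
# Print an indented table of contents from the extracted section headings
for section in paper_metadata.get("sections", []):
    indent = "  " * (section["level"] - 1)
    print(f"{indent}{section['heading']}")
```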
7. Semantic Chunking using Chonkie
Now that we have Tensorlake’s rich parsed output, we can leverage Chonkie (one of the fastest-growing chunking libraries in the AI devtools space) to decide how to break the text into chunks for RAG. This ensures our chunks are the right size for embedding models while preserving semantic boundaries.
For research papers and other dense technical documents, we recommend semantic chunking. Unlike recursive chunking, which slices text by token limits, semantic chunking uses embeddings to detect boundaries where topics naturally shift. This produces higher-quality chunks that are easier to retrieve and align closely with the author’s intent.
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.5,
    chunk_size=1024,
    min_sentences=2,
    mode="window"
)

# Chunk the full document markdown that Tensorlake produced in the previous step
semantic_chunks = []
for ch in chunker.chunk(full_markdown):
    if ch.text.strip():
        semantic_chunks.append({
            "text": ch.text,
            "token_count": ch.token_count
        })

print(f"SemanticChunker produced {len(semantic_chunks)} chunks")
SemanticChunker produced 17 chunks
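Before moving on, it is worth checking that the chunks fit comfortably within your embedding model’s context window. A minimal sketch that only inspects the semantic_chunks list built above:

```python
# Quick look at the chunk size distribution
token_counts = [c["token_count"] for c in semantic_chunks]
print(f"chunks: {len(token_counts)}")
print(f"min/avg/max tokens: {min(token_counts)} / {sum(token_counts) / len(token_counts):.0f} / {max(token_counts)}")
```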
8. Review Chonkie’s Semantic Chunks
Now we can review the chunks and see the quality of Chonkie’s semantic chunking. Printing semantic_chunks[7]["text"], for example, we would get one complete, self-contained passage of the paper, which shows that:
- Chunks are semantically meaningful and complete.
- Embeddings formed from these chunks will be of higher quality, since each chunk represents a cohesive unit of meaning.
You can now embed those chunks and store them in a vector DB to build a smart RAG pipeline (see the sketch below); check out our Qdrant tutorial to learn more.
To learn more about the different chunking strategies in Chonkie, check out our blog post “Fix Broken Context in RAG with Tensorlake + Chonkie”.
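As a next step toward that RAG pipeline, here is a hedged sketch of turning the chunks into vector-DB-ready records. It assumes the model2vec package (pulled in by the chonkie[model2vec] extra) exposes StaticModel.from_pretrained and encode as shown; swap in whichever embedding model and vector database client you actually use:

```python
from model2vec import StaticModel

# Embed each chunk with the same small static embedding model used for semantic chunking (assumption)
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
embeddings = model.encode([c["text"] for c in semantic_chunks])

# Plain dict records (id, vector, payload), ready to upsert into the vector DB of your choice
records = [
    {
        "id": i,
        "vector": embedding.tolist(),
        "payload": {"text": chunk["text"], "token_count": chunk["token_count"]},
    }
    for i, (chunk, embedding) in enumerate(zip(semantic_chunks, embeddings))
]
print(f"Prepared {len(records)} records for upsert")
```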