Qdrant is a high-performance vector database built for AI applications. Combined with Tensorlake's document parsing, you get RAG systems with complete document understanding, including the table summaries, figure descriptions, and filterable metadata that traditional parsing misses. This integration is essential for academic research, financial reports, legal documents, and technical documentation where visual content matters.

The Problem:
Traditional parsing loses tables, figures, and reading order
Text-only embeddings miss critical visual content
Generic chunking breaks document structure and context
No way to filter by document metadata before searching
The Solution:
Tensorlake preserves tables, figures, and structure during parsing. Qdrant stores embeddings with rich metadata for filtering. Together, they create RAG systems that understand complete documents.

Key Benefits:
Richer embeddings - Include table summaries and figure descriptions, not just text
Advanced filtering - Search within specific authors, dates, or document types
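To make the filtering benefit concrete, here is a toy sketch in plain Python (no vector database; all point data is illustrative) of narrowing candidates by payload metadata before any vector scoring happens, which is what Qdrant does server-side with a payload filter:

```python
points = [
    {"content": "Problem-solving gains in CS courses", "authors": ["W. Griswold"], "year": 2021},
    {"content": "Survey of curriculum standards", "authors": ["J. Smith"], "year": 2019},
    {"content": "Longitudinal study of CS education", "authors": ["W. Griswold"], "year": 2023},
]

# Step 1: metadata filter runs first, discarding non-matching points outright
candidates = [p for p in points if "W. Griswold" in p["authors"] and p["year"] >= 2020]

# Step 2: only the survivors would then be scored against the query vector
print([p["year"] for p in candidates])  # → [2021, 2023]
```

Because the filter runs before similarity scoring, irrelevant documents can never crowd out relevant ones in the top-k results.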
Transform parsed content into embeddings and store with metadata for filtering:
```python
from uuid import uuid4

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Initialize Qdrant and embedding model
qdrant_client = QdrantClient(":memory:")  # Use QdrantClient(url="...") for production
model = SentenceTransformer("all-MiniLM-L6-v2")
collection_name = "research_papers"

# Create collection
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=model.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE,
    ),
)

# Extract structured metadata from the Tensorlake parse result
structured_metadata = result.structured_data[0].data if result.structured_data else {}

# Create embeddings for text chunks
points = []
for chunk in result.chunks:
    embedding = model.encode(chunk.content).tolist()
    payload = {**structured_metadata, "content": chunk.content, "type": "text"}
    points.append(models.PointStruct(id=str(uuid4()), vector=embedding, payload=payload))

# Create embeddings for table and figure summaries
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type in ("table", "figure") and fragment.content.summary:
            embedding = model.encode(fragment.content.summary).tolist()
            payload = {
                **structured_metadata,
                "content": fragment.content.summary,
                "type": fragment.fragment_type,
                "page": page.page_number,
            }
            points.append(models.PointStruct(id=str(uuid4()), vector=embedding, payload=payload))

# Upload to Qdrant
qdrant_client.upsert(collection_name=collection_name, points=points)

# Create index for filtering by author
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="authors",
    field_schema="keyword",
)

print(f"Uploaded {len(points)} embeddings to Qdrant")
```
Key Insight: Table and figure summaries get their own embeddings. When users search for data, they’ll retrieve both text explanations and visual content summaries.
Combine semantic search with metadata filtering for precise results:
```python
# Query with author filter
query = "Does computer science education improve problem solving skills?"
query_embedding = model.encode(query).tolist()

results = qdrant_client.query_points(
    collection_name=collection_name,
    query=query_embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="authors",
                match=models.MatchValue(value="William G. Griswold"),
            )
        ]
    ),
    limit=5,
).points

# Display results
for point in results:
    print(f"Title: {point.payload.get('title', 'Unknown')}")
    print(f"Authors: {point.payload.get('authors', 'Unknown')}")
    print(f"Score: {point.score:.4f}")
    print(f"Content: {point.payload.get('content')[:200]}...")
    print("-" * 80)
```
Output Example:
```
Title: Teaching Problem Solving Through CS Education
Authors: ['William G. Griswold', 'Jane Smith']
Score: 0.8752
Content: Our study demonstrates that computer science courses significantly improve students' problem-solving abilities across multiple domains...
--------------------------------------------------------------------------------
```
Traditional RAG only embeds text chunks; tables and figures are ignored or poorly represented. This integration changes the data flow:
During parsing: Tensorlake extracts tables and generates summaries like “Comparison of model accuracy across three datasets showing 5-10% improvement”
During embedding: Both text chunks and table/figure summaries get vectorized separately
During storage: Metadata (authors, year, conference) is stored as filterable payload fields
During retrieval: Queries match against text AND visual content summaries
During response: Results include both explanatory text and data from tables/figures
The key insight: Visual content becomes searchable through AI-generated summaries, dramatically improving RAG accuracy for documents with tables and figures.
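This effect can be demonstrated with a toy sketch in plain Python (no real embedding model; a bag-of-words count stands in for the vector, and the two stored points are made up for illustration). A query about tabular data matches the table's AI-generated summary rather than the prose chunk:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy "embedding": word counts (stand-in for a real sentence-embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Points as they would be stored: a text chunk AND a table summary
points = [
    {"content": "We discuss related work on curriculum design", "type": "text"},
    {"content": "Comparison of model accuracy across three datasets showing improvement", "type": "table"},
]

query = embed("which datasets show accuracy improvement")
best = max(points, key=lambda p: cosine(query, embed(p["content"])))
print(best["type"])  # → table
```

Without the summary embedding, this query would have nothing to match: the table's numbers never appear in any text chunk.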
Parse earnings reports, balance sheets, and regulatory filings. Filter by company, quarter, and fiscal year while searching across narrative and tabular data.
Parse clinical studies with methods, results, and patient data tables. Filter by study type, date, or authors while retrieving complete experimental context.
For tables spanning multiple pages, create summaries with proper context:
```python
enrichment_options = EnrichmentOptions(
    table_summarization_prompt="""Summarize this table's data including:
    1. What the table measures
    2. Key findings or patterns
    3. How it relates to the paper's main argument"""
)
```
Spot-check that table and figure summaries are meaningful:
```python
# Review summaries before embedding
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "table":
            print(f"Table summary: {fragment.content.summary}")
            # Ensure it's descriptive, not just "A table of data"
```
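Beyond eyeballing, a lightweight automated check can flag summaries that are too short or generic before they pollute the index. This is a sketch: the word threshold and the list of generic phrases are arbitrary assumptions you would tune for your corpus.

```python
# Illustrative phrases a useless summary might contain; extend for your domain
GENERIC_PHRASES = ("a table of data", "a figure", "table showing data")

def looks_generic(summary, min_words=8):
    # Flag summaries that are too short or match a known generic phrase
    s = (summary or "").strip().lower()
    return len(s.split()) < min_words or any(p in s for p in GENERIC_PHRASES)

print(looks_generic("A table of data"))  # → True
print(looks_generic("Comparison of model accuracy across three benchmark datasets, showing a 5-10% gain"))  # → False
```

Summaries that fail the check can be regenerated with a more specific `table_summarization_prompt` before embedding.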