Qdrant is a high-performance vector database built for AI applications. Combined with Tensorlake’s document parsing, you get RAG systems with complete document understanding: table summaries, figure descriptions, and structured metadata for filtering, all of which traditional text-only parsing misses. This integration is especially valuable for academic research, financial reports, legal documents, and technical documentation, where visual content carries critical information.
Run this end-to-end in Colab:

Why Use Tensorlake + Qdrant?

The Problem:
  • Traditional parsing loses tables, figures, and reading order
  • Text-only embeddings miss critical visual content
  • Generic chunking breaks document structure and context
  • No way to filter by document metadata before searching
The Solution: Tensorlake preserves tables, figures, and structure during parsing. Qdrant stores embeddings with rich metadata for filtering. Together, they create RAG systems that understand complete documents.
Key Benefits:
  • Richer embeddings - Include table summaries and figure descriptions, not just text
  • Advanced filtering - Search within specific authors, dates, or document types
  • Better chunking - Semantic sections (abstract, methods, results) instead of arbitrary splits
  • Accurate results - Complete context leads to more relevant retrieval

Installation

pip install tensorlake qdrant-client sentence-transformers openai

Quick Start

Step 1: Parse Documents with Tensorlake

Configure Tensorlake to extract structured data, tables, and figures in one API call:
from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    TableParsingFormat,
    TableOutputMode
)

# Initialize client
doc_ai = DocumentAI()

# Define schema for structured extraction
json_schema = {
    "title": "ResearchPaper",
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "abstract": {"type": "string"},
        "conference_name": {"type": "string"},
        "publication_year": {"type": "integer"}
    }
}

# Configure parsing
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.SECTION,  # Chunk by semantic sections
    table_parsing_strategy=TableParsingFormat.TSR,
    table_output_mode=TableOutputMode.MARKDOWN,
)

structured_extraction_options = [StructuredExtractionOptions(
    schema_name="ResearchPaper",
    json_schema=json_schema,
)]

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize this figure in the context of the research paper.",
    table_summarization=True,
    table_summarization_prompt="Summarize this table's data and significance to the research.",
)

# Parse document
file_url = "https://example.com/research_paper.pdf"
parse_id = doc_ai.parse(
    file_url,
    parsing_options,
    structured_extraction_options,
    enrichment_options
)

result = doc_ai.wait_for_completion(parse_id)
What You Get:
  • Markdown chunks preserving reading order and structure
  • Structured metadata (title, authors, conference, year)
  • Table summaries that capture data meaning
  • Figure descriptions explaining visual content
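Before indexing anything, it is worth spot-checking the parse output. A minimal sketch, using the same result fields that Step 2 below relies on:
# Quick sanity check of the parse result (fields match the usage in Step 2)
print(f"Parsed {len(result.chunks)} chunks")
print(result.chunks[0].content[:300])  # first markdown chunk, truncated

if result.structured_data:
    print("Structured metadata:", result.structured_data[0].data)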

Step 2: Create Embeddings and Store in Qdrant

Transform parsed content into embeddings and store with metadata for filtering:
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
from uuid import uuid4

# Initialize Qdrant and embedding model
qdrant_client = QdrantClient(":memory:")  # Use QdrantClient(url="...") for production
model = SentenceTransformer("all-MiniLM-L6-v2")

collection_name = "research_papers"

# Create collection
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=model.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE
    )
)

# Extract structured metadata
structured_metadata = result.structured_data[0].data if result.structured_data else {}

# Create embeddings for text chunks
points = []

for chunk in result.chunks:
    embedding = model.encode(chunk.content).tolist()
    payload = {
        **structured_metadata,
        'content': chunk.content,
        'type': 'text'
    }
    points.append(models.PointStruct(
        id=str(uuid4()),
        vector=embedding,
        payload=payload
    ))

# Create embeddings for table summaries
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "table" and fragment.content.summary:
            embedding = model.encode(fragment.content.summary).tolist()
            payload = {
                **structured_metadata,
                'content': fragment.content.summary,
                'type': 'table',
                'page': page.page_number
            }
            points.append(models.PointStruct(
                id=str(uuid4()),
                vector=embedding,
                payload=payload
            ))

# Create embeddings for figure summaries
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "figure" and fragment.content.summary:
            embedding = model.encode(fragment.content.summary).tolist()
            payload = {
                **structured_metadata,
                'content': fragment.content.summary,
                'type': 'figure',
                'page': page.page_number
            }
            points.append(models.PointStruct(
                id=str(uuid4()),
                vector=embedding,
                payload=payload
            ))

# Upload to Qdrant
qdrant_client.upsert(collection_name=collection_name, points=points)

# Create index for filtering by author
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="authors",
    field_schema="keyword",
)

print(f"Uploaded {len(points)} embeddings to Qdrant")
Key Insight: Table and figure summaries get their own embeddings. When users search for data, they’ll retrieve both text explanations and visual content summaries.
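As a quick check after uploading, you can run an unfiltered query and inspect the type of each hit; this sketch reuses the client and model from above (query_points is covered in Step 3), and the query string is just an example:
# A data-oriented query should surface text chunks alongside table/figure summaries
hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("model accuracy across datasets").tolist(),
    limit=5,
).points

for point in hits:
    print(point.payload["type"], "-", point.payload["content"][:80])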

Step 3: Query with Filtering

Combine semantic search with metadata filtering for precise results:
# Query with author filter
query = "Does computer science education improve problem solving skills?"
query_embedding = model.encode(query).tolist()

results = qdrant_client.query_points(
    collection_name=collection_name,
    query=query_embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="authors",
                match=models.MatchValue(value="William G. Griswold"),
            )
        ]
    ),
    limit=5,
).points

# Display results
for point in results:
    print(f"Title: {point.payload.get('title', 'Unknown')}")
    print(f"Authors: {point.payload.get('authors', 'Unknown')}")
    print(f"Score: {point.score:.4f}")
    print(f"Content: {point.payload.get('content')[:200]}...")
    print("-" * 80)
Output Example:
Title: Teaching Problem Solving Through CS Education
Authors: ['William G. Griswold', 'Jane Smith']
Score: 0.8752
Content: Our study demonstrates that computer science courses significantly 
improve students' problem-solving abilities across multiple domains...
--------------------------------------------------------------------------------

Step 4: Build an Intelligent Agent

Let AI decide when to apply filters based on query intent:
from openai import OpenAI

client = OpenAI()

def smart_search(query: str):
    """Use LLM to extract filters and search Qdrant."""
    
    # Extract filters using LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Extract author names from the query if mentioned. Return JSON."
            },
            {"role": "user", "content": query}
        ]
    )
    
    filters_json = response.choices[0].message.content
    # Parse and build Qdrant filter...
    
    # Search with extracted filters
    results = qdrant_client.query_points(
        collection_name=collection_name,
        query=model.encode(query).tolist(),
        query_filter=build_filter(filters_json),
        limit=5
    )
    
    return results

# Example queries
smart_search("What did John Doe publish about neural networks?")
smart_search("Recent papers on transformer architectures from 2024")

How Rich Embeddings Work

Traditional RAG only embeds text chunks. Tables and figures are ignored or poorly represented. This integration changes the data flow:
  1. During parsing: Tensorlake extracts tables and generates summaries like “Comparison of model accuracy across three datasets showing 5-10% improvement”
  2. During embedding: Both text chunks and table/figure summaries get vectorized separately
  3. During storage: Metadata (authors, year, conference) is stored as filterable payload fields
  4. During retrieval: Queries match against text AND visual content summaries
  5. During response: Results include both explanatory text and data from tables/figures
The key insight: Visual content becomes searchable through AI-generated summaries, dramatically improving RAG accuracy for documents with tables and figures.
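The 'type' payload field from Step 2 also lets you restrict retrieval to visual content explicitly. A small sketch, reusing the collection and model from above; the query string is only an example:
# Retrieve table summaries only, via the 'type' payload field
table_hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("accuracy comparison across datasets").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="type", match=models.MatchValue(value="table"))]
    ),
    limit=3,
).points

# For large collections, also index 'type' as a keyword field to speed up filtering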

Use Cases

Academic Research

Search through research papers with complex layouts. Retrieve both methodology text and experimental results from tables in a single query.

Financial Reports

Parse earnings reports, balance sheets, and regulatory filings. Filter by company, quarter, and fiscal year while searching across narrative and tabular data.

Legal Documents

Handle contracts and regulatory documents with proper structure. Filter by contract type, date range, or parties while searching clause content.

Technical Documentation

Process API docs, manuals, and specifications. Search across text explanations and data tables showing parameters, configurations, or benchmarks.

Medical Literature

Parse clinical studies with methods, results, and patient data tables. Filter by study type, date, or authors while retrieving complete experimental context.

Best Practices

1. Optimize Chunking Strategy

Use semantic chunking by section rather than fixed token limits. Include section headers for context.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.SECTION,  # Not FIXED_SIZE
)

2. Create Strategic Indices

Index frequently filtered fields for performance:
# Common filters
qdrant_client.create_payload_index(collection_name, "authors", "keyword")
qdrant_client.create_payload_index(collection_name, "publication_year", "integer")
qdrant_client.create_payload_index(collection_name, "conference_name", "keyword")

3. Handle Large Tables Intelligently

For tables spanning multiple pages, create summaries with proper context:
enrichment_options = EnrichmentOptions(
    table_summarization=True,
    table_summarization_prompt="""Summarize this table's data including:
    1. What the table measures
    2. Key findings or patterns
    3. How it relates to the paper's main argument""",
)

4. Combine Multiple Filter Conditions

Build complex queries that narrow results effectively:
query_filter=models.Filter(
    must=[
        models.FieldCondition(key="publication_year", range=models.Range(gte=2020)),
        models.FieldCondition(key="conference_name", match=models.MatchValue(value="NeurIPS"))
    ]
)

5. Validate Embeddings Quality

Spot-check that table and figure summaries are meaningful:
# Review summaries before embedding
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == "table":
            print(f"Table summary: {fragment.content.summary}")
            # Ensure it's descriptive, not just "A table of data"

Complete Example

Try the full working example with research paper search:

RAG with Filtering Notebook

Complete code walkthrough including agent-based filtering and result ranking

What’s Next?

Build on this foundation and learn more about document AI:

Resources

Need Help?

Join our community to discuss RAG architectures: