Combining Tensorlake’s document parsing engine with Qdrant’s vector database creates powerful RAG applications with more complete, accurate embeddings and advanced filtering capabilities. This integration enables you to build knowledge systems that understand complex documents while providing fast, relevant search results.

Colab Notebook Example

Try the complete Tensorlake + Qdrant integration in this Colab notebook

Why Tensorlake + Qdrant?

Complete Document Understanding

Traditional document parsing often misses critical information or corrupts reading order, leading to incomplete embeddings. Tensorlake provides:

  • Preserved Reading Order: Complex layouts like multi-column text, headers, and mixed content are properly parsed
  • Table & Figure Summarization: Extract meaningful content from visual elements that would otherwise be lost
  • Structured Data Extraction: Pull metadata and key information for advanced filtering
  • Single API Call: Get markdown conversion, structured data, and summaries in one request

When you combine Tensorlake’s comprehensive parsing with Qdrant’s vector capabilities, you get:

  • Richer Embeddings: Include table summaries, figure descriptions, and proper text flow
  • Advanced Filtering: Use extracted metadata to filter before semantic search
  • Better Chunking: Semantic sections (abstract, introduction, methodology) instead of arbitrary splits
  • Accurate Results: Complete context leads to more relevant search results

Use Cases

This integration is particularly powerful for:

  • Academic Research: Search through research papers with complex layouts and important tables/figures
  • Financial Reports: Parse complex financial documents with tables, charts, and structured data
  • Legal Documents: Handle contracts, legal briefs, and regulatory documents with proper structure
  • Technical Documentation: Process manuals, specifications, and guides with diagrams and tables

Implementation Guide

Step 1: Parse Documents with Tensorlake

Start by defining what structured data you want to extract from your documents. This might include metadata like authors, titles, dates, or domain-specific information. Configure Tensorlake to chunk your documents semantically (by sections like abstract, introduction, methodology) rather than using arbitrary size limits.
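
For example, a research-paper schema might look like the following sketch. The field names are illustrative, not required by Tensorlake, but they match the metadata used for filtering later in this guide:

# Illustrative JSON schema for structured extraction; adjust the fields
# to your own domain.
json_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author_names": {"type": "array", "items": {"type": "string"}},
        "conference_name": {"type": "string"},
        "publication_year": {"type": "integer"},
    },
}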

When you send documents to Tensorlake’s parse endpoint, you’ll receive three key outputs:

  • Markdown chunks that preserve reading order and document structure
  • Structured data in JSON format based on your defined schema
  • Table and figure summaries that capture the meaning of visual content

The power here is getting all this information in a single API call, ensuring nothing is lost in the parsing process.

# Imports assume the Tensorlake Python SDK (pip install tensorlake);
# doc_ai is a DocumentAI client authenticated with your API key, and
# file_url points at the document you want to parse.
from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    TableOutputMode,
    TableParsingFormat,
)

doc_ai = DocumentAI(api_key="your-tensorlake-api-key")

# Configure parsing options
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.SECTION,
    table_parsing_strategy=TableParsingFormat.TSR,
    table_output_mode=TableOutputMode.MARKDOWN,
)
# Create structured extraction options with the JSON schema
structured_extraction_options = [StructuredExtractionOptions(
    schema_name="ResearchPaper",
    json_schema=json_schema, # Defined elsewhere
)]
# Create enrichment options
enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize the figure beyond the caption by describing the data as it relates to the context of the research paper.",
    table_summarization=True,
    table_summarization_prompt="Summarize the table beyond the caption by describing the data as it relates to the context of the research paper.",
)

parse_id = doc_ai.parse(file_url, parsing_options, structured_extraction_options, enrichment_options)
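
Once the job is submitted, wait for it to finish and pull the outputs off the result. A rough sketch; the attribute names on the result object are assumptions here, so check the Tensorlake SDK docs for the exact shape:

# Block until parsing completes, then collect the three outputs.
result = doc_ai.wait_for_completion(parse_id)

chunks = result.chunks                        # markdown chunks in reading order
table_summaries = result.table_summaries      # table/figure summaries (assumed name)
structured_metadata = result.structured_data  # JSON matching your schema (assumed name)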

Step 2: Create Qdrant Collection, Embeddings, and Indices

Create a Qdrant collection and transform your parsed content into embeddings. The key is to embed not only the markdown chunks but also the table summaries, storing each as its own point so the content of visual elements stays searchable.

For each document, use the structured data as payload metadata that enables filtering. This means when someone searches for content, they can also filter by author, publication year, conference, or any other extracted metadata.

Create embeddings not just for the text chunks, but also for table and figure summaries. This is crucial because tables often contain the most important data in documents like financial reports or research papers, but they’re typically lost in traditional parsing approaches.
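
Here is a minimal setup sketch, assuming an in-memory Qdrant instance and a sentence-transformers embedding model; the model name and vector size are illustrative choices, not requirements of the integration:

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

qdrant_client = QdrantClient(":memory:")  # or point at a running Qdrant server
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

collection_name = "research_papers"
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=384,  # must match the embedding model's output dimension
        distance=models.Distance.COSINE,
    ),
)

def create_embedding(text: str) -> list[float]:
    # Small helper used by the indexing loop below.
    return model.encode(text).tolist()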

Then, create an index for each kind of data you want to filter on (e.g. author_names).

from uuid import uuid4

points = []

# Create embeddings for both text chunks and table summaries
for chunk in chunks:
    embedding = create_embedding(chunk.content)
    payload = {**structured_metadata, 'content': chunk.content}
    points.append(models.PointStruct(
        id=str(uuid4()),  # Unique ID per point
        vector=embedding,
        payload=payload,
    ))

for table_summary in table_summaries:
    embedding = create_embedding(table_summary.content)
    payload = {**structured_metadata, 'content': table_summary.content}
    points.append(models.PointStruct(
        id=str(uuid4()),  # Unique ID per point
        vector=embedding,
        payload=payload,
    ))

qdrant_client.upsert(collection_name=collection_name, points=points)

qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="author_names",
    field_schema="keyword",
)

Step 3: Query with Filtering

Now you can perform sophisticated searches that combine semantic similarity with precise filtering. Instead of just searching through all content, you can first filter by specific criteria (like author, date range, or document type) and then perform semantic search within those filtered results.

This two-stage approach—filter first, then search—dramatically improves result relevance. You can also search across different content types, meaning your query might return both relevant text passages and important table summaries that contain the information you’re looking for.

The real power emerges when you build an AI agent that can automatically decide when to apply filters based on the user’s query. If someone asks about “John Doe’s research on machine learning,” the agent recognizes it should filter by author before searching, while a general query like “machine learning techniques” would skip filtering.

# Combine filtering and semantic search
points = qdrant_client.query_points(
    collection_name="research_papers",
    query=model.encode("Does computer science education improve problem solving skills?").tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_names",
                match=models.MatchValue(
                    value="William G. Griswold",
                ),
            )
        ]
    ),
    limit=3,
).points

for point in points:
    print(point.payload.get('title', 'Unknown'), point.payload.get('conference_name', 'Unknown'), "score:", point.score)
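
A rough sketch of that agent behavior: build the filter conditionally, so queries that do not name an author search the whole collection (how the author gets detected, e.g. via an LLM call, is left to your agent):

def search(query_text, author=None, limit=3):
    # Only constrain by author when the agent extracted one from the query.
    query_filter = None
    if author is not None:
        query_filter = models.Filter(must=[
            models.FieldCondition(
                key="author_names",
                match=models.MatchValue(value=author),
            )
        ])
    return qdrant_client.query_points(
        collection_name="research_papers",
        query=model.encode(query_text).tolist(),
        query_filter=query_filter,
        limit=limit,
    ).points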

Best Practices

1. Optimize Chunking Strategy

  • Use semantic chunking (by section) rather than fixed-size chunks
  • Consider document structure when defining chunk boundaries
  • Include context from headers and section titles (see the sketch after this list)
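
For instance, prepending the section title to a chunk's text before embedding gives the vector extra context. A sketch, where section_title is a hypothetical attribute on your chunk objects:

# section_title is hypothetical; adapt to the shape of your parsed chunks.
text_to_embed = f"{chunk.section_title}\n\n{chunk.content}"
embedding = create_embedding(text_to_embed)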

2. Leverage Structured Data

  • Extract key metadata for filtering (authors, dates, categories)
  • Create rich payloads that enable complex queries
  • Use structured data to understand document relationships

3. Handle Complex Tables

  • Always include table summaries in your embeddings
  • Consider table context when creating summaries
  • For very large tables, create multiple chunks with proper context (see the sketch after this list)
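
One way to split an oversized markdown table while keeping each piece self-describing is to repeat the header row in every chunk. A minimal sketch:

def chunk_markdown_table(table_md: str, rows_per_chunk: int = 20) -> list[str]:
    # Repeat the header and separator rows in every chunk so each piece
    # can be embedded and retrieved on its own.
    lines = table_md.strip().splitlines()
    header, separator, rows = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, separator, *rows[i:i + rows_per_chunk]])
        for i in range(0, len(rows), rows_per_chunk)
    ]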

4. Design Effective Filters

  • Create indices on frequently filtered fields
  • Combine multiple filter conditions for precise results
  • Use structured data to enable temporal and categorical filtering (see the example after this list)
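
For example, a filter can combine a categorical condition with a temporal range. The publication_year field assumes you extracted and indexed it as in the schema sketch above, and the values here are illustrative:

combined_filter = models.Filter(must=[
    models.FieldCondition(
        key="conference_name",
        match=models.MatchValue(value="SIGCSE"),
    ),
    models.FieldCondition(
        key="publication_year",
        range=models.Range(gte=2020),  # papers from 2020 onward
    ),
])

# Range filters need an integer index on the field:
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="publication_year",
    field_schema="integer",
)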

Performance Considerations

  • Batch Processing: Process multiple documents together for efficiency (see the sketch after this list)
  • Embedding Caching: Cache embeddings for frequently accessed content
  • Index Optimization: Create appropriate indices based on your query patterns
  • Memory Management: For large document collections, consider streaming approaches
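
On the Qdrant side, a simple batching pattern is to upsert points in fixed-size groups rather than one request per point or one giant request (the batch size here is an arbitrary starting point):

# Upsert the points built earlier in fixed-size batches.
BATCH_SIZE = 100
for i in range(0, len(points), BATCH_SIZE):
    qdrant_client.upsert(
        collection_name=collection_name,
        points=points[i:i + BATCH_SIZE],
    )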

Try It Yourself

Ready to build your own Tensorlake + Qdrant integration? Check out our complete example:

Interactive Notebook

Try the complete Tensorlake + Qdrant integration in this Colab notebook

API Documentation

Explore the full Tensorlake API for advanced parsing options

This integration unlocks the full potential of your documents, creating RAG applications that truly understand complex content and deliver accurate, relevant results.