Build more accurate RAG applications by combining Tensorlake’s comprehensive document parsing with Qdrant’s vector search capabilities
Combining Tensorlake’s document parsing engine with Qdrant’s vector database creates powerful RAG applications with more complete,
accurate embeddings and advanced filtering capabilities. This integration enables you to build knowledge systems that understand
complex documents while providing fast, relevant search results.
Step 1: Parse Documents with Tensorlake
Start by defining what structured data you want to extract from your documents. This might include metadata like authors,
titles, dates, or domain-specific information. Configure Tensorlake to chunk your documents semantically (by sections like abstract,
introduction, or methodology) rather than by arbitrary size limits.

When you send documents to Tensorlake's parse endpoint, you'll receive three key outputs:
Markdown chunks that preserve reading order and document structure
Structured data in JSON format based on your defined schema
Table and figure summaries that capture the meaning of visual content
The power here is getting all this information in a single API call, ensuring nothing is lost in the parsing process.
```python
# Configure parsing options
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.SECTION,
    table_parsing_strategy=TableParsingFormat.TSR,
    table_output_mode=TableOutputMode.MARKDOWN,
)

# Create structured extraction options with the JSON schema
structured_extraction_options = [
    StructuredExtractionOptions(
        schema_name="ResearchPaper",
        json_schema=json_schema,  # Defined elsewhere
    )
]

# Create enrichment options
enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize the figure beyond the caption by describing the data as it relates to the context of the research paper.",
    table_summarization=True,
    table_summarization_prompt="Summarize the table beyond the caption by describing the data as it relates to the context of the research paper.",
)

parse_id = doc_ai.parse(file_url, parsing_options, structured_extraction_options, enrichment_options)
```
Step 2: Create Qdrant Collection, Embeddings, and Indices
Create a Qdrant collection and transform your parsed content into embeddings. The key insight is to create rich, searchable content by embedding your markdown chunks and your table summaries as separate points.

For each document, attach the structured data as payload metadata to enable filtering. When someone searches for content, they can then also filter by author, publication year, conference, or any other extracted field.

Create embeddings not just for the text chunks, but also for the table and figure summaries. This is crucial because tables often contain the most important data in documents like financial reports or research papers, yet they are typically lost in traditional parsing approaches.

Then, create a payload index for each field you want to filter on (e.g. author_names).
```python
from uuid import uuid4

from qdrant_client import models

points = []

# Create embeddings for both text chunks and table summaries
for chunk in chunks:
    embedding = create_embedding(chunk.content)
    payload = {**structured_metadata, 'content': chunk.content}
    points.append(
        models.PointStruct(
            id=str(uuid4()),  # Unique ID
            vector=embedding,
            payload=payload,
        )
    )

for table_summary in table_summaries:
    embedding = create_embedding(table_summary.content)
    payload = {**structured_metadata, 'content': table_summary.content}
    points.append(
        models.PointStruct(
            id=str(uuid4()),  # Unique ID
            vector=embedding,
            payload=payload,
        )
    )

qdrant_client.upsert(collection_name=collection_name, points=points)

# Index the payload field you want to filter on
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="author_names",
    field_schema="keyword",
)
```
Step 3: Combine Filtering with Semantic Search
Now you can perform sophisticated searches that combine semantic similarity with precise filtering. Instead of searching through all content, you can first filter by specific criteria (such as author, date range, or document type) and then perform semantic search within the filtered results.

This two-stage approach (filter first, then search) dramatically improves result relevance. You can also search across different content types, so a single query might return both relevant text passages and table summaries that contain the information you're looking for.

The real power emerges when you build an AI agent that automatically decides when to apply filters based on the user's query. If someone asks about "John Doe's research on machine learning," the agent recognizes that it should filter by author before searching, while a general query like "machine learning techniques" would skip filtering.
```python
# Combine filtering and semantic search
points = qdrant_client.query_points(
    collection_name="research_papers",
    query=model.encode("Does computer science education improve problem solving skills?").tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_names",
                match=models.MatchValue(
                    value="William G. Griswold",
                ),
            )
        ]
    ),
    limit=3,
).points

for point in points:
    print(
        point.payload.get('title', 'Unknown'),
        point.payload.get('conference_name', 'Unknown'),
        "score:", point.score,
    )
```
This integration unlocks the full potential of your documents, creating RAG applications that truly understand complex content and deliver accurate, relevant results.