ChromaDB is an open-source vector database designed for AI applications. When combined with Tensorlake’s document parsing, you can build RAG systems where every generated statement links directly to its source, complete with page numbers and bounding boxes. This integration is critical for legal, medical, financial, and compliance applications where source verification isn’t optional.
Run this end-to-end in Colab:

Why Use Tensorlake + ChromaDB?

The Problem:
  • Traditional RAG can’t prove where answers come from
  • Users have no way to verify AI-generated claims
  • Compliance and audit requirements demand source attribution
  • Hallucinations are impossible to trace back to their origin
The Solution: Tensorlake preserves spatial metadata (page numbers, bounding boxes) during parsing. Combined with ChromaDB’s vector search, you get citation-aware RAG: every AI response includes exact source locations users can verify.

Key Benefits:
  • Citation provenance - Track every claim to specific paragraphs in source documents
  • Audit-ready outputs - Meet regulatory requirements with verifiable source attribution
  • Hallucination detection - Instantly verify whether answers are grounded in your documents
  • Production-ready - Handle legal contracts, medical research, and financial reports with confidence

Installation

pip install tensorlake chromadb

Quick Start

Step 1: Parse Documents with Spatial Metadata

Tensorlake captures not just text, but coordinates of every element:
from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI()
file_id = doc_ai.upload("research_paper.pdf")
result = doc_ai.parse_and_wait(file_id)

assert result.status == ParseStatus.SUCCESSFUL

# Access parsed pages with spatial metadata
pages = result.pages  # Each page contains fragments with bounding boxes

Step 2: Build Citation-Ready Chunks

Create sections with embedded citation anchors while storing coordinates separately:
import json

def build_citation_chunks(result, file_name: str):
    """
    Chunk document by sections, embedding citation anchors inline
    while storing bounding boxes and page numbers in metadata.
    """
    sections = []
    current_section = None
    
    # Group by section headers
    for page in result.pages:
        page_num = page.page_number
        
        for fragment in page.page_fragments:
            content = fragment.content.content.strip()
            bbox = fragment.bbox
            
            if fragment.fragment_type == "section_header":
                if current_section:
                    sections.append(current_section)
                current_section = [{
                    "page_number": page_num,
                    "text": content,
                    "bbox": bbox
                }]
            elif content and current_section is not None:
                current_section.append({
                    "page_number": page_num,
                    "text": content,
                    "bbox": bbox
                })
    
    if current_section:
        sections.append(current_section)
    
    # Build chunks with citation anchors
    chunks, metadatas, ids = [], [], []
    
    for sec_idx, section in enumerate(sections, start=1):
        citation_map = {}
        text_lines = []
        
        for elem_idx, element in enumerate(section, start=1):
            anchor_id = f"S{sec_idx}.{elem_idx}"
            text_lines.append(f"<c id={anchor_id}>{element['text']}</c>")
            citation_map[anchor_id] = {
                "page_number": element["page_number"],
                "bbox": element["bbox"]
            }
        
        chunks.append("\n".join(text_lines))
        metadatas.append({
            "file": file_name,
            "citations": json.dumps(citation_map)
        })
        ids.append(f"section-{sec_idx}")
    
    return chunks, metadatas, ids

Step 3: Store in ChromaDB

import os

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

collection = client.create_collection(
    name="citation_aware_rag",
    embedding_function=embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name="text-embedding-3-small"
    )
)

# Add chunks with citation metadata
chunks, metadatas, ids = build_citation_chunks(result, "research_paper.pdf")
collection.add(documents=chunks, metadatas=metadatas, ids=ids)

Step 4: Query with Automatic Citation Extraction

import json
import os

from openai import OpenAI
from pydantic import BaseModel
from typing import List

class CitedResponse(BaseModel):
    response: str
    citations: List[str]

def query_with_citations(query: str, k: int = 3):
    # Retrieve relevant chunks
    results = collection.query(query_texts=[query], n_results=k)
    context = "\n---\n".join(results["documents"][0])
    
    # Build citation lookup map
    citation_map = {}
    for metadata in results["metadatas"][0]:
        citations = json.loads(metadata["citations"])
        file_name = metadata["file"]
        
        for anchor_id, coords in citations.items():
            citation_map[anchor_id] = {
                "file": file_name,
                "page": coords["page_number"],
                "bbox": coords["bbox"]
            }
    
    # Generate response with citation extraction
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    completion = client.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """Answer questions based on provided context. 
                Extract citation IDs (format: <c id=S#.#>) and include them 
                in your citations list. Only cite sources you actually used."""
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        response_format=CitedResponse
    )
    
    response = completion.choices[0].message.parsed
    
    # Map citations to source locations
    print(f"Answer: {response.response}\n")
    print("Sources:")
    for anchor_id in response.citations:
        cite = citation_map.get(anchor_id)
        if cite:
            print(f"  • {cite['file']} | Page {cite['page']}")
            print(f"    Location: {cite['bbox']}")
    
    return response, citation_map

# Example usage
query_with_citations("What methodology did the researchers use?")

Output

Answer: The researchers used SMOTE (Synthetic Minority Over-sampling 
Technique) to address class imbalance in their training dataset, which 
improved model performance on minority classes.

Sources:
  • research_paper.pdf | Page 3
    Location: {'x': 72, 'y': 234, 'width': 450, 'height': 24}
  • research_paper.pdf | Page 3
    Location: {'x': 72, 'y': 260, 'width': 450, 'height': 36}
Every answer includes verifiable source locations. Users can jump directly to the cited text in the original document.

How Citation Anchors Work

Traditional RAG loses document structure during chunking. You can’t trace an AI answer back to its source. Citation-aware RAG changes the architecture:
  1. During parsing: Tensorlake captures bounding boxes and page numbers for every text element
  2. During chunking: We embed citation anchors (<c id=S1.2>) directly in the text while storing coordinates in metadata
  3. During retrieval: Citation anchors travel with the text, so the LLM sees which sentences came from where
  4. During generation: The LLM naturally references citation IDs when answering
  5. After generation: We map citation IDs back to page numbers and bounding boxes for verification
The key insight: Citation anchors stay with the text during embedding, ensuring semantic relevance, while spatial coordinates stay in metadata, keeping embeddings clean.
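The five steps above can be sketched end to end in a few lines. This is a minimal, self-contained illustration using made-up chunk text and coordinates; the anchor format matches the S{section}.{element} IDs from Step 2:

```python
import json
import re

# A chunk as stored in ChromaDB, with inline citation anchors (Step 2 format).
chunk = "<c id=S1.1>Methods</c>\n<c id=S1.2>We applied SMOTE to balance classes.</c>"

# Metadata stored alongside the chunk: anchor -> page/bbox, serialized as JSON.
citations = json.dumps({
    "S1.1": {"page_number": 3, "bbox": {"x": 72, "y": 210, "width": 450, "height": 24}},
    "S1.2": {"page_number": 3, "bbox": {"x": 72, "y": 234, "width": 450, "height": 24}},
})

# After generation: extract every anchor ID that appears in the chunk...
anchors = re.findall(r"<c id=(S\d+\.\d+)>", chunk)

# ...and map each one back to its spatial location for verification.
citation_map = json.loads(citations)
locations = {a: citation_map[a]["page_number"] for a in anchors}
print(locations)  # {'S1.1': 3, 'S1.2': 3}
```

Because the anchors are plain text, they survive embedding, retrieval, and generation unchanged, which is what makes the final lookup step possible.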

Use Cases

Legal Document Analysis

Extract contract clauses with exact page and paragraph references for court filings. Citation metadata creates automatic audit trails.

Medical Research

Build literature review systems that cite specific sentences from research papers. Meet peer review standards with verifiable references.

Financial Compliance

Generate audit reports where every figure traces back to source statements in regulatory filings. Essential for SOX and SEC compliance.

Insurance Claims Processing

Verify policy coverage with direct links to relevant policy document sections. Speed up claims review while maintaining accuracy.

Pharmaceutical Documentation

Meet FDA requirements by citing specific sections in clinical trial reports. Citation metadata enables regulatory audit trails.

Best Practices

1. Optimize Chunking Strategy

Chunk by semantic boundaries (sections, subsections) rather than character counts. Include section headers for better retrieval context.
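Semantic sections can still exceed an embedding model’s comfortable input size. One hedged way to handle that (the character budget here is an arbitrary illustrative choice, not a Tensorlake or ChromaDB requirement) is to split on line boundaries so every `<c id=...>` element stays intact:

```python
def split_section(chunk: str, max_chars: int = 2000) -> list[str]:
    """Split an anchored chunk on line boundaries so each piece stays
    under max_chars and no <c id=...> element is cut in half."""
    pieces, current, size = [], [], 0
    for line in chunk.split("\n"):
        # Flush the current piece before this line would push it over budget.
        if current and size + len(line) > max_chars:
            pieces.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the joining newline
    if current:
        pieces.append("\n".join(current))
    return pieces

# Example: a long section of 100 anchored lines split into sub-chunks.
long_chunk = "\n".join(f"<c id=S1.{i}>paragraph {i}</c>" for i in range(1, 101))
parts = split_section(long_chunk, max_chars=500)
```

Each sub-chunk keeps its original anchors, so the citation map in the section metadata still resolves every ID without modification.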

2. Validate Citation Accuracy

Implement validation to ensure citation integrity:
def validate_citations(response, citation_map, context):
    """Verify that cited anchors exist in context."""
    for anchor_id in response.citations:
        if anchor_id not in citation_map:
            print(f"⚠️  Invalid citation: {anchor_id}")
        elif f"<c id={anchor_id}>" not in context:
            print(f"⚠️  Citation not found in context: {anchor_id}")
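A self-contained variant of this check, sketched below, returns the offending anchors instead of only printing warnings, so callers can retry the query or flag the response. The example inputs are hypothetical:

```python
import re

def invalid_citations(cited: list[str], citation_map: dict, context: str) -> list[str]:
    """Return cited anchor IDs that are unknown or absent from the
    retrieved context (a grounding failure worth flagging)."""
    present = set(re.findall(r"<c id=(S\d+\.\d+)>", context))
    return [a for a in cited if a not in citation_map or a not in present]

# Hypothetical example: the model cited S1.3, which was never retrieved.
context = "<c id=S1.1>alpha</c>\n<c id=S1.2>beta</c>"
cmap = {"S1.1": {}, "S1.2": {}}
print(invalid_citations(["S1.1", "S1.3"], cmap, context))  # ['S1.3']
```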

3. Adapt Citation Formats

Customize anchor formats for your domain:
# Legal: Paragraph numbering
anchor = f"¶{section_idx}.{elem_idx}"

# Academic: Section.subsection.item
anchor = f"{section_title}.{elem_idx}"

# Medical: Protocol step IDs
anchor = f"PROTOCOL_{section_idx}_STEP_{elem_idx}"

4. Handle Tables and Figures

For complex documents with tables, use Tensorlake’s table summaries as separate chunks with their own citation anchors.
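A hedged sketch of that approach, assuming you already have a table summary string from the parse result (the T{n} anchor prefix and the helper name are illustrative conventions, not part of the Tensorlake API):

```python
import json

def build_table_chunk(table_summary: str, page_number: int, bbox: dict,
                      table_idx: int, file_name: str):
    """Wrap a table summary in its own citation anchor so retrieved
    tables are traceable just like text sections."""
    anchor_id = f"T{table_idx}"
    chunk = f"<c id={anchor_id}>{table_summary}</c>"
    metadata = {
        "file": file_name,
        "citations": json.dumps({anchor_id: {"page_number": page_number, "bbox": bbox}}),
    }
    return chunk, metadata, f"table-{table_idx}"

# Example with made-up coordinates; pass the results straight to collection.add().
chunk, meta, cid = build_table_chunk(
    "Table 2: accuracy by class before and after SMOTE.",
    page_number=4,
    bbox={"x": 72, "y": 300, "width": 450, "height": 120},
    table_idx=2,
    file_name="research_paper.pdf",
)
```

Storing tables as separate chunks keeps their summaries from diluting the embeddings of surrounding prose while preserving a verifiable source location for every figure cited.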

5. Use Persistent Storage

For production, use chromadb.PersistentClient(path="./chroma_db") to avoid re-embedding on restart.

Complete Example

Try the full working example with research paper analysis:

Citation-Aware RAG Notebook

Complete code walkthrough including citation validation and accuracy metrics

Resources

Need Help?

Join our community to discuss citation-aware RAG: