MotherDuck

MotherDuck is a serverless analytics platform built on DuckDB, designed for fast, collaborative data analysis. Combined with Tensorlake’s document parsing and serverless agentic application orchestration, it gives you end-to-end document intelligence pipelines: Tensorlake runs the Python application that ingests PDFs and other unstructured data, extracts structured information, and lands the results in MotherDuck. Deploying these pipelines takes minutes, and the integration brings information locked in unstructured sources into DuckDB for SQL analytics.

Integration Architecture

There are two main ways of integrating Tensorlake with MotherDuck:
  1. Document Ingestion API: Use Tensorlake’s Document Ingestion API from your existing workflows to extract structured data from documents, then load the results into MotherDuck.
  2. Full Pipeline on Tensorlake: Build the entire pipeline of ingestion, transformation, and writing to MotherDuck on Tensorlake’s platform. These pipelines are exposed as HTTP APIs and run whenever data is ingested, eliminating infrastructure management and scaling concerns.

Installation

pip install tensorlake duckdb==1.3.2
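
Both services authenticate through environment variables. A minimal setup sketch, assuming the variable names each client reads (`motherduck_token` for `duckdb.connect('md:...')`, and `TENSORLAKE_API_KEY` for `DocumentAI()`; verify both against your account settings):

import os

# Assumed variable names; set these before running the examples below.
os.environ["motherduck_token"] = "<your-motherduck-token>"       # read by duckdb.connect('md:...')
os.environ["TENSORLAKE_API_KEY"] = "<your-tensorlake-api-key>"   # read by DocumentAI()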

Quick Start: Simple Document-to-Database Integration

This example demonstrates the core integration pattern between Tensorlake’s DocumentAI and MotherDuck.

Step 1: Extract Structured Data from a Document

Define a schema and extract structured data using Tensorlake:
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions
from pydantic import BaseModel, Field
from typing import List

# Define your extraction schema
class CompanyInfo(BaseModel):
    """Basic company information from a document"""
    company_name: str = Field(description="Name of the company")
    revenue: str = Field(description="Annual revenue")
    industry: str = Field(description="Primary industry")

# Initialize DocumentAI
doc_ai = DocumentAI()

# Extract structured data
result = doc_ai.parse_and_wait(
    file="https://example.com/company-report.pdf",
    structured_extraction_options=[
        StructuredExtractionOptions(
            schema_name="CompanyInfo",
            json_schema=CompanyInfo
        )
    ]
)

extracted_data = result.structured_data[0].data
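
If extraction succeeds, `extracted_data` is a plain dict keyed by the schema fields, for example (illustrative values only):

print(extracted_data)
# {'company_name': 'Acme Corp', 'revenue': '1200000', 'industry': 'Manufacturing'}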

Step 2: Load Data into MotherDuck

Connect to MotherDuck and insert the extracted data:
import duckdb
import pandas as pd

# Connect to MotherDuck (uses $motherduck_token environment variable)
con = duckdb.connect('md:my_database')

# Convert extracted data to DataFrame
df = pd.DataFrame([extracted_data])

# Create table and load data
con.execute("CREATE OR REPLACE TABLE companies AS SELECT * FROM df")

# Verify the data
result = con.execute("SELECT * FROM companies").fetchdf()
print(result)
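
CREATE OR REPLACE rebuilds the table on every run. To accumulate extractions across runs instead, a sketch using standard DuckDB SQL:

# Create the table once (empty, schema copied from df), then append on each run
con.execute("CREATE TABLE IF NOT EXISTS companies AS SELECT * FROM df WHERE 1=0")
con.execute("INSERT INTO companies SELECT * FROM df")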

Step 3: Query Your Data

Run SQL analytics on the document data:
# Example: Query companies by industry
industry_summary = con.execute("""
    SELECT 
        industry,
        COUNT(*) as company_count,
        AVG(TRY_CAST(revenue AS DECIMAL)) as avg_revenue  -- TRY_CAST returns NULL instead of erroring on non-numeric text
    FROM companies
    GROUP BY industry
    ORDER BY company_count DESC
""").fetchdf()

print(industry_summary)
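
Because `revenue` is extracted as a string, formatted values like "$1,200,000" become NULL under TRY_CAST. A defensive sketch that strips currency symbols and separators first (the regex is an assumption; suffixes like "M" need handling specific to your documents):

clean = con.execute("""
    SELECT
        company_name,
        TRY_CAST(regexp_replace(revenue, '[^0-9.]', '', 'g') AS DECIMAL(18, 2)) as revenue_numeric
    FROM companies
""").fetchdf()

print(clean)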

How the Integration Works

The integration follows a straightforward pipeline:
  1. Document Processing: Tensorlake’s DocumentAI parses documents and extracts structured data based on your Pydantic schemas
  2. Data Transformation: Extracted data is converted into a format compatible with DuckDB (typically DataFrames or dictionaries)
  3. Database Loading: Data is loaded into MotherDuck tables using DuckDB’s Python API
  4. SQL Analytics: Run complex queries, joins, and aggregations on your document data using standard SQL
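
Put together, the pipeline fits in one small function. A minimal sketch reusing the Quick Start names (`doc_ai`, `con`, `CompanyInfo`):

def document_to_motherduck(file_url: str, table: str = "companies") -> None:
    # Step 1: parse the document and extract structured data
    result = doc_ai.parse_and_wait(
        file=file_url,
        structured_extraction_options=[
            StructuredExtractionOptions(schema_name="CompanyInfo", json_schema=CompanyInfo)
        ],
    )
    # Steps 2-3: convert to a DataFrame and append to MotherDuck
    df = pd.DataFrame([result.structured_data[0].data])
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} AS SELECT * FROM df WHERE 1=0")
    con.execute(f"INSERT INTO {table} SELECT * FROM df")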

Best Practices

1. Design Schemas for Queryability

Structure your Pydantic models to match your analysis needs:
# Good schema for SQL analytics
class CompanyFiling(BaseModel):
    company_name: str  # For GROUP BY queries
    filing_date: str   # For time series analysis
    fiscal_year: str   # For year-over-year comparisons
    risk_count: int    # For aggregations
    risks: List[dict]  # Nested data for detailed queries
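
With fields like these, the analyses the comments point at become one-liners. For example (a sketch; assumes a `filings` table loaded the same way as `companies` in the Quick Start):

# Year-over-year risk counts
yoy = con.execute("""
    SELECT fiscal_year, SUM(risk_count) as total_risks
    FROM filings
    GROUP BY fiscal_year
    ORDER BY fiscal_year
""").fetchdf()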

2. Handle Nested Data Appropriately

Use DuckDB’s JSON functions for nested structures:
# Extract from nested arrays
query = """
    SELECT 
        company_name,
        json_extract(item.value, '$.field') as extracted_field
    FROM companies,
    json_each(nested_array) as item
"""

result = con.execute(query).fetchdf()
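
If the nested field lands as a native DuckDB LIST of STRUCTs rather than a JSON string (common when loading via pandas), UNNEST is often simpler (a sketch; `filings` and `risks` are the hypothetical names from the schema above):

# One output row per element of the risks list
risk_rows = con.execute("""
    SELECT company_name, UNNEST(risks) as risk
    FROM filings
""").fetchdf()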

3. Process Multiple Documents

When working with multiple documents, extract from all of them first, then load the results in one bulk write:
document_urls = ["url1.pdf", "url2.pdf", "url3.pdf"]
all_extractions = []

# Extract data from all documents
for url in document_urls:
    result = doc_ai.parse_and_wait(
        file=url,
        structured_extraction_options=[
            StructuredExtractionOptions(
                schema_name="CompanyInfo",
                json_schema=CompanyInfo
            )
        ]
    )
    all_extractions.append(result.structured_data[0].data)

# Load all data at once
df = pd.DataFrame(all_extractions)
con.execute("CREATE OR REPLACE TABLE companies AS SELECT * FROM df")

Complete Example with Advanced Features

For a more comprehensive example including page classification, multi-document processing, and advanced analytics, see our blog post: Building Document Intelligence Pipelines with Tensorlake and MotherDuck

What’s Next?

Build on this foundation: