MotherDuck
MotherDuck is a serverless analytics platform built on DuckDB, designed for fast, collaborative data analysis. Combined with Tensorlake’s document parsing and serverless agentic application orchestration, you get end-to-end document intelligence pipelines: Tensorlake runs the Python application that ingests PDFs and other unstructured data, extracts information from them, and lands the results in MotherDuck. Deploying these pipelines takes minutes, and the integration brings information from unstructured data sources into DuckDB for analytics.
Integration Architecture
There are two main ways of integrating Tensorlake with MotherDuck:
- Document Ingestion API: Use Tensorlake’s Document Ingestion API from your existing workflows to extract structured data from documents, then load the results into MotherDuck.
- Full Pipeline on Tensorlake: Build the entire pipeline of ingestion, transformation, and writing to MotherDuck on Tensorlake’s platform. These pipelines are exposed as HTTP APIs and run whenever data is ingested, eliminating infrastructure management and scaling concerns.
Installation
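The examples on this page use the Tensorlake Python SDK and the DuckDB Python client. Assuming the SDK is published as tensorlake on PyPI, a typical environment looks like this (pandas is optional and only needed for DataFrame results):

```bash
pip install tensorlake duckdb pandas
```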
Quick Start: Simple Document-to-Database Integration
This example demonstrates the core integration pattern between Tensorlake’s DocumentAI and MotherDuck.
Step 1: Extract Structured Data from a Document
Define a schema and extract structured data using Tensorlake:
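A minimal sketch of this step is shown below, assuming the tensorlake SDK’s DocumentAI client. The method and argument names (upload, parse, wait_for_completion, structured_extraction_options) are illustrative and may differ between SDK versions, so check the Structured Extraction Guide for the exact API:

```python
# Sketch only: DocumentAI method and argument names are illustrative and may
# differ from your SDK version; see the Structured Extraction Guide.
from pydantic import BaseModel
from tensorlake.documentai import DocumentAI


class Invoice(BaseModel):
    """Example schema: the fields to extract from each invoice."""
    invoice_number: str
    vendor_name: str
    invoice_date: str
    total_amount: float


doc_ai = DocumentAI()  # reads TENSORLAKE_API_KEY from the environment

# Upload the PDF, run structured extraction against the Invoice schema,
# and wait for the parse job to complete.
file_id = doc_ai.upload(path="invoice.pdf")
parse_id = doc_ai.parse(file_id, structured_extraction_options=[Invoice])  # argument name assumed
result = doc_ai.wait_for_completion(parse_id)

# The extracted values come back as schema-shaped JSON.
invoice = Invoice.model_validate(result.structured_data)  # field name assumed
```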
Step 2: Load Data into MotherDuck
Connect to MotherDuck and insert the extracted data:
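Continuing with the Invoice object from Step 1, the load step uses the standard DuckDB Python client. The database name my_database is a placeholder, and the connection assumes a motherduck_token is available in your environment:

```python
import duckdb

# The "md:" prefix routes the connection to MotherDuck instead of a local
# database file; authentication uses the motherduck_token environment variable.
con = duckdb.connect("md:my_database")

con.execute("""
    CREATE TABLE IF NOT EXISTS invoices (
        invoice_number VARCHAR,
        vendor_name    VARCHAR,
        invoice_date   DATE,
        total_amount   DOUBLE
    )
""")

# Insert the extracted record with a parameterized statement.
con.execute(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        invoice.invoice_number,
        invoice.vendor_name,
        invoice.invoice_date,
        invoice.total_amount,
    ],
)
```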
Step 3: Query Your Data
Run SQL analytics on the document data:
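For example, a vendor-level roll-up over the invoices table created in Step 2 (.df() returns a pandas DataFrame, so pandas must be installed):

```python
# Aggregate invoice totals by vendor with plain SQL.
top_vendors = con.execute("""
    SELECT
        vendor_name,
        COUNT(*)          AS invoice_count,
        SUM(total_amount) AS total_spend
    FROM invoices
    GROUP BY vendor_name
    ORDER BY total_spend DESC
""").df()

print(top_vendors)
```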
How the Integration Works
The integration follows a straightforward pipeline:
- Document Processing: Tensorlake’s DocumentAI parses documents and extracts structured data based on your Pydantic schemas
- Data Transformation: Extracted data is converted into a format compatible with DuckDB (typically DataFrames or dictionaries); see the sketch after this list
- Database Loading: Data is loaded into MotherDuck tables using DuckDB’s Python API
- SQL Analytics: Run complex queries, joins, and aggregations on your document data using standard SQL
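The transformation step is usually a one-liner; a sketch, reusing the Invoice schema from the quick start (extracted_invoices is a hypothetical list of extraction results):

```python
import pandas as pd

# Flatten extracted Pydantic objects into a DataFrame whose columns line up
# with the MotherDuck table; extracted_invoices is a hypothetical list of
# Invoice objects produced as in Step 1.
invoices_df = pd.DataFrame([inv.model_dump() for inv in extracted_invoices])
```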
Best Practices
1. Design Schemas for Queryability
Structure your Pydantic models to match your analysis needs:
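For example, keeping fields flat and using SQL-friendly types means each field maps cleanly onto a column you can filter and aggregate (the field names here are illustrative):

```python
from datetime import date
from pydantic import BaseModel


class InvoiceRecord(BaseModel):
    # Flat, typed fields map directly onto table columns and SQL predicates.
    invoice_number: str
    vendor_name: str
    invoice_date: date    # becomes a DATE column, so date-range filters work
    total_amount: float   # becomes a DOUBLE column, so SUM/AVG work
    currency: str
    is_paid: bool         # directly usable in WHERE clauses
```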
2. Handle Nested Data Appropriately
Use DuckDB’s JSON functions for nested structures:
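For instance, nested line items can be kept in a JSON column and unpacked at query time with functions such as json_extract_string (the table layout is illustrative and reuses the connection from the quick start):

```python
import json

# Keep the nested part of the extraction as a JSON column...
con.execute("""
    CREATE TABLE IF NOT EXISTS invoice_items (
        invoice_number VARCHAR,
        line_items     JSON
    )
""")
con.execute(
    "INSERT INTO invoice_items VALUES (?, ?)",
    ["INV-001", json.dumps([{"sku": "A-100", "qty": 2, "price": 19.99}])],
)

# ...and pull individual fields back out with DuckDB's JSON functions.
first_skus = con.execute("""
    SELECT json_extract_string(line_items, '$[0].sku') AS first_sku
    FROM invoice_items
""").fetchall()
```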
3. Process Multiple Documents
When working with multiple documents, extract from all documents first, then load in bulk:
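A sketch of the batch pattern, reusing the DocumentAI client, Invoice schema, and MotherDuck connection from the quick start (file names are illustrative, and the SDK calls carry the same caveats as Step 1):

```python
import pandas as pd

documents = ["invoice_001.pdf", "invoice_002.pdf", "invoice_003.pdf"]

# Extract every document first (method and argument names assumed, as in Step 1)...
rows = []
for path in documents:
    file_id = doc_ai.upload(path=path)
    parse_id = doc_ai.parse(file_id, structured_extraction_options=[Invoice])
    result = doc_ai.wait_for_completion(parse_id)
    rows.append(Invoice.model_validate(result.structured_data).model_dump())

# ...then load the whole batch into MotherDuck in one statement.
invoices_df = pd.DataFrame(rows)
con.register("invoices_df", invoices_df)
con.execute("INSERT INTO invoices SELECT * FROM invoices_df")
```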
Complete Example with Advanced Features
For a more comprehensive example including page classification, multi-document processing, and advanced analytics, see our blog post: Building Document Intelligence Pipelines with Tensorlake and MotherDuck.
What’s Next?
Build on this foundation:
- Structured Extraction Guide - Define custom schemas
- Applications Documentation - Deploy production pipelines
- MotherDuck Documentation - Learn more about MotherDuck features