Analyzing AI Risk Disclosures in SEC Filings with Tensorlake & MotherDuck

Track how AI risk disclosures evolved across Microsoft, Google, and Meta from 2021-2025 by parsing 40 SEC filings, extracting structured risk data, and running SQL analytics in MotherDuck.
Try out this example using this Colab Notebook

Track AI Risk Evolution Across Tech Companies

Let’s set the context for this example: you’ll build a document analytics pipeline that processes SEC filings from major tech companies to track how AI risk disclosures have evolved from 2021-2025. You’ll learn how to:
  • Use Tensorlake’s Page Classification to identify risk factor pages with VLMs
  • Extract structured data from only relevant pages using Pydantic schemas
  • Load parsed document data into MotherDuck for SQL analytics
  • Query trends, compare companies, and discover emerging risk patterns

The Challenge

Major tech companies file lengthy annual (10-K) and quarterly (10-Q) reports, often 100-200+ pages each. AI-related risk disclosures are scattered throughout these documents, making manual analysis time-consuming and prone to missing critical information.

Our Solution

We’ll analyze 40 SEC filings from Microsoft, Google, and Meta spanning 2021-2025 to:
  1. Use VLMs to identify pages containing AI risk factors (reducing processing from ~200 pages to ~20 per document)
  2. Extract structured risk data from only relevant pages
  3. Store and analyze trends in MotherDuck’s cloud data warehouse
  4. Uncover emerging AI risk patterns and regulatory concerns

Prerequisites

Before you begin, you’ll need:
  • A Tensorlake API key
  • A MotherDuck account and access token
  • A recent Python 3 installation with pip

Build Your Document Analytics Pipeline

Set up your environment

Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install necessary packages

The tensorlake package includes DocumentAI for parsing, while duckdb provides the MotherDuck connector.
pip install --upgrade tensorlake duckdb==1.3.2

Configure your API keys

Set environment variables for authentication:
export TENSORLAKE_API_KEY="your_tensorlake_api_key"
export MOTHERDUCK_TOKEN="your_motherduck_token"
Or create a .env file:
TENSORLAKE_API_KEY=your_tensorlake_api_key
MOTHERDUCK_TOKEN=your_motherduck_token
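
If you keep the keys in a .env file, one way to load them at runtime is python-dotenv. This is an optional sketch; python-dotenv is not part of the install step above, so add it with pip install python-dotenv if you take this route.
import os
from dotenv import load_dotenv

# Load variables from .env into the process environment (requires python-dotenv)
load_dotenv()

# Fail fast if either key is missing
assert os.getenv("TENSORLAKE_API_KEY"), "TENSORLAKE_API_KEY is not set"
assert os.getenv("MOTHERDUCK_TOKEN"), "MOTHERDUCK_TOKEN is not set"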

Prepare your imports

import os
import json
import duckdb
import pandas as pd
from tensorlake.documentai import DocumentAI, PageClassConfig, StructuredExtractionOptions
from pydantic import BaseModel, Field
from typing import List, Optional

Configure target documents

We’ll analyze SEC filings from three AI leaders. These URLs point to 10-K (annual) and 10-Q (quarterly) filings:
# Company configurations
AI_COMPANIES = {
    "MSFT": "Microsoft",
    "GOOGL": "Alphabet",
    "META": "Meta Platforms"
}

BASE_URL = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev"

FILING_URLS = {
    "GOOGL": [
        f"{BASE_URL}/goog-10q-june-25.pdf",
        f"{BASE_URL}/goog-10q-march-25.pdf",
        f"{BASE_URL}/goog-10k-december-24.pdf",
        f"{BASE_URL}/goog-10q-september-24.pdf",
        f"{BASE_URL}/goog-10q-june-24.pdf",
        # ... more filings
    ],
    "MSFT": [
        f"{BASE_URL}/msft-10k-june-25.pdf",
        f"{BASE_URL}/msft-10q-march-25.pdf",
        # ... more filings
    ],
    "META": [
        f"{BASE_URL}/meta-10q-june-25.pdf",
        f"{BASE_URL}/meta-10q-march-25.pdf",
        # ... more filings
    ]
}

Step 1: Classify Risk Factor Pages with VLMs

Using Tensorlake’s Vision Language Models, we’ll scan all filings to identify pages containing AI-related risk factors. This typically reduces processing from ~200 pages to ~20-30 relevant pages per document:
doc_ai = DocumentAI()

# Define what pages to identify
page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    )
]

# Store classified pages for each document
document_ai_risk_pages = {}

# Classify all filings
for company in AI_COMPANIES:
    for file_url in FILING_URLS[company]:
        print(f"Classifying: {file_url}")
        
        # Classify pages
        parse_id = doc_ai.classify(
            file_url=file_url,
            page_classifications=page_classifications
        )
        
        # Wait for completion
        result = doc_ai.wait_for_completion(parse_id=parse_id)
        
        # Extract risk factor page numbers
        for page_class in result.page_classes:
            if page_class.page_class == "risk_factors":
                document_ai_risk_pages[file_url] = page_class.page_numbers
                print(f"  Found risk pages: {page_class.page_numbers}")

Review classification results

Let’s examine which pages were identified as containing AI risk factors:
for file_url, page_numbers in document_ai_risk_pages.items():
    filename = os.path.basename(file_url)
    print(f"{filename}: Pages {page_numbers}")
You’ll notice variation across companies and time periods—newer filings often have more pages dedicated to AI risks.
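
A quick way to quantify that variation is to count the classified pages per filing. This small sketch reuses the document_ai_risk_pages dictionary built above:
# Count how many risk-factor pages each filing contributed
page_counts = {
    os.path.basename(url): len(pages)
    for url, pages in document_ai_risk_pages.items()
}

# Print filings with the most AI risk pages first
for filename, count in sorted(page_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{filename}: {count} risk pages")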

Step 2: Define Extraction Schema

We’ll extract structured data about AI risks including categories (Operational, Regulatory, Competitive, etc.), descriptions, and severity indicators:
class AIRiskMention(BaseModel):
    """Individual AI-related risk mention"""
    risk_category: str = Field(
        description="Category: Operational, Regulatory, Competitive, Ethical, Security, Liability"
    )
    risk_description: str = Field(description="Description of the AI risk")
    severity_indicator: Optional[str] = Field(None, description="Severity level if mentioned")
    citation: str = Field(description="Page reference")

class AIRiskExtraction(BaseModel):
    """Complete AI risk data from a filing"""
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention] = []
    num_ai_risk_mentions: int = 0
    ai_strategy_mentioned: bool = False
    ai_investment_mentioned: bool = False
    ai_competition_mentioned: bool = False
    regulatory_ai_risk: bool = False
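
To make the target shape concrete, here is a hypothetical instance of the schema; the values below are illustrative and not taken from any filing:
# Illustrative only: a hand-built example of what one extraction might look like
example = AIRiskExtraction(
    company_name="Microsoft",
    ticker="MSFT",
    filing_type="10-K",
    filing_date="2025-06-30",
    fiscal_year="2025",
    ai_risk_mentioned=True,
    ai_risk_mentions=[
        AIRiskMention(
            risk_category="Operational",
            risk_description="AI features may produce flawed or harmful output.",
            citation="Page 23",
        )
    ],
    num_ai_risk_mentions=1,
)
print(example)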

Step 3: Extract Structured Risk Data

Now we extract detailed AI risk information from only the classified pages. This targeted approach processes ~15% of pages while capturing 100% of relevant risk disclosures:
results = {}

for file_url, page_numbers in document_ai_risk_pages.items():
    print(f"Extracting from: {file_url}")
    
    # Convert page numbers to comma-separated string
    page_range = ",".join(str(i) for i in page_numbers)
    print(f"  Processing pages: {page_range}")
    
    # Parse and extract structured data
    result = doc_ai.parse_and_wait(
        file=file_url,
        page_range=page_range,
        structured_extraction_options=[
            StructuredExtractionOptions(
                schema_name="AIRiskExtraction",
                json_schema=AIRiskExtraction
            )
        ]
    )
    
    results[file_url] = result
    print(f"  ✓ Extracted {len(result.structured_data[0].data.get('ai_risk_mentions', []))} risk mentions")

Step 4: Save Extracted Data to JSON

Export each filing’s risk data to JSON files for loading into MotherDuck:
json_files = []

for file_url, result in results.items():
    if result.structured_data:
        # Extract filename from URL
        filename = os.path.basename(file_url)  # "msft-10k-june-25.pdf"
        json_filename = filename.replace('.pdf', '.json')
        
        # Write to file
        with open(json_filename, 'w') as f:
            json.dump(result.structured_data[0].data, f, indent=2, default=str)
        
        json_files.append(json_filename)
        print(f"✓ Saved to {json_filename}")

Step 5: Load Data into MotherDuck

Create a cloud-based data warehouse table in MotherDuck to enable fast SQL analytics across all filings:
# Connect to MotherDuck
con = duckdb.connect('md:ai_risk_factors')

# Drop existing table if present
con.execute("DROP TABLE IF EXISTS ai_risk_filings")

# Prepare rows for loading
rows = []
for filename in json_files:
    with open(filename, 'r') as f:
        data = json.load(f)
    
    # Convert nested array to JSON string for storage
    data['ai_risk_mentions'] = json.dumps(data.get('ai_risk_mentions', []))
    rows.append(data)

# Convert to DataFrame and create table
df = pd.DataFrame(rows)
con.execute("CREATE TABLE ai_risk_filings AS SELECT * FROM df")

# Verify the data
result = con.execute("SELECT * FROM ai_risk_filings").fetchdf()
print(f"Loaded {len(result)} filings into MotherDuck")
print(result.head())
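
Because the table lives in MotherDuck rather than a local file, any later session can reconnect to the same database and query it. A minimal sketch:
# Reconnect in a fresh session and confirm the table is still there
con2 = duckdb.connect('md:ai_risk_factors')
count = con2.execute("SELECT COUNT(*) FROM ai_risk_filings").fetchone()[0]
print(f"ai_risk_filings currently holds {count} filings")
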
Now the real power emerges—run SQL analytics on your document data to uncover insights.

Query 1: Risk Category Distribution

Understand the breakdown of AI risk categories across all companies:
# Extract all risk mentions from JSON column
all_risks = []
for _, row in con.execute("SELECT company_name, ai_risk_mentions FROM ai_risk_filings").fetchdf().iterrows():
    risks = json.loads(row['ai_risk_mentions'])
    if not risks:
        continue
    for risk in risks:
        all_risks.append({
            'company_name': row['company_name'],
            'risk_category': risk.get('risk_category'),
            'risk_description': risk.get('risk_description')
        })

risks_df = pd.DataFrame(all_risks)

# Count by category
risk_categories = risks_df.groupby('risk_category').agg(
    total_mentions=('risk_category', 'count'),
    companies_mentioning=('company_name', 'nunique')
).reset_index().sort_values('total_mentions', ascending=False)

print("Risk Category Distribution:")
print(risk_categories)
Expected Output:
   risk_category          total_mentions  companies_mentioning
0  Operational           37              3
1  Ethical               28              3
2  Regulatory            26              3
3  Competitive            7              2
4  Security               4              2
5  Liability              3              1
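
To visualize the distribution, a quick bar chart works well. This sketch assumes matplotlib is available (it is not installed in the setup step above):
# Plot total mentions per risk category (assumes matplotlib is installed)
import matplotlib.pyplot as plt

ax = risk_categories.plot.bar(
    x='risk_category',
    y='total_mentions',
    legend=False,
    title='AI Risk Mentions by Category'
)
ax.set_xlabel('Risk category')
ax.set_ylabel('Total mentions')
plt.tight_layout()
plt.show()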

Query 2: Deep Dive into Operational Risks

Extract the most detailed operational risk descriptions from each company:
operational_risks_df = con.execute("""
    SELECT
        company_name,
        ticker,
        json_extract(risk.value, '$.risk_description') as risk_description,
        json_extract(risk.value, '$.citation') as citation
    FROM
        ai_risk_filings,
        json_each(ai_risk_mentions) as risk
    WHERE
        ai_risk_mentions IS NOT NULL
        AND ai_risk_mentions != '[]'
        AND json_extract(risk.value, '$.risk_category') = 'Operational'
""").fetchdf()

# Get longest (most detailed) operational risk per company
operational_risks_df['description_length'] = operational_risks_df['risk_description'].apply(
    lambda x: len(x) if pd.notna(x) else 0
)

top_operational = (
    operational_risks_df
    .sort_values('description_length', ascending=False)
    .groupby('company_name')
    .head(1)
    .reset_index(drop=True)
)

# Print results
for _, row in top_operational.iterrows():
    print(f"\n{row['company_name']} ({row['ticker']}):")
    print("-" * 100)
    print(f"Citation: {row['citation']}")
    print(row['risk_description'][:500])  # First 500 chars
    print()
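
Disclosure volume over time is also easy to pull out. This sketch assumes the fiscal_year and num_ai_risk_mentions fields from the schema were populated for each filing:
# Average AI risk mentions per filing, by fiscal year and company
trend_df = con.execute("""
    SELECT
        fiscal_year,
        company_name,
        COUNT(*) AS filings,
        AVG(num_ai_risk_mentions) AS avg_risk_mentions
    FROM ai_risk_filings
    GROUP BY fiscal_year, company_name
    ORDER BY fiscal_year, company_name
""").fetchdf()

print(trend_df)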

Key Insights Discovered

Through this analysis pipeline, we’ve:
  • Processed 40 SEC filings (~6,000+ total pages)
  • Identified and extracted AI risk disclosures from relevant pages
  • Built a queryable database of AI risk evolution from 2021-2025

Emerging Trends:

  1. Operational risks dominate (37 mentions) - All three companies express concerns about AI infrastructure costs, development challenges, and potential misuse of AI systems
  2. Ethical considerations intensifying (28 mentions) - Growing focus on bias, harmful content, and societal impact, particularly around generative AI
  3. Regulatory landscape evolving rapidly - 2025 filings show increased mentions of specific regulations (EU AI Act, US AI Executive Order)
  4. New risk categories emerging in 2025:
    • Liability risks - Meta explicitly discussing third-party misuse of open-source AI
    • Intellectual property concerns - Copyright and training data issues becoming prominent
    • Energy dependencies - Companies highlighting reliance on computing power
  5. Risk disclosure volume increasing - Average risk mentions per filing grew from 2.0 in 2022 to 7.0 in 2024

Company-Specific Patterns:

  • Microsoft: Most comprehensive risk disclosures (55 total mentions), heavy focus on operational (19) and ethical (17) risks
  • Meta: Balanced concern across operational (16) and regulatory (16) risks, unique focus on open-source AI liability
  • Alphabet: More measured disclosures (10 total), but showing acceleration in 2025

Adapt This Pipeline for Your Use Case

This pipeline can be adapted for any document analysis need:
  • ESG disclosures - Track sustainability commitments and progress
  • Financial metrics tracking - Extract KPIs across earnings reports
  • Competitive intelligence - Monitor competitor product launches and strategies
  • Regulatory compliance monitoring - Alert on new compliance requirements
The combination of Tensorlake’s intelligent document processing and MotherDuck’s cloud analytics provides a scalable solution for turning unstructured documents into actionable insights.
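
As a minimal sketch of the ESG case above, only the classification config needs to change; the class name, description, and file below are illustrative:
# Illustrative: point the same classification step at ESG disclosures instead
esg_page_classifications = [
    PageClassConfig(
        name="esg_disclosures",
        description="Pages that discuss environmental, social, or governance commitments and metrics."
    )
]

parse_id = doc_ai.classify(
    file_url=f"{BASE_URL}/msft-10k-june-25.pdf",
    page_classifications=esg_page_classifications
)
result = doc_ai.wait_for_completion(parse_id=parse_id)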

Clean Up

When you’re done with this example:
# Deactivate virtual environment
deactivate

# Optional: Remove local JSON files
rm *.json

Next Steps

Now that you have the basics down, try building your own document intelligence pipeline with Tensorlake and MotherDuck today!