# Analyzing AI Risk Disclosures in SEC Filings with Tensorlake & MotherDuck

Track how AI risk disclosures evolved across Microsoft, Google, and Meta from 2021-2025 by parsing 40 SEC filings, extracting structured risk data, and running SQL analytics in MotherDuck.

## Track AI Risk Evolution Across Tech Companies

In this example, you’ll build a document analytics pipeline that processes SEC filings from major tech companies to track how AI risk disclosures evolved from 2021 to 2025. You’ll learn how to:
- Use Tensorlake’s Page Classification to identify risk factor pages with VLMs
- Extract structured data from only relevant pages using Pydantic schemas
- Load parsed document data into MotherDuck for SQL analytics
- Query trends, compare companies, and discover emerging risk patterns
## The Challenge
Major tech companies file lengthy SEC reports (100-200+ pages) quarterly. AI-related risk disclosures are scattered throughout these documents, making manual analysis time-consuming and prone to missing critical information.
## Our Solution
We’ll analyze 40 SEC filings from Microsoft, Google, and Meta spanning 2021-2025 to:
- Use VLMs to identify pages containing AI risk factors (reducing processing from ~200 pages to ~20 per document)
- Extract structured risk data from only relevant pages
- Store and analyze trends in MotherDuck’s cloud data warehouse
- Uncover emerging AI risk patterns and regulatory concerns
## Prerequisites

## Build Your Document Analytics Pipeline

### Set up your environment

The `tensorlake` package includes DocumentAI for parsing, while `duckdb` provides the MotherDuck connector.

```shell
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the necessary packages
pip install --upgrade tensorlake duckdb==1.3.2
```

Set environment variables for authentication:

```shell
export TENSORLAKE_API_KEY="your_tensorlake_api_key"
export MOTHERDUCK_TOKEN="your_motherduck_token"
```

Or create a `.env` file:

```
TENSORLAKE_API_KEY=your_tensorlake_api_key
MOTHERDUCK_TOKEN=your_motherduck_token
```
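If you use a `.env` file, the variables still need to reach `os.environ` before the clients read them. The `python-dotenv` package handles this robustly; a minimal stdlib sketch of the same idea (the `load_env` helper is ours, not part of Tensorlake or DuckDB):

```python
import os

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ (minimal sketch)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already exported in the shell
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()  # run before creating DocumentAI() or duckdb.connect()
```

Call it once at the top of your script, before the first API client is constructed.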
### Prepare your imports

```python
import os
import json
import duckdb
import pandas as pd
from tensorlake.documentai import DocumentAI, PageClassConfig, StructuredExtractionOptions
from pydantic import BaseModel, Field
from typing import List, Optional
```
We’ll analyze SEC filings from three AI leaders. These URLs point to 10-K (annual) and 10-Q (quarterly) filings:

```python
# Company configurations
AI_COMPANIES = {
    "MSFT": "Microsoft",
    "GOOGL": "Alphabet",
    "META": "Meta Platforms"
}

BASE_URL = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev"

FILING_URLS = {
    "GOOGL": [
        f"{BASE_URL}/goog-10q-june-25.pdf",
        f"{BASE_URL}/goog-10q-march-25.pdf",
        f"{BASE_URL}/goog-10k-december-24.pdf",
        f"{BASE_URL}/goog-10q-september-24.pdf",
        f"{BASE_URL}/goog-10q-june-24.pdf",
        # ... more filings
    ],
    "MSFT": [
        f"{BASE_URL}/msft-10k-june-25.pdf",
        f"{BASE_URL}/msft-10q-march-25.pdf",
        # ... more filings
    ],
    "META": [
        f"{BASE_URL}/meta-10q-june-25.pdf",
        f"{BASE_URL}/meta-10q-march-25.pdf",
        # ... more filings
    ]
}
```
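The filing URLs follow a consistent `<company>-<form>-<month>-<yy>.pdf` naming scheme, which makes filing metadata easy to recover from the filename alone. A small sketch (the `filing_metadata` helper is our own, not part of either library):

```python
import re

# Matches the naming scheme used above, e.g. "goog-10q-june-25.pdf"
FILENAME_RE = re.compile(r"(?P<co>[a-z]+)-(?P<form>10[kq])-(?P<month>[a-z]+)-(?P<yy>\d{2})\.pdf")

def filing_metadata(url: str) -> dict:
    """Derive company prefix, form type, and period from a filing URL."""
    name = url.rsplit("/", 1)[-1]
    m = FILENAME_RE.fullmatch(name)
    if not m:
        raise ValueError(f"unrecognized filing name: {name}")
    return {
        "company": m["co"].upper(),
        "form": m["form"].upper().replace("10K", "10-K").replace("10Q", "10-Q"),
        "period": f"{m['month'].title()} 20{m['yy']}",
    }
```

This comes in handy later if you want to attach filing dates to results without re-parsing the documents.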
## Step 1: Classify Risk Factor Pages with VLMs

Using Tensorlake’s Vision Language Models, we’ll scan all filings to identify pages containing AI-related risk factors. This typically reduces processing from ~200 pages to ~20-30 relevant pages per document:

```python
doc_ai = DocumentAI()

# Define what pages to identify
page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    )
]

# Store classified pages for each document
document_ai_risk_pages = {}

# Classify all filings
for company in AI_COMPANIES:
    for file_url in FILING_URLS[company]:
        print(f"Classifying: {file_url}")

        # Classify pages
        parse_id = doc_ai.classify(
            file_url=file_url,
            page_classifications=page_classifications
        )

        # Wait for completion
        result = doc_ai.wait_for_completion(parse_id=parse_id)

        # Extract risk factor page numbers
        for page_class in result.page_classes:
            if page_class.page_class == "risk_factors":
                document_ai_risk_pages[file_url] = page_class.page_numbers
                print(f"  Found risk pages: {page_class.page_numbers}")
```
### Review classification results

Let’s examine which pages were identified as containing AI risk factors:

```python
for file_url, page_numbers in document_ai_risk_pages.items():
    filename = os.path.basename(file_url)
    print(f"{filename}: Pages {page_numbers}")
```
You’ll notice variation across companies and time periods—newer filings often have more pages dedicated to AI risks.
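To quantify that variation, counting flagged pages per filing is a one-line aggregation over the classification results (the `sample_risk_pages` dict below is illustrative; in the pipeline you would use `document_ai_risk_pages`):

```python
# Illustrative sample of classification output: file URL -> flagged page numbers
sample_risk_pages = {
    "https://example.com/msft-10k-june-25.pdf": [14, 15, 16, 17],
    "https://example.com/goog-10q-june-24.pdf": [22, 23],
}

# Count flagged pages per filing
page_counts = {
    url.rsplit("/", 1)[-1]: len(pages)
    for url, pages in sample_risk_pages.items()
}

# Print filings with the most risk pages first
for name, count in sorted(page_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count} risk pages")
```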
## Step 2: Define Extraction Schemas

We’ll extract structured data about AI risks, including categories (Operational, Regulatory, Competitive, etc.), descriptions, and severity indicators:
```python
class AIRiskMention(BaseModel):
    """Individual AI-related risk mention"""
    risk_category: str = Field(
        description="Category: Operational, Regulatory, Competitive, Ethical, Security, Liability"
    )
    risk_description: str = Field(description="Description of the AI risk")
    severity_indicator: Optional[str] = Field(None, description="Severity level if mentioned")
    citation: str = Field(description="Page reference")

class AIRiskExtraction(BaseModel):
    """Complete AI risk data from a filing"""
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention] = []
    num_ai_risk_mentions: int = 0
    ai_strategy_mentioned: bool = False
    ai_investment_mentioned: bool = False
    ai_competition_mentioned: bool = False
    regulatory_ai_risk: bool = False
```
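For reference, a hypothetical record conforming to this schema might look like the following (all values invented for illustration):

```json
{
  "company_name": "Microsoft",
  "ticker": "MSFT",
  "filing_type": "10-K",
  "filing_date": "2025-06-30",
  "fiscal_year": "2025",
  "fiscal_quarter": null,
  "ai_risk_mentioned": true,
  "ai_risk_mentions": [
    {
      "risk_category": "Operational",
      "risk_description": "AI features may be costly to develop and may not perform as intended.",
      "severity_indicator": null,
      "citation": "Page 24"
    }
  ],
  "num_ai_risk_mentions": 1,
  "ai_strategy_mentioned": true,
  "ai_investment_mentioned": false,
  "ai_competition_mentioned": true,
  "regulatory_ai_risk": false
}
```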
## Step 3: Extract Structured Data from Classified Pages

Now we extract detailed AI risk information from only the classified pages. This targeted approach processes roughly 15% of the pages while still capturing the relevant risk disclosures:
```python
results = {}

for file_url, page_numbers in document_ai_risk_pages.items():
    print(f"Extracting from: {file_url}")

    # Convert page numbers to a comma-separated string
    page_range = ",".join(str(i) for i in page_numbers)
    print(f"  Processing pages: {page_range}")

    # Parse and extract structured data
    result = doc_ai.parse_and_wait(
        file=file_url,
        page_range=page_range,
        structured_extraction_options=[
            StructuredExtractionOptions(
                schema_name="AIRiskExtraction",
                json_schema=AIRiskExtraction
            )
        ]
    )

    results[file_url] = result
    print(f"  ✓ Extracted {len(result.structured_data[0].data.get('ai_risk_mentions', []))} risk mentions")
```
## Step 4: Export Results to JSON

Export each filing’s risk data to JSON files for loading into MotherDuck:
```python
json_files = []

for file_url, result in results.items():
    if result.structured_data:
        # Extract filename from URL
        filename = os.path.basename(file_url)  # e.g. "msft-10k-june-25.pdf"
        json_filename = filename.replace('.pdf', '.json')

        # Write to file
        with open(json_filename, 'w') as f:
            json.dump(result.structured_data[0].data, f, indent=2, default=str)

        json_files.append(json_filename)
        print(f"✓ Saved to {json_filename}")
```
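Before loading, it can be worth sanity-checking that each exported file parses and carries the fields the warehouse table will rely on. A small sketch (`check_export` and the `REQUIRED` set are our own; the field list is a minimal subset of the schema, not an exhaustive one):

```python
import json

# Top-level fields the MotherDuck table will rely on (subset of AIRiskExtraction)
REQUIRED = {"company_name", "ticker", "filing_type", "ai_risk_mentions"}

def check_export(path: str) -> bool:
    """Return True if the JSON file parses and has the required top-level fields."""
    with open(path) as f:
        data = json.load(f)
    return REQUIRED <= data.keys()
```

Running this over `json_files` before the load step catches truncated or malformed exports early, instead of surfacing them as missing columns in SQL.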
## Step 5: Load Data into MotherDuck

Create a cloud-based data warehouse table in MotherDuck to enable fast SQL analytics across all filings:

```python
# Connect to MotherDuck
con = duckdb.connect('md:ai_risk_factors')

# Drop existing table if present
con.execute("DROP TABLE IF EXISTS ai_risk_filings")

# Prepare rows for loading
rows = []
for filename in json_files:
    with open(filename, 'r') as f:
        data = json.load(f)
    # Convert the nested array to a JSON string for storage
    data['ai_risk_mentions'] = json.dumps(data.get('ai_risk_mentions', []))
    rows.append(data)

# Convert to a DataFrame and create the table
df = pd.DataFrame(rows)
con.execute("CREATE TABLE ai_risk_filings AS SELECT * FROM df")

# Verify the data
result = con.execute("SELECT * FROM ai_risk_filings").fetchdf()
print(f"Loaded {len(result)} filings into MotherDuck")
print(result.head())
```
## Step 6: Analyze Risk Trends with SQL

Now the real power emerges: run SQL analytics on your document data to uncover insights.

### Query 1: Risk Category Distribution

Understand the breakdown of AI risk categories across all companies:
```python
# Extract all risk mentions from the JSON column
all_risks = []
for _, row in con.execute("SELECT company_name, ai_risk_mentions FROM ai_risk_filings").fetchdf().iterrows():
    risks = json.loads(row['ai_risk_mentions'])
    if not risks:
        continue
    for risk in risks:
        all_risks.append({
            'company_name': row['company_name'],
            'risk_category': risk.get('risk_category'),
            'risk_description': risk.get('risk_description')
        })

risks_df = pd.DataFrame(all_risks)

# Count by category
risk_categories = risks_df.groupby('risk_category').agg(
    total_mentions=('risk_category', 'count'),
    companies_mentioning=('company_name', 'nunique')
).reset_index().sort_values('total_mentions', ascending=False)

print("Risk Category Distribution:")
print(risk_categories)
```
Expected Output:

```text
  risk_category  total_mentions  companies_mentioning
0   Operational              37                     3
1       Ethical              28                     3
2    Regulatory              26                     3
3   Competitive               7                     2
4      Security               4                     2
5     Liability               3                     1
```
### Query 2: Deep Dive into Operational Risks

Extract the most detailed operational risk descriptions from each company:
```python
# json_extract_string returns unquoted VARCHAR values, so the category
# comparison and the printed descriptions come out without JSON quotes
operational_risks_df = con.execute("""
    SELECT
        company_name,
        ticker,
        json_extract_string(risk.value, '$.risk_description') AS risk_description,
        json_extract_string(risk.value, '$.citation') AS citation
    FROM
        ai_risk_filings,
        json_each(ai_risk_mentions) AS risk
    WHERE
        ai_risk_mentions IS NOT NULL
        AND ai_risk_mentions != '[]'
        AND json_extract_string(risk.value, '$.risk_category') = 'Operational'
""").fetchdf()

# Get the longest (most detailed) operational risk per company
operational_risks_df['description_length'] = operational_risks_df['risk_description'].apply(
    lambda x: len(x) if pd.notna(x) else 0
)

top_operational = (
    operational_risks_df
    .sort_values('description_length', ascending=False)
    .groupby('company_name')
    .head(1)
    .reset_index(drop=True)
)

# Print results
for _, row in top_operational.iterrows():
    print(f"\n{row['company_name']} ({row['ticker']}):")
    print("-" * 100)
    print(f"Citation: {row['citation']}")
    print(row['risk_description'][:500])  # First 500 chars
    print()
```
## Key Insights Discovered

Through this analysis pipeline, we’ve:

- Processed 40 SEC filings (~6,000+ total pages)
- Identified and extracted AI risk disclosures from relevant pages
- Built a queryable database of AI risk evolution from 2021-2025
Emerging Trends:

- Operational risks dominate (37 mentions) - All three companies express concerns about AI infrastructure costs, development challenges, and potential misuse of AI systems
- Ethical considerations intensifying (28 mentions) - Growing focus on bias, harmful content, and societal impact, particularly around generative AI
- Regulatory landscape evolving rapidly - 2025 filings show increased mentions of specific regulations (EU AI Act, US AI Executive Order)
- New risk categories emerging in 2025:
  - Liability risks - Meta explicitly discussing third-party misuse of open-source AI
  - Intellectual property concerns - Copyright and training data issues becoming prominent
  - Energy dependencies - Companies highlighting reliance on computing power
- Risk disclosure volume increasing - Average risk mentions per filing grew from 2.0 in 2022 to 7.0 in 2024
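The volume trend is just a grouped average of `num_ai_risk_mentions` over `fiscal_year`; a stdlib sketch of the computation (the sample pairs below are illustrative, not the real per-filing counts):

```python
from collections import defaultdict

# Illustrative (fiscal_year, num_ai_risk_mentions) pairs, one per filing
filings = [("2022", 1), ("2022", 3), ("2024", 6), ("2024", 8)]

# Group mention counts by fiscal year
by_year = defaultdict(list)
for year, mentions in filings:
    by_year[year].append(mentions)

# Average mentions per filing, per year
avg_mentions = {year: sum(v) / len(v) for year, v in sorted(by_year.items())}
print(avg_mentions)
```

The same aggregation can of course be expressed in SQL against `ai_risk_filings` with `GROUP BY fiscal_year`.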
Company-Specific Patterns:
- Microsoft: Most comprehensive risk disclosures (55 total mentions), heavy focus on operational (19) and ethical (17) risks
- Meta: Balanced concern across operational (16) and regulatory (16) risks, unique focus on open-source AI liability
- Alphabet: More measured disclosures (10 total), but showing acceleration in 2025
## Adapt This Pipeline for Your Use Case
This pipeline can be adapted for any document analysis need:
- ESG disclosures - Track sustainability commitments and progress
- Financial metrics tracking - Extract KPIs across earnings reports
- Competitive intelligence - Monitor competitor product launches and strategies
- Regulatory compliance monitoring - Alert on new compliance requirements
The combination of Tensorlake’s intelligent document processing and MotherDuck’s cloud analytics provides a scalable solution for turning unstructured documents into actionable insights.
## Clean Up

When you’re done with this example:

```shell
# Deactivate the virtual environment
deactivate

# Optional: remove local JSON files
rm *.json
```
## Next Steps

Now that you have the basics down, explore these resources:

Try building your own document intelligence pipeline with Tensorlake and MotherDuck today!