# Web Scraper to MongoDB Atlas

> Build a scalable web scraper that stores vector embeddings in MongoDB Atlas.

<Card title="View Source Code" icon="github" href="https://github.com/tensorlakeai/cookbooks/tree/main/web-scraper">
  Check out the full source code for this example on GitHub.
</Card>

This tutorial builds a **Web Scraper** that crawls websites, converts each page to clean Markdown, generates embeddings with Voyage AI, and stores them in MongoDB Atlas, where they can be queried with Atlas Vector Search.

## Overview

The application relies on Tensorlake's parallel processing and has five parts:

1. **Parallel Crawling**: Uses Breadth-First Search (BFS) with Tensorlake's `.map()` to fetch multiple pages concurrently at each depth level.
2. **Headless Browsing**: Uses **PyDoll**, which drives a Chromium browser, to render JavaScript-heavy websites.
3. **Content Cleaning**: Converts HTML and PDFs to clean Markdown, removing boilerplate such as headers, footers, and ads.
4. **Vector Embeddings**: Generates high-quality embeddings for document chunks using **Voyage AI**.
5. **Vector Search**: Stores the processed chunks and embeddings directly into **MongoDB Atlas** for RAG applications.

## Prerequisites

* **Python 3.11+**
* **Tensorlake Account** and CLI installed.
* **MongoDB Atlas** cluster URI.
* **Voyage AI** API Key.

## Implementation

The application is defined in a single file, `scraper_to_atlas.py`. It defines two custom runtime images: one for scraping (with Chromium) and one for embedding (lightweight).

### 1. Define Dependencies and Images

```python theme={null}
from tensorlake.applications import Image

# Image with Chromium, pydoll, and dependencies for web scraping
scraper_image = (
    Image(name="scraper-to-atlas-image", base_image="python:3.11.0")
    .env("DEBIAN_FRONTEND", "noninteractive")
    .run("apt-get update && apt-get install -y chromium ...") # System deps
    .run("pip install pydoll-python tensorlake beautifulsoup4 markdownify pymupdf4llm")
)

# Image for embedding and MongoDB operations
embedding_image = (
    Image(name="embedding-image", base_image="python:3.11.0")
    .run("pip install tensorlake voyageai pymongo")
)
```

### 2. Main Scraping Logic (`scraper_to_atlas.py`)

The `@application` entry point orchestrates the crawling process. It manages the BFS queue and dispatches parallel tasks using `fetch_and_convert.map()`.

```python theme={null}
from tensorlake.applications import application, function


@application()
@function(secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def scrape_and_embed(input: ScrapeAndEmbedInput) -> dict:
    # ... setup BFS ...

    # Phase 1: Parallel BFS, one level of the link graph per iteration
    for depth in range(max_depth + 1):
        # ... determine which URLs to fetch at this depth ...

        # Parallel fetch all URLs at this depth level using map()
        results = fetch_and_convert.map(urls_to_fetch)

        # ... process results and collect new links ...

    # Phase 2: Process PDFs in parallel
    if pdf_urls:
        pdf_results = fetch_and_convert_pdf.map(list(pdf_urls))

    # Phase 3: Generate embeddings and store
    embed_and_store(all_documents, ...)
```
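The level-by-level crawl can be sketched without any Tensorlake or browser dependencies. This is a minimal illustration, not the cookbook's code: `FAKE_SITE`, `fetch_links`, `normalize`, and `crawl` are hypothetical names, the same-host filter is an assumption, and the built-in `map()` is a sequential stand-in for `fetch_and_convert.map()`.

```python
from urllib.parse import urldefrag, urlparse

# Hypothetical in-memory "site" standing in for real page fetches.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b#top"],
    "https://example.com/a": ["https://example.com/b", "https://other.org/x"],
    "https://example.com/b": ["https://example.com/", "https://example.com/c"],
    "https://example.com/c": [],
}


def fetch_links(url: str) -> dict:
    """Stand-in for fetch_and_convert: returns the page's outgoing links."""
    return {"url": url, "links": FAKE_SITE.get(url, [])}


def normalize(url: str) -> str:
    """Drop the #fragment so the same page is not fetched twice."""
    return urldefrag(url).url


def crawl(start_url: str, max_depth: int) -> list[str]:
    """Level-by-level BFS restricted to the start URL's host (an assumption)."""
    host = urlparse(start_url).netloc
    visited: set[str] = set()
    frontier = [normalize(start_url)]
    order: list[str] = []

    for depth in range(max_depth + 1):
        # Deduplicate the frontier while preserving discovery order.
        urls_to_fetch = [u for u in dict.fromkeys(frontier) if u not in visited]
        if not urls_to_fetch:
            break
        visited.update(urls_to_fetch)
        order.extend(urls_to_fetch)

        # In the tutorial this step is fetch_and_convert.map(urls_to_fetch);
        # the built-in map() used here runs sequentially.
        results = list(map(fetch_links, urls_to_fetch))

        # The links found at this level become the next level's frontier.
        frontier = [
            normalize(link)
            for result in results
            for link in result["links"]
            if urlparse(link).netloc == host
        ]
    return order


pages = crawl("https://example.com/", max_depth=1)
```

Because every URL at a given depth is known before any of them is fetched, each level is a natural unit to hand to a parallel `map`; only the step between levels is sequential.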

### 3. Page Fetching and Conversion

The `fetch_and_convert` function runs in the `scraper_image` and uses PyDoll to render pages.

```python theme={null}
import asyncio


@function(image=scraper_image, timeout=120, memory=4)
def fetch_and_convert(url: str) -> dict:
    # Synchronous entry point: run the async browser session to completion
    return asyncio.run(_fetch_and_convert_async(url))

async def _fetch_and_convert_async(url: str) -> dict:
    async with Chrome() as browser:
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        # ... extract title and links ...
    
    # Convert to clean markdown
    markdown = _html_to_markdown(html)
    chunks = _chunk_text(markdown)
    return {"url": url, "chunks": chunks, ...}
```
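The body of `_chunk_text` is not shown above. One plausible approach, sketched here under assumptions, is to pack whole paragraphs greedily into chunks of bounded size and hard-split any paragraph that is too long on its own. The function name `chunk_text` and the `max_chars` default are hypothetical and may differ from the cookbook's implementation.

```python
def chunk_text(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in markdown.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split a paragraph that would not fit in a chunk by itself.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The piece does not fit: close the current chunk, start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


example = chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=10)
```

Splitting on paragraph boundaries keeps related sentences together, which tends to give more coherent embeddings than fixed-width slicing. A token-based limit would match the embedding model's constraints more closely than a character count.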

### 4. Embedding and Storage

The `embed_and_store` function runs in the `embedding_image` and handles interaction with Voyage AI and MongoDB.

```python theme={null}
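import pymongo
import voyageai


@function(image=embedding_image, secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def embed_and_store(documents, mongo_uri, voyage_api_key, ...):
    # Initialize Voyage AI
    vo = voyageai.Client(api_key=voyage_api_key)

    # Generate embeddings; embed() returns a result object, and its
    # .embeddings attribute holds one vector per input text
    result = vo.embed(texts=[d["text"] for d in documents], model="voyage-4-large")
    embeddings = result.embeddings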
    
    # Store in MongoDB
    client = pymongo.MongoClient(mongo_uri)
    collection = client[db_name][col_name]
    collection.insert_many([{...} for ...])
```
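For larger crawls it can help to send texts to the embedding API in batches and then pair each chunk with its vector before inserting. The sketch below shows that shape with a toy embedder so it runs without network access. The helper names, the record fields (`url`, `text`, `embedding`), and the default batch size are all assumptions rather than the cookbook's actual schema or the provider's documented limits.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 128,  # assumed value; check your provider's limits
) -> list[list[float]]:
    """Call embed_fn once per batch and concatenate the vectors in order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def build_records(chunks: list[dict], vectors: list[list[float]]) -> list[dict]:
    """Pair each chunk with its vector. Field names here are hypothetical."""
    if len(chunks) != len(vectors):
        raise ValueError("expected exactly one vector per chunk")
    return [
        {"url": chunk["url"], "text": chunk["text"], "embedding": vector}
        for chunk, vector in zip(chunks, vectors)
    ]


# Toy embedder for demonstration only: a one-dimensional "vector" = text length.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


chunks = [
    {"url": "https://example.com/", "text": "hello"},
    {"url": "https://example.com/a", "text": "hi"},
    {"url": "https://example.com/b", "text": "hey"},
]
vectors = embed_in_batches([c["text"] for c in chunks], fake_embed, batch_size=2)
records = build_records(chunks, vectors)
```

With a real client, `embed_fn` could wrap the Voyage AI call and return its `.embeddings` list, and `records` is the kind of list that would be passed to `collection.insert_many()`.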

## Running Locally

1. Set your environment variables:
   ```bash theme={null}
   export MONGO_URI="mongodb+srv://..."
   export VOYAGE_API_KEY="voyage-..."
   ```

2. Run the application:
   ```bash theme={null}
   python scraper_to_atlas.py
   ```
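A local run fails late and confusingly if either variable is missing, so a small preflight check can be useful. This is an optional sketch, not part of the cookbook; `missing_env` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("MONGO_URI", "VOYAGE_API_KEY")


def missing_env(names=REQUIRED_VARS, environ=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```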

## Deploying to Tensorlake

Store the two secrets in Tensorlake, then deploy the application:

```bash theme={null}
tl secrets set MONGO_URI="mongodb+srv://..."
tl secrets set VOYAGE_API_KEY="voyage-..."
tl app deploy scraper_to_atlas.py
```

Your scraper now runs on Tensorlake, where each `.map()` call fans out across containers so that the pages at a given depth are fetched in parallel.
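Once the collection is populated, the stored chunks can be retrieved with an Atlas `$vectorSearch` aggregation stage. The sketch below only builds the pipeline. It assumes a vector index named `vector_index` over an `embedding` field, and documents with `url` and `text` fields; all of these names are hypothetical and must match your own index and schema.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    limit: int = 5,
    index_name: str = "vector_index",  # hypothetical index name
    path: str = "embedding",           # hypothetical embedding field
) -> list[dict]:
    """Build an aggregation pipeline that returns the `limit` nearest chunks."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # Consider more candidates than results to improve recall.
                "numCandidates": min(limit * 20, 10_000),
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "url": 1,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
# With a live cluster: results = list(collection.aggregate(pipeline))
```

The query vector should come from the same embedding model used at ingestion time so that both sides share one vector space.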
