
View Source Code

Check out the full source code for this example on GitHub.
This tutorial demonstrates how to build a production-grade web scraper that crawls websites, processes content into clean Markdown, generates embeddings with Voyage AI, and stores them in MongoDB Atlas for use with Atlas Vector Search.

Overview

This application showcases the power of Tensorlake’s parallel processing capabilities:
  1. Parallel Crawling: Uses Breadth-First Search (BFS) with Tensorlake’s .map() to fetch multiple pages concurrently at each depth level.
  2. Headless Browsing: Uses PyDoll to drive a headless Chromium browser and render JavaScript-heavy websites.
  3. Content Cleaning: Converts HTML and PDFs to clean Markdown, automatically removing boilerplate like headers, footers, and ads.
  4. Vector Embeddings: Generates high-quality embeddings for document chunks using Voyage AI.
  5. Vector Search: Stores the processed chunks and embeddings directly in MongoDB Atlas, ready to power RAG applications.

Prerequisites

  • Python 3.11+
  • Tensorlake Account and CLI installed.
  • MongoDB Atlas cluster URI.
  • Voyage AI API Key.

Implementation

The application is defined in a single file, scraper_to_atlas.py, which declares two custom runtime images: one for scraping (with Chromium) and one for embedding (lightweight).

1. Define Dependencies and Images

from tensorlake.applications import Image

# Image with Chromium, pydoll, and dependencies for web scraping
scraper_image = (
    Image(name="scraper-to-atlas-image", base_image="python:3.11.0")
    .env("DEBIAN_FRONTEND", "noninteractive")
    .run("apt-get update && apt-get install -y chromium ...") # System deps
    .run("pip install pydoll-python tensorlake beautifulsoup4 markdownify pymupdf4llm")
)

# Image for embedding and MongoDB operations
embedding_image = (
    Image(name="embedding-image", base_image="python:3.11.0")
    .run("pip install tensorlake voyageai pymongo")
)

2. Main Scraping Logic (scraper_to_atlas.py)

The @application entry point orchestrates the crawling process. It manages the BFS queue and dispatches parallel tasks using fetch_and_convert.map().
@application()
@function(secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def scrape_and_embed(input: ScrapeAndEmbedInput) -> dict:
    # ... setup BFS ...

    # Phase 1: Parallel BFS
    for depth in range(max_depth + 1):
        # ... collect and dedupe URLs to fetch at this depth ...
        
        # Parallel fetch all URLs at this depth level using map()
        results = fetch_and_convert.map(urls_to_fetch)
        
        # ... process results and collect new links ...

    # Phase 2: Process PDFs in parallel
    if pdf_urls:
        pdf_results = fetch_and_convert_pdf.map(list(pdf_urls))

    # Phase 3: Generate embeddings and store
    embed_and_store(all_documents, ...)
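
The ScrapeAndEmbedInput model is referenced above but not shown. A minimal sketch of what it could contain, assuming a Pydantic model and illustrative field names (the exact schema in the full source may differ):

from pydantic import BaseModel

class ScrapeAndEmbedInput(BaseModel):
    # Field names below are assumptions for illustration only
    start_url: str                   # seed URL for the BFS crawl
    max_depth: int = 2               # how many link levels to follow
    db_name: str = "scraper"         # target MongoDB database
    collection_name: str = "pages"   # target MongoDB collection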

3. Page Fetching and Conversion

The fetch_and_convert function runs in the scraper_image and uses PyDoll to render pages.
@function(image=scraper_image, timeout=120, memory=4)
def fetch_and_convert(url: str) -> dict:
    return asyncio.run(_fetch_and_convert_async(url))

async def _fetch_and_convert_async(url: str) -> dict:
    async with Chrome() as browser:
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        # ... extract title and links ...
    
    # Convert to clean markdown
    markdown = _html_to_markdown(html)
    chunks = _chunk_text(markdown)
    return {"url": url, "chunks": chunks, ...}
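
The _html_to_markdown and _chunk_text helpers are not shown above. A minimal sketch of how they might be implemented with the beautifulsoup4 and markdownify packages installed in scraper_image; the stripped tags, chunk size, and overlap are assumptions, not the exact values used in the full source:

from bs4 import BeautifulSoup
from markdownify import markdownify

def _html_to_markdown(html: str) -> str:
    # Remove obvious boilerplate elements before converting to Markdown
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return markdownify(str(soup), heading_style="ATX")

def _chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    # Simple fixed-size chunking with overlap between consecutive chunks
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks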

4. Embedding and Storage

The embed_and_store function runs in the embedding_image and handles interaction with Voyage AI and MongoDB.
import pymongo
import voyageai

@function(image=embedding_image, secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def embed_and_store(documents, mongo_uri, voyage_api_key, ...):
    # Initialize Voyage AI
    vo = voyageai.Client(api_key=voyage_api_key)
    
    # Generate embeddings
    embeddings = vo.embed(texts=[d["text"] for d in documents], model="voyage-4-large").embeddings
    
    # Store in MongoDB
    client = pymongo.MongoClient(mongo_uri)
    collection = client[db_name][col_name]
    collection.insert_many([{...} for ...])
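
Once the chunks and embeddings are in Atlas, they can be retrieved with the $vectorSearch aggregation stage for RAG. A minimal sketch of a query, assuming an Atlas Vector Search index named "vector_index" on an "embedding" field and the same Voyage model used above (the index name, field names, and function signature are assumptions):

import pymongo
import voyageai

def search(query: str, mongo_uri: str, voyage_api_key: str, db_name: str, col_name: str):
    # Embed the query with the same model used for the stored documents
    vo = voyageai.Client(api_key=voyage_api_key)
    result = vo.embed(texts=[query], model="voyage-4-large", input_type="query")
    query_embedding = result.embeddings[0]

    collection = pymongo.MongoClient(mongo_uri)[db_name][col_name]
    results = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",      # assumed index name
                "path": "embedding",          # assumed field holding the vector
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": 5,
            }
        },
        {"$project": {"text": 1, "url": 1, "score": {"$meta": "vectorSearchScore"}}},
    ])
    return list(results)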

Running Locally

  1. Set your environment variables:
    export MONGO_URI="mongodb+srv://..."
    export VOYAGE_API_KEY="voyage-..."
    
  2. Run the application:
    python scraper_to_atlas.py
    

Deploying to Tensorlake

Deploy your scalable scraper to the cloud.
tensorlake secrets set MONGO_URI="mongodb+srv://..."
tensorlake secrets set VOYAGE_API_KEY="voyage-..."
tensorlake deploy scraper_to_atlas.py
Your scraper will now run in the cloud, automatically scaling to handle hundreds of pages in parallel!