**View Source Code**: Check out the full source code for this example on GitHub.
## Overview
This application showcases Tensorlake's parallel processing capabilities:

- Parallel Crawling: Uses Breadth-First Search (BFS) with Tensorlake's `.map()` to fetch multiple pages concurrently at each depth level.
- Headless Browsing: Uses PyDoll (built on Chromium) to render JavaScript-heavy websites.
- Content Cleaning: Converts HTML and PDFs to clean Markdown, automatically removing boilerplate such as headers, footers, and ads.
- Vector Embeddings: Generates high-quality embeddings for document chunks using Voyage AI.
- Vector Search: Stores the processed chunks and embeddings directly in MongoDB Atlas for RAG applications.
## Prerequisites

- Python 3.11+
- A Tensorlake account with the CLI installed
- A MongoDB Atlas cluster connection URI
- A Voyage AI API key
## Implementation

The application is defined in a single file, `scraper_to_atlas.py`. It defines two custom runtime images: one for scraping (with Chromium) and one for embedding (lightweight).
### 1. Define Dependencies and Images
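A minimal sketch of how the two images might be declared is below. The `Image` builder methods (`name()`, `run()`) and the package lists are assumptions based on the SDK's builder pattern; see the full source on GitHub for the authoritative definitions.

```python
from tensorlake.applications import Image

# Heavy image for the scraping stage: Chromium is required by PyDoll.
# (Package names here are illustrative assumptions.)
scraper_image = (
    Image()
    .name("scraper-image")
    .run("apt-get update && apt-get install -y chromium")
    .run("pip install pydoll-python markdownify")
)

# Lightweight image for the embedding stage: only the Voyage AI and
# MongoDB client libraries are needed.
embedding_image = (
    Image()
    .name("embedding-image")
    .run("pip install voyageai pymongo")
)
```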
### 2. Main Scraping Logic (`scraper_to_atlas.py`)
The `@application` entry point orchestrates the crawling process. It manages the BFS queue and dispatches parallel tasks using `fetch_and_convert.map()`.
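A sketch of that orchestration follows, assuming the `@application`/`@function` decorators and `.map()` fan-out described above; the exact signatures and link-deduplication logic are simplified illustrations, not the example's definitive code.

```python
from tensorlake.applications import application, function

@application()
@function(image=scraper_image)  # scraper_image defined in step 1
def crawl_site(seed_url: str, max_depth: int = 2) -> int:
    """BFS crawl: fetch every URL at the current depth in parallel."""
    visited = {seed_url}
    frontier = [seed_url]
    pages_processed = 0

    for _depth in range(max_depth + 1):
        if not frontier:
            break
        # Fan out one fetch_and_convert task per URL at this depth level.
        pages = fetch_and_convert.map(frontier)

        next_frontier = []
        for page in pages:
            pages_processed += 1
            # Hand the cleaned Markdown to the embedding stage (step 4).
            embed_and_store(page["url"], page["markdown"])
            # Queue unseen links for the next depth level.
            for link in page["links"]:
                if link not in visited:
                    visited.add(link)
                    next_frontier.append(link)
        frontier = next_frontier

    return pages_processed
```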
### 3. Page Fetching and Conversion
The `fetch_and_convert` function runs in the `scraper_image` and uses PyDoll to render pages.
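The sketch below uses PyDoll's async API as I understand it (`Chrome`, `browser.start()`, `tab.go_to()`, `tab.page_source`); import paths and property names may differ between PyDoll versions. The Markdown conversion uses `markdownify` as a stand-in for whatever cleaning pipeline the example actually ships.

```python
import asyncio
import re

from markdownify import markdownify
from pydoll.browser.chromium import Chrome  # import path may vary by version
from tensorlake.applications import function

@function(image=scraper_image)
def fetch_and_convert(url: str) -> dict:
    # The Tensorlake function is sync here, so drive the async browser locally.
    return asyncio.run(_render_and_clean(url))

async def _render_and_clean(url: str) -> dict:
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to(url)           # renders JavaScript before reading the DOM
        html = await tab.page_source   # assumed async page-source accessor

    # Stand-in cleaning step: a real implementation would strip headers,
    # footers, and ads before (or while) converting to Markdown.
    markdown = markdownify(html)
    links = re.findall(r'href="(https?://[^"]+)"', html)
    return {"url": url, "markdown": markdown, "links": links}
```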
### 4. Embedding and Storage
The `embed_and_store` function runs in the `embedding_image` and handles the interaction with Voyage AI and MongoDB.
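A sketch is below. The Voyage AI client (`voyageai.Client().embed(...)`) and the `pymongo` calls are standard, but the chunking strategy, the `voyage-3` model choice, the environment-variable name, and the database/collection names are assumptions.

```python
import os

import voyageai
from pymongo import MongoClient
from tensorlake.applications import function

@function(image=embedding_image)
def embed_and_store(url: str, markdown: str) -> int:
    # Naive fixed-size chunking; the real example may split more carefully.
    chunks = [markdown[i:i + 2000] for i in range(0, len(markdown), 2000)]
    if not chunks:
        return 0

    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
    result = vo.embed(chunks, model="voyage-3", input_type="document")

    client = MongoClient(os.environ["MONGODB_ATLAS_URI"])  # assumed var name
    collection = client["scraper"]["chunks"]
    collection.insert_many([
        {"url": url, "text": chunk, "embedding": embedding}
        for chunk, embedding in zip(chunks, result.embeddings)
    ])
    return len(chunks)
```

With an Atlas Vector Search index defined on the `embedding` field, these documents can then be queried directly for RAG retrieval.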
## Running Locally
1. Set your environment variables:
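For example (`VOYAGE_API_KEY` is what the Voyage AI client reads by default; the other variable names are assumptions and should match whatever the script actually reads):

```bash
export TENSORLAKE_API_KEY="<your-tensorlake-key>"   # assumed variable name
export VOYAGE_API_KEY="<your-voyage-key>"
export MONGODB_ATLAS_URI="mongodb+srv://<user>:<password>@<cluster>/"
```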
2. Run the application:
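Assuming the script includes a `__main__` entry point that invokes the application with a seed URL (the argument shown is illustrative; the full source documents the exact invocation):

```bash
python scraper_to_atlas.py https://example.com
```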