Check out the full source code for this example on GitHub.
This tutorial demonstrates how to build a production-grade web scraper that crawls websites, processes page content into clean Markdown, generates embeddings with Voyage AI, and stores them in a MongoDB Atlas collection for use with Atlas Vector Search.
The entire application lives in a single file, scraper_to_atlas.py, which defines two custom runtime images: a heavier one for scraping (with Chromium installed) and a lightweight one for embedding.
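To make the "process content into clean Markdown" step concrete, here is a minimal sketch of what such a conversion involves. This is not the tutorial's actual converter (the real code likely relies on a dedicated library); it is a stdlib-only illustration handling just headings, paragraphs, and links:

```python
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    """Minimal HTML-to-Markdown converter: h1-h3 headings, paragraphs, links."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # Heading level maps directly to the number of '#' characters
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.parts.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    return "".join(converter.parts).strip()
```

A production converter also has to deal with tables, code blocks, nested lists, and boilerplate removal, which is why scraping pipelines usually delegate this step to a purpose-built library.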
The @application entry point orchestrates the crawl: it manages the BFS queue and dispatches parallel fetch tasks via fetch_and_convert.map().
```python
@application()
@function(secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def scrape_and_embed(input: ScrapeAndEmbedInput) -> dict:
    # ... setup BFS ...

    # Phase 1: Parallel BFS
    for depth in range(max_depth + 1):
        # ... deduce URLs to fetch ...
        # Parallel fetch all URLs at this depth level using map()
        results = fetch_and_convert.map(urls_to_fetch)
        # ... process results and collect new links ...

    # Phase 2: Process PDFs in parallel
    if pdf_urls:
        pdf_results = fetch_and_convert_pdf.map(list(pdf_urls))

    # Phase 3: Generate embeddings and store
    embed_and_store(all_documents, ...)
```
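Setting the framework-specific map() call aside, the BFS pattern itself can be sketched with the standard library. In this sketch, fetch_page is a hypothetical stand-in for fetch_and_convert that returns the page text and the links it found:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urldefrag, urljoin

def crawl_bfs(start_url, fetch_page, max_depth=2, max_workers=8):
    """Breadth-first crawl: fetch each depth level in parallel, dedupe URLs.

    fetch_page(url) -> (text, links_found) is supplied by the caller.
    """
    seen = {start_url}
    frontier = [start_url]
    documents = []
    for depth in range(max_depth + 1):
        if not frontier:
            break
        # Fetch every URL at this depth concurrently
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(fetch_page, frontier))
        next_frontier = []
        for url, (text, links) in zip(frontier, results):
            documents.append({"url": url, "text": text})
            for link in links:
                # Resolve relative links and drop #fragments before deduping
                absolute, _ = urldefrag(urljoin(url, link))
                if absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return documents
```

Normalizing URLs before the `seen` check (resolving relative paths, stripping fragments) is what keeps a BFS crawl from fetching the same page twice under different spellings.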
The embed_and_store function runs in the embedding_image and handles interaction with Voyage AI and MongoDB.
```python
@function(image=embedding_image, secrets=["VOYAGE_API_KEY", "MONGO_URI"])
def embed_and_store(documents, mongo_uri, voyage_api_key, ...):
    # Initialize Voyage AI
    vo = voyageai.Client(api_key=voyage_api_key)

    # Generate embeddings
    embeddings = vo.embed(texts=[d["text"] for d in documents], model="voyage-4-large")

    # Store in MongoDB
    client = pymongo.MongoClient(mongo_uri)
    collection = client[db_name][col_name]
    collection.insert_many([{...} for ...])
```
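Embedding APIs typically cap how many texts one call may contain, so a large crawl should be embedded in batches rather than in a single request. The helper below is a generic sketch (the batch size of 128 is an assumed limit, not one documented by Voyage AI); embed_fn stands in for whatever client call produces the vectors:

```python
def batched(items, batch_size):
    """Yield successive slices so each embed call stays under the API's limit."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(documents, embed_fn, batch_size=128):
    """Call embed_fn on document texts in batches and return one flat
    list of vectors aligned index-for-index with the input documents."""
    vectors = []
    for batch in batched([d["text"] for d in documents], batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Because the output list stays aligned with the input documents, each vector can be zipped back onto its source document before the insert_many call.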