Internal Service: This is not an HTTP endpoint. It's called directly by the generate-all orchestrator.
Purpose
Scrapes a website using our Custom Website Mapper (for URL discovery) plus the Firecrawl Batch Scrape API (for content extraction). Returns an array of pages with markdown content that all other services use. Runs in GROUP 1a (awaited, blocks GROUP 2).
Function Signature
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | The website URL to scrape |
| is_root | bool | True | Whether this is the root domain |
| max_pages | int | 5000 | Maximum pages to scrape (production default) |
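The parameter table implies a signature along these lines. This is a sketch only; the function and return-type names (`scrape_website`, `ScrapedPage`) are assumptions, not the actual identifiers in the codebase.

```python
from dataclasses import dataclass

@dataclass
class ScrapedPage:
    """One scraped page (hypothetical shape of the returned array items)."""
    url: str       # the page URL
    markdown: str  # extracted markdown content

def scrape_website(url: str, is_root: bool = True, max_pages: int = 5000) -> list[ScrapedPage]:
    """Discover URLs with the Custom Website Mapper, then batch-scrape
    the discovered URLs via Firecrawl. Defaults mirror the table above."""
    ...
```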
Returns
Scraping Flow
Step 1: Custom Website Mapper (Free)
Our internal mapper uses multiple strategies to discover URLs:
- robots.txt parsing - Extract sitemap directives
- sitemap.xml parsing - Parse sitemap and sitemap index files
- HTML link extraction - Find links in page content
- Recursive crawling - Follow internal links (with depth limit)
- URL validation - Filter out 404s and non-HTML pages
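The first two strategies above can be sketched with the standard library. These helper names are illustrative, not the mapper's real API; the actual mapper also crawls HTML links and validates URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Extract Sitemap: directives from a robots.txt body."""
    return [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Collect <loc> entries from a sitemap document.
    Works for both <urlset> and <sitemapindex> roots."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```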
Step 2: Firecrawl Batch Scrape (Paid)
POST /v2/batch/scrape - Scrapes all discovered URLs in parallel, returns markdown content.
This approach is cost-efficient: URL discovery is free (our mapper); only content extraction consumes Firecrawl credits.
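A minimal sketch of the batch submission, assuming a bearer-token header and a `urls` + `formats` request body; the exact payload fields and response shape (job handle vs. inline results) depend on the Firecrawl API version, so treat this as illustrative.

```python
import json
import urllib.request

FIRECRAWL_ENDPOINT = "https://api.firecrawl.dev/v2/batch/scrape"

def build_batch_payload(urls: list, max_pages: int = 5000) -> dict:
    """Cap the discovered URL list at max_pages and request markdown output."""
    return {"urls": urls[:max_pages], "formats": ["markdown"]}

def batch_scrape(urls: list, api_key: str, timeout: float = 120) -> dict:
    # Submit all discovered URLs in one batch job; only this step
    # consumes Firecrawl credits (discovery was free).
    req = urllib.request.Request(
        FIRECRAWL_ENDPOINT,
        data=json.dumps(build_batch_payload(urls)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```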
Used By
| Service | Why |
|---|---|
| Create AI Website | Needs page content to generate llms.txt and replica pages |
| Business Prompts | Analyzes content to generate relevant prompts |
| Discover Products | Finds products/services mentioned in content |
| Product Prompts | Uses content context for product-specific prompts |
Code Location
Configuration
Error Handling
- Timeout - Large sites may exceed the 120s timeout
- Rate limit - Firecrawl has rate limits per API key
- Invalid URL - URL must be accessible and not blocked
- No URLs found - Website may block crawlers or have no discoverable pages
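One way to surface these failure modes to callers is a small error taxonomy. The class and function names here are hypothetical, not the service's actual exception types.

```python
class ScrapeError(Exception):
    """Base class for scrape failures (hypothetical)."""

class ScrapeTimeout(ScrapeError): ...
class RateLimited(ScrapeError): ...
class NoUrlsFound(ScrapeError): ...

def classify_failure(status_code=None, timed_out=False, url_count=0):
    """Map raw failure signals to the cases listed above; None means success."""
    if timed_out:
        return ScrapeTimeout("site exceeded the 120s timeout")
    if status_code == 429:
        return RateLimited("Firecrawl rate limit hit for this API key")
    if url_count == 0:
        return NoUrlsFound("no discoverable pages (crawlers blocked?)")
    return None
```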