Internal Service: scrape_website
Scrapes all pages from the source website using a custom mapper plus Firecrawl batch scrape.
Important: In the new architecture, scraped pages are ONLY used for markdown replica generation. LLM content (llms.txt, Q&A, data.json) is generated from business_info via Firecrawl agent.
Function Signature
async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: int | None = None
) -> dict
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | - | Source website URL |
| `is_root` | `bool` | `True` | Whether this is a root domain scrape |
| `max_pages` | `int \| None` | `None` | Maximum pages to scrape; `None` falls back to 5000 |
Returns
{
"status": "success",
"pages": [
{
"url": "https://example.com/page",
"markdown": "# Page Title\n\nContent...",
"metadata": {
"title": "Page Title",
"description": "...",
"ogImage": "..."
}
}
],
"site_map": ["https://example.com/", "https://example.com/about", ...],
"error": None
}
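A minimal sketch of consuming this return value downstream. The dict shape follows the structure documented above; `extract_markdown_pages` is an illustrative helper name, not part of the service's API.

```python
# Illustrative helper (not part of the service API): map each scraped URL
# to its markdown for replica generation, raising if the scrape failed.
def extract_markdown_pages(result: dict) -> dict:
    if result["status"] != "success":
        raise RuntimeError(result.get("error") or "scrape failed")
    return {page["url"]: page["markdown"] for page in result["pages"]}

# Sample result matching the documented shape
result = {
    "status": "success",
    "pages": [{"url": "https://example.com/page",
               "markdown": "# Page Title\n\nContent...",
               "metadata": {"title": "Page Title"}}],
    "site_map": ["https://example.com/"],
    "error": None,
}
pages = extract_markdown_pages(result)
```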
How Pages Are Used
| Component | Uses Pages? | Purpose |
|---|---|---|
| llms.txt | ❌ No | Generated from business_info |
| Q&A Pages | ❌ No | Generated from business_info |
| data.json | ❌ No | Generated from business_info |
| Markdown Replicas | ✅ Yes | 1:1 copy of original content |
| Page Hashes | ✅ Yes | For change detection |
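A minimal sketch of how page hashes can drive change detection, assuming a content hash over each page's markdown. The actual hashing scheme in the service may differ.

```python
import hashlib

def page_hash(markdown: str) -> str:
    # Stable content hash of a page's markdown; equal markdown yields an
    # equal hash, so a changed hash signals the page needs re-replication.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

old = page_hash("# Page Title\n\nContent...")
new = page_hash("# Page Title\n\nUpdated content...")
changed = old != new  # a changed hash triggers replica regeneration
```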
Parallel Execution
In the regenerate-fresh-website endpoint, scraping runs in parallel with business info discovery.
Code Location
src/app/shared/scraping/service.py
Pipeline
- Custom Mapper - discovers all URLs on the site
- Firecrawl Batch - scrapes each URL for markdown content
- Returns - a list of pages with their markdown content
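The pipeline above can be sketched as two awaited stages. `map_site` and `batch_scrape` are hypothetical stand-ins for the custom mapper and the Firecrawl batch-scrape call; the real implementations live in src/app/shared/scraping/service.py.

```python
import asyncio

# Hypothetical stand-in for the custom mapper (URL discovery)
async def map_site(url: str) -> list:
    return [url, url.rstrip("/") + "/about"]

# Hypothetical stand-in for the Firecrawl batch scrape (markdown per URL)
async def batch_scrape(urls: list) -> list:
    return [{"url": u, "markdown": f"# {u}", "metadata": {}} for u in urls]

async def scrape_pipeline(url: str, max_pages: int = 5000) -> dict:
    urls = (await map_site(url))[:max_pages]   # 1. custom mapper discovers URLs
    pages = await batch_scrape(urls)           # 2. batch scrape returns markdown
    return {"status": "success", "pages": pages, "site_map": urls, "error": None}

result = asyncio.run(scrape_pipeline("https://example.com"))
```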