
Internal Service: scrape_website

Scrapes all pages from the source website using custom mapper + Firecrawl batch scrape.
Important: In the new architecture, scraped pages are ONLY used for markdown replica generation. LLM content (llms.txt, Q&A, data.json) is generated from business_info via Firecrawl agent.

Function Signature

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: int | None = None  # None falls back to the 5000-page cap
) -> dict

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | `str` | - | Source website URL |
| `is_root` | `bool` | `True` | Whether this is a root domain scrape |
| `max_pages` | `int \| None` | `None` (treated as `5000`) | Maximum pages to scrape |

Returns

{
    "status": "success",
    "pages": [
        {
            "url": "https://example.com/page",
            "markdown": "# Page Title\n\nContent...",
            "metadata": {
                "title": "Page Title",
                "description": "...",
                "ogImage": "..."
            }
        }
    ],
    "site_map": ["https://example.com/", "https://example.com/about", ...],
    "error": None
}

How Pages Are Used

| Component | Uses Pages? | Purpose |
|-----------|-------------|---------|
| llms.txt | ❌ No | Generated from business_info |
| Q&A Pages | ❌ No | Generated from business_info |
| data.json | ❌ No | Generated from business_info |
| Markdown Replicas | ✅ Yes | 1:1 copy of original content |
| Page Hashes | ✅ Yes | For change detection |
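Change detection over scraped pages can be sketched as a content hash per URL. This is a sketch only; the hashing scheme the service actually uses is not specified on this page:

```python
import hashlib

def page_hash(markdown: str) -> str:
    """Stable content hash of a page's markdown (sketch; the real
    service's hashing scheme may differ)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def changed_urls(old_hashes: dict[str, str], pages: list[dict]) -> list[str]:
    """Return URLs whose current markdown hash differs from the stored
    hash (new pages count as changed, since they have no stored hash)."""
    return [
        p["url"]
        for p in pages
        if old_hashes.get(p["url"]) != page_hash(p["markdown"])
    ]
```

An unchanged page produces the same hash, so it is skipped; any edit to the markdown flips the hash and flags the URL.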

Parallel Execution

In the regenerate-fresh-website endpoint, scraping runs in parallel with business info discovery, since neither stage depends on the other's output.
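That concurrency can be sketched with `asyncio.gather`. Both coroutines below are hypothetical stand-ins (`discover_business_info` in particular is an assumed name, not taken from this page):

```python
import asyncio

# Hypothetical stand-in for the scraping stage.
async def scrape_website(url: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the Firecrawl batch scrape
    return {"status": "success", "pages": [], "site_map": [url], "error": None}

# Hypothetical stand-in for business info discovery via the Firecrawl agent.
async def discover_business_info(url: str) -> dict:
    await asyncio.sleep(0)
    return {"name": "Example Co"}

async def regenerate_fresh_website(url: str) -> tuple[dict, dict]:
    # The two stages are independent, so they run concurrently and
    # the endpoint awaits both results at once.
    scrape_result, business_info = await asyncio.gather(
        scrape_website(url),
        discover_business_info(url),
    )
    return scrape_result, business_info

scrape_result, business_info = asyncio.run(
    regenerate_fresh_website("https://example.com")
)
```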

Code Location

src/app/shared/scraping/service.py

Pipeline

  1. Custom Mapper - Discovers all URLs on the site
  2. Firecrawl Batch - Scrapes each URL for markdown content
  3. Returns - List of pages with markdown content
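The three steps above can be sketched end to end. `map_site` and `batch_scrape` are hypothetical stand-ins for the custom mapper and the Firecrawl batch scrape; only the step order and the return shape come from this page:

```python
import asyncio

# Step 1 stand-in: custom mapper discovers all URLs on the site.
async def map_site(url: str) -> list[str]:
    return [f"{url}/", f"{url}/about"]

# Step 2 stand-in: Firecrawl batch scrape yields markdown per URL.
async def batch_scrape(urls: list[str]) -> list[dict]:
    return [{"url": u, "markdown": f"# Page at {u}"} for u in urls]

async def pipeline(url: str) -> dict:
    site_map = await map_site(url)        # 1. discover URLs
    pages = await batch_scrape(site_map)  # 2. scrape markdown content
    # 3. return pages plus the site map in the documented shape
    return {"status": "success", "pages": pages,
            "site_map": site_map, "error": None}

result = asyncio.run(pipeline("https://example.com"))
```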