
Internal Service: scrape_website

Scrapes all pages from the source website using custom mapper + Firecrawl batch scrape.
Important: In the new architecture, scraped pages are ONLY used for markdown replica generation. LLM content (llms.txt, Q&A, data.json) is generated from business_info via Firecrawl agent.

Function Signature

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: int | None = None  # None falls back to the 5000-page cap
) -> dict

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | `str` | - | Source website URL |
| `is_root` | `bool` | `True` | Whether this is a root domain scrape |
| `max_pages` | `int \| None` | `None` (treated as `5000`) | Maximum pages to scrape |

Returns

{
    "status": "success",
    "pages": [
        {
            "url": "https://example.com/page",
            "markdown": "# Page Title\n\nContent...",
            "metadata": {
                "title": "Page Title",
                "description": "...",
                "ogImage": "..."
            }
        }
    ],
    "site_map": ["https://example.com/", "https://example.com/about", ...],
    "error": None
}

How Pages Are Used

| Component | Uses Pages? | Purpose |
|-----------|-------------|---------|
| llms.txt | ❌ No | Generated from business_info |
| Q&A Pages | ❌ No | Generated from business_info |
| data.json | ❌ No | Generated from business_info |
| Markdown Replicas | ✅ Yes | 1:1 copy of original content |
| Page Hashes | ✅ Yes | For change detection |
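Change detection over scraped pages can be sketched as a content hash per URL. This is a sketch only; the hashing scheme the service actually uses is not specified on this page:

```python
import hashlib

def page_hash(markdown: str) -> str:
    """Stable content hash of a page's markdown (sketch; the real
    service's hashing scheme may differ)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def changed_urls(old_hashes: dict[str, str], pages: list[dict]) -> list[str]:
    """Return URLs whose current markdown hash differs from the stored
    hash (new pages count as changed, since they have no stored hash)."""
    return [
        p["url"]
        for p in pages
        if old_hashes.get(p["url"]) != page_hash(p["markdown"])
    ]
```

An unchanged page produces the same hash, so it is skipped; any edit to the markdown flips the hash and flags the URL.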

Parallel Execution

In the regenerate-fresh-website endpoint, scraping runs in parallel with business info discovery, since neither stage depends on the other's output.
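That concurrency can be sketched with `asyncio.gather`. Both coroutines below are hypothetical stand-ins (`discover_business_info` in particular is an assumed name, not taken from this page):

```python
import asyncio

# Hypothetical stand-in for the scraping stage.
async def scrape_website(url: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the Firecrawl batch scrape
    return {"status": "success", "pages": [], "site_map": [url], "error": None}

# Hypothetical stand-in for business info discovery via the Firecrawl agent.
async def discover_business_info(url: str) -> dict:
    await asyncio.sleep(0)
    return {"name": "Example Co"}

async def regenerate_fresh_website(url: str) -> tuple[dict, dict]:
    # The two stages are independent, so they run concurrently and
    # the endpoint awaits both results at once.
    scrape_result, business_info = await asyncio.gather(
        scrape_website(url),
        discover_business_info(url),
    )
    return scrape_result, business_info

scrape_result, business_info = asyncio.run(
    regenerate_fresh_website("https://example.com")
)
```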

Code Location

src/app/shared/scraping/service.py

Pipeline

  1. Custom Mapper - Discovers all URLs on the site
  2. Firecrawl Batch - Scrapes each URL for markdown content
  3. Returns - List of pages with markdown content
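The three steps above can be sketched end to end. `map_site` and `batch_scrape` are hypothetical stand-ins for the custom mapper and the Firecrawl batch scrape; only the step order and the return shape come from this page:

```python
import asyncio

# Step 1 stand-in: custom mapper discovers all URLs on the site.
async def map_site(url: str) -> list[str]:
    return [f"{url}/", f"{url}/about"]

# Step 2 stand-in: Firecrawl batch scrape yields markdown per URL.
async def batch_scrape(urls: list[str]) -> list[dict]:
    return [{"url": u, "markdown": f"# Page at {u}"} for u in urls]

async def pipeline(url: str) -> dict:
    site_map = await map_site(url)        # 1. discover URLs
    pages = await batch_scrape(site_map)  # 2. scrape markdown content
    # 3. return pages plus the site map in the documented shape
    return {"status": "success", "pages": pages,
            "site_map": site_map, "error": None}

result = asyncio.run(pipeline("https://example.com"))
```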