Internal Service: this is not an HTTP endpoint. It's called directly by the generate-all orchestrator.

Purpose

Scrapes a website using our Custom Website Mapper (URL discovery) together with the Firecrawl Batch Scrape API (content extraction). Returns an array of pages with markdown content that all downstream services consume. Runs in GROUP 1a (awaited; blocks GROUP 2).

Function Signature

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: int | None = None
) -> dict

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | required | The website URL to scrape |
| is_root | bool | True | Whether this is the root domain |
| max_pages | int or None | None (falls back to PAGE_LIMIT = 5000) | Maximum number of pages to scrape |

Returns

{
  "status": "success",
  "pages": [
    {
      "url": "https://example.com/",
      "markdown": "# Example\n\nThis is the homepage...",
      "metadata": {
        "title": "Example - Home",
        "description": "Welcome to Example"
      }
    },
    {
      "url": "https://example.com/about",
      "markdown": "# About Us\n\nWe are a company...",
      "metadata": {
        "title": "About - Example",
        "description": "Learn about Example"
      }
    }
  ],
  "site_map": ["https://example.com/", "https://example.com/about", ...],
  "page_count": 42
}
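
For reference, a minimal sketch of how a caller might consume this return value. The import path is inferred from the Code Location section below; the error check and variable names are illustrative, not lifted from the orchestrator.

import asyncio

from app.shared.scraping import scrape_website  # exported per Code Location below

async def main() -> None:
    result = await scrape_website("https://example.com", max_pages=100)
    if result["status"] != "success":
        raise RuntimeError(result["error"])
    print(f"Scraped {result['page_count']} pages")
    for page in result["pages"]:
        print(page["url"], "->", page["metadata"]["title"])

asyncio.run(main())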

Scraping Flow

Step 1: Custom Website Mapper (Free)

Our internal mapper layers several strategies to discover URLs (condensed in the sketch after this list):
  1. robots.txt parsing - Extract sitemap directives
  2. sitemap.xml parsing - Parse sitemap and sitemap index files
  3. HTML link extraction - Find links in page content
  4. Recursive crawling - Follow internal links (with depth limit)
  5. URL validation - Filter out 404s and non-HTML pages
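
The sketch below condenses that cascade. All helper names (parse_robots, parse_sitemaps, extract_links, validate_urls) are illustrative stand-ins for the real modules listed under Code Location; only the ordering of strategies comes from this document.

# Hypothetical sketch of the discovery cascade; helper functions are stand-ins.
async def map_website(url: str, max_pages: int) -> list[str]:
    discovered: set[str] = set()
    # 1. robots.txt parsing: collect Sitemap: directives
    sitemap_urls = await parse_robots(url)
    # 2. sitemap.xml parsing: expand sitemaps and sitemap index files
    for sitemap_url in sitemap_urls:
        discovered |= await parse_sitemaps(sitemap_url)
    # 3 + 4. HTML link extraction with depth-limited recursive crawling
    discovered |= await extract_links(url, max_depth=2)  # depth limit is illustrative
    # 5. URL validation: drop 404s and non-HTML responses
    valid = await validate_urls(discovered)
    return sorted(valid)[:max_pages]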

Step 2: Firecrawl Batch Scrape (Paid)

POST /v2/batch/scrape - Submits all discovered URLs as one parallel batch job and returns markdown content for each page.
This split keeps costs down: URL discovery is free (our mapper), and only content extraction consumes Firecrawl credits.
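
A sketch of what the batch_scrape.py wrapper plausibly does, using httpx. The submit-then-poll shape follows Firecrawl's published batch-scrape pattern; exact field names and our wrapper's internals may differ.

import asyncio
import os

import httpx

FIRECRAWL_BATCH = "https://api.firecrawl.dev/v2/batch/scrape"

async def batch_scrape(urls: list[str]) -> list[dict]:
    headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
    async with httpx.AsyncClient(timeout=120) as client:
        # Submit one batch job covering every discovered URL
        resp = await client.post(
            FIRECRAWL_BATCH,
            headers=headers,
            json={"urls": urls, "formats": ["markdown"]},
        )
        resp.raise_for_status()
        job_id = resp.json()["id"]
        # Poll until the job completes, then return the per-page results
        while True:
            status = (await client.get(f"{FIRECRAWL_BATCH}/{job_id}", headers=headers)).json()
            if status.get("status") == "completed":
                return status["data"]
            await asyncio.sleep(2)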

Used By

| Service | Why |
| --- | --- |
| Create AI Website | Needs page content to generate llms.txt and replica pages |
| Business Prompts | Analyzes content to generate relevant prompts |
| Discover Products | Finds products/services mentioned in content |
| Product Prompts | Uses content context for product-specific prompts |

Code Location

src/app/shared/
├── mapping/              # Custom Website Mapper (URL discovery)
│   ├── __init__.py
│   ├── mapper.py         # Main map_website function
│   ├── sitemap_parser.py # Sitemap parsing
│   ├── robots_parser.py  # robots.txt parsing
│   ├── html_extractor.py # HTML link extraction
│   └── url_validator.py  # URL validation (filter 404s)
│
└── scraping/             # Content extraction
    ├── __init__.py       # Exports scrape_website
    ├── service.py        # Main scrape_website function
    ├── batch_scrape.py   # Firecrawl Batch Scrape wrapper
    └── config.py         # PAGE_LIMIT = 5000
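
Putting the two steps together, service.py plausibly reduces to the glue below. This is a sketch under the assumptions above (map_website and batch_scrape as sketched earlier, both names standing in for the real internals); is_root handling is omitted, and the None default for max_pages falls back to PAGE_LIMIT.

# Sketch only; assumes map_website is re-exported from the mapping package.
from app.shared.mapping import map_website
from app.shared.scraping.config import PAGE_LIMIT

async def scrape_website(url: str, is_root: bool = True,
                         max_pages: int | None = None) -> dict:
    limit = max_pages or PAGE_LIMIT     # None -> production limit of 5000
    try:
        site_map = await map_website(url, max_pages=limit)  # Step 1: free URL discovery
        pages = await batch_scrape(site_map)                # Step 2: paid content extraction
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    return {
        "status": "success",
        "pages": pages,
        "site_map": site_map,
        "page_count": len(pages),
    }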

Configuration

# src/app/shared/scraping/config.py
PAGE_LIMIT = 5000       # Production limit
TEST_PAGE_LIMIT = 20    # Used by the pytest suite
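
In tests, the smaller limit is passed explicitly. A hypothetical example (the test name and assertions are illustrative, and the async marker assumes pytest-asyncio is installed):

import pytest

from app.shared.scraping import scrape_website
from app.shared.scraping.config import TEST_PAGE_LIMIT

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_scrape_respects_test_limit():
    result = await scrape_website("https://example.com", max_pages=TEST_PAGE_LIMIT)
    assert result["status"] == "success"
    assert result["page_count"] <= TEST_PAGE_LIMIT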

Error Handling

{
  "status": "error",
  "error": "Firecrawl API timeout after 120s"
}
Common errors:
  • Timeout - Large sites may exceed the 120s timeout
  • Rate limit - Firecrawl has rate limits per API key
  • Invalid URL - URL must be accessible and not blocked
  • No URLs found - Website may block crawlers or have no discoverable pages
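
Because failures surface as a status field rather than an exception, callers should branch on it. A minimal defensive pattern (the retry-on-timeout policy here is illustrative, not the orchestrator's actual behavior):

import asyncio

async def scrape_with_retry(url: str, retries: int = 1) -> dict:
    result = await scrape_website(url)
    # Retry only timeout-style failures; other errors are returned as-is
    while (result["status"] == "error"
           and "timeout" in result["error"].lower()
           and retries > 0):
        retries -= 1
        await asyncio.sleep(5)          # brief backoff before re-running the batch
        result = await scrape_website(url)
    return result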