Internal Service: this is not an HTTP endpoint. It's called directly by the generate-all orchestrator.

Purpose

Scrapes a website using our Custom Website Mapper (URL discovery) together with the Firecrawl Batch Scrape API (content extraction). Returns an array of pages with markdown content that all downstream services consume. Runs in GROUP 1a (awaited; blocks GROUP 2).

Function Signature

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: int | None = None
) -> dict

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | required | The website URL to scrape |
| is_root | bool | True | Whether this is the root domain |
| max_pages | int or None | None (falls back to PAGE_LIMIT = 5000) | Maximum number of pages to scrape |

Returns

{
  "status": "success",
  "pages": [
    {
      "url": "https://example.com/",
      "markdown": "# Example\n\nThis is the homepage...",
      "metadata": {
        "title": "Example - Home",
        "description": "Welcome to Example"
      }
    },
    {
      "url": "https://example.com/about",
      "markdown": "# About Us\n\nWe are a company...",
      "metadata": {
        "title": "About - Example",
        "description": "Learn about Example"
      }
    }
  ],
  "site_map": ["https://example.com/", "https://example.com/about", ...],
  "page_count": 42
}
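
For reference, a minimal sketch of how a caller might consume this return value. The import path is inferred from the Code Location section below; the error check and variable names are illustrative, not lifted from the orchestrator.

import asyncio

from app.shared.scraping import scrape_website  # exported per Code Location below

async def main() -> None:
    result = await scrape_website("https://example.com", max_pages=100)
    if result["status"] != "success":
        raise RuntimeError(result["error"])
    print(f"Scraped {result['page_count']} pages")
    for page in result["pages"]:
        print(page["url"], "->", page["metadata"]["title"])

asyncio.run(main())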

Scraping Flow

Step 1: Custom Website Mapper (Free)

Our internal mapper layers several strategies to discover URLs (condensed in the sketch after this list):
  1. robots.txt parsing - Extract sitemap directives
  2. sitemap.xml parsing - Parse sitemap and sitemap index files
  3. HTML link extraction - Find links in page content
  4. Recursive crawling - Follow internal links (with depth limit)
  5. URL validation - Filter out 404s and non-HTML pages
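
The sketch below condenses that cascade. All helper names (parse_robots, parse_sitemaps, extract_links, validate_urls) are illustrative stand-ins for the real modules listed under Code Location; only the ordering of strategies comes from this document.

# Hypothetical sketch of the discovery cascade; helper functions are stand-ins.
async def map_website(url: str, max_pages: int) -> list[str]:
    discovered: set[str] = set()
    # 1. robots.txt parsing: collect Sitemap: directives
    sitemap_urls = await parse_robots(url)
    # 2. sitemap.xml parsing: expand sitemaps and sitemap index files
    for sitemap_url in sitemap_urls:
        discovered |= await parse_sitemaps(sitemap_url)
    # 3 + 4. HTML link extraction with depth-limited recursive crawling
    discovered |= await extract_links(url, max_depth=2)  # depth limit is illustrative
    # 5. URL validation: drop 404s and non-HTML responses
    valid = await validate_urls(discovered)
    return sorted(valid)[:max_pages]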

Step 2: Firecrawl Batch Scrape (Paid)

POST /v2/batch/scrape - Submits all discovered URLs as one parallel batch job and returns markdown content for each page.
This split keeps costs down: URL discovery is free (our mapper), and only content extraction consumes Firecrawl credits.
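
A sketch of what the batch_scrape.py wrapper plausibly does, using httpx. The submit-then-poll shape follows Firecrawl's published batch-scrape pattern; exact field names and our wrapper's internals may differ.

import asyncio
import os

import httpx

FIRECRAWL_BATCH = "https://api.firecrawl.dev/v2/batch/scrape"

async def batch_scrape(urls: list[str]) -> list[dict]:
    headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
    async with httpx.AsyncClient(timeout=120) as client:
        # Submit one batch job covering every discovered URL
        resp = await client.post(
            FIRECRAWL_BATCH,
            headers=headers,
            json={"urls": urls, "formats": ["markdown"]},
        )
        resp.raise_for_status()
        job_id = resp.json()["id"]
        # Poll until the job completes, then return the per-page results
        while True:
            status = (await client.get(f"{FIRECRAWL_BATCH}/{job_id}", headers=headers)).json()
            if status.get("status") == "completed":
                return status["data"]
            await asyncio.sleep(2)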

Used By

| Service | Why |
| --- | --- |
| Create AI Website | Needs page content to generate llms.txt and replica pages |
| Business Prompts | Analyzes content to generate relevant prompts |
| Discover Products | Finds products/services mentioned in content |
| Product Prompts | Uses content context for product-specific prompts |

Code Location

src/app/shared/
├── mapping/              # Custom Website Mapper (URL discovery)
│   ├── __init__.py
│   ├── mapper.py         # Main map_website function
│   ├── sitemap_parser.py # Sitemap parsing
│   ├── robots_parser.py  # robots.txt parsing
│   ├── html_extractor.py # HTML link extraction
│   └── url_validator.py  # URL validation (filter 404s)
│
└── scraping/             # Content extraction
    ├── __init__.py       # Exports scrape_website
    ├── service.py        # Main scrape_website function
    ├── batch_scrape.py   # Firecrawl Batch Scrape wrapper
    └── config.py         # PAGE_LIMIT = 5000
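
Putting the two steps together, service.py plausibly reduces to the glue below. This is a sketch under the assumptions above (map_website and batch_scrape as sketched earlier, both names standing in for the real internals); is_root handling is omitted, and the None default for max_pages falls back to PAGE_LIMIT.

# Sketch only; assumes map_website is re-exported from the mapping package.
from app.shared.mapping import map_website
from app.shared.scraping.config import PAGE_LIMIT

async def scrape_website(url: str, is_root: bool = True,
                         max_pages: int | None = None) -> dict:
    limit = max_pages or PAGE_LIMIT     # None -> production limit of 5000
    try:
        site_map = await map_website(url, max_pages=limit)  # Step 1: free URL discovery
        pages = await batch_scrape(site_map)                # Step 2: paid content extraction
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    return {
        "status": "success",
        "pages": pages,
        "site_map": site_map,
        "page_count": len(pages),
    }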

Configuration

# src/app/shared/scraping/config.py
PAGE_LIMIT = 5000       # Production limit
TEST_PAGE_LIMIT = 20    # Used by the pytest suite
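
In tests, the smaller limit is passed explicitly. A hypothetical example (the test name and assertions are illustrative, and the async marker assumes pytest-asyncio is installed):

import pytest

from app.shared.scraping import scrape_website
from app.shared.scraping.config import TEST_PAGE_LIMIT

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_scrape_respects_test_limit():
    result = await scrape_website("https://example.com", max_pages=TEST_PAGE_LIMIT)
    assert result["status"] == "success"
    assert result["page_count"] <= TEST_PAGE_LIMIT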

Error Handling

{
  "status": "error",
  "error": "Firecrawl API timeout after 120s"
}
Common errors:
  • Timeout - Large sites may exceed the 120s timeout
  • Rate limit - Firecrawl has rate limits per API key
  • Invalid URL - URL must be accessible and not blocked
  • No URLs found - Website may block crawlers or have no discoverable pages
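
Because failures surface as a status field rather than an exception, callers should branch on it. A minimal defensive pattern (the retry-on-timeout policy here is illustrative, not the orchestrator's actual behavior):

import asyncio

async def scrape_with_retry(url: str, retries: int = 1) -> dict:
    result = await scrape_website(url)
    # Retry only timeout-style failures; other errors are returned as-is
    while (result["status"] == "error"
           and "timeout" in result["error"].lower()
           and retries > 0):
        retries -= 1
        await asyncio.sleep(5)          # brief backoff before re-running the batch
        result = await scrape_website(url)
    return result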