# Scrape Website

<Note>
  **Internal Service** — This is not an HTTP endpoint. It's called directly by the `generate-all` orchestrator.
</Note>

## Purpose

Scrapes a website in two stages: our **Custom Website Mapper** discovers URLs, then the **Firecrawl Batch Scrape API** extracts content. Returns an array of pages with markdown content that downstream services consume.

Runs in **GROUP 1b** (awaited, blocks GROUP 2).

## Function Signature

```python theme={null}
from typing import Optional

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: Optional[int] = None,  # None falls back to PAGE_LIMIT
) -> dict: ...
```
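A minimal sketch of how an orchestrator step might await this call. The stub below only mirrors the documented signature so the example runs on its own; `run_group_1b` is a hypothetical name, and the real function would be imported from the scraping package instead.

```python
import asyncio
from typing import Optional


async def scrape_website(
    url: str, is_root: bool = True, max_pages: Optional[int] = None
) -> dict:
    """Stand-in stub with the documented signature (not the real implementation)."""
    return {"status": "success", "pages": [], "site_map": [], "page_count": 0}


async def run_group_1b(url: str) -> dict:
    """Hypothetical orchestrator step: await the scrape before GROUP 2 starts."""
    # Awaiting here is what makes the scrape block everything that follows.
    return await scrape_website(url, is_root=True, max_pages=20)


if __name__ == "__main__":
    result = asyncio.run(run_group_1b("https://example.com"))
    print(result["status"], result["page_count"])
```

Because the call is awaited rather than scheduled as a background task, nothing that depends on `pages` can start until the scrape has returned.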

## Parameters

| Parameter   | Type   | Default  | Description                                                                        |
| ----------- | ------ | -------- | ---------------------------------------------------------------------------------- |
| `url`       | `str`  | required | The website URL to scrape                                                          |
| `is_root`   | `bool` | `True`   | Whether `url` is the site's root domain                                            |
| `max_pages` | `int`  | `None`   | Maximum pages to scrape; `None` falls back to `PAGE_LIMIT` (`5000` in production) |

## Returns

```json theme={null}
{
  "status": "success",
  "pages": [
    {
      "url": "https://example.com/",
      "markdown": "# Example\n\nThis is the homepage...",
      "metadata": {
        "title": "Example - Home",
        "description": "Welcome to Example"
      }
    },
    {
      "url": "https://example.com/about",
      "markdown": "# About Us\n\nWe are a company...",
      "metadata": {
        "title": "About - Example",
        "description": "Learn about Example"
      }
    }
  ],
  "site_map": ["https://example.com/", "https://example.com/about", ...],
  "page_count": 42
}
```
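For callers that want type hints over this payload, the shape above could be described roughly as follows. The `TypedDict` names and the `markdown_by_url` helper are illustrative assumptions, not part of the service.

```python
from typing import Dict, List, TypedDict


class PageMetadata(TypedDict, total=False):
    title: str
    description: str


class Page(TypedDict):
    url: str
    markdown: str
    metadata: PageMetadata


class ScrapeResult(TypedDict):
    status: str
    pages: List[Page]
    site_map: List[str]
    page_count: int


def markdown_by_url(result: ScrapeResult) -> Dict[str, str]:
    """Index page markdown by URL for quick lookups in downstream services."""
    return {page["url"]: page["markdown"] for page in result["pages"]}


sample: ScrapeResult = {
    "status": "success",
    "pages": [
        {
            "url": "https://example.com/",
            "markdown": "# Example\n\nThis is the homepage...",
            "metadata": {"title": "Example - Home"},
        },
        {
            "url": "https://example.com/about",
            "markdown": "# About Us\n\nWe are a company...",
            "metadata": {"title": "About - Example"},
        },
    ],
    "site_map": ["https://example.com/", "https://example.com/about"],
    "page_count": 2,
}

lookup = markdown_by_url(sample)
```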

## Scraping Flow

```mermaid theme={null}
sequenceDiagram
    participant Service as Scrape Service
    participant Mapper as Custom Website Mapper
    participant Batch as Firecrawl Batch Scrape
    
    Service->>Mapper: map_website(url, limit=5000)
    Note over Mapper: Multi-strategy discovery
    Mapper-->>Service: {urls: ["...", "..."]}
    Service->>Batch: POST /v2/batch/scrape {urls}
    Note over Batch: Scrapes all URLs in parallel
    Batch-->>Service: {pages: [{url, markdown, metadata}]}
```

### Step 1: Custom Website Mapper (Free)

Our internal mapper uses multiple strategies to discover URLs:

1. **robots.txt parsing** - Extract sitemap directives
2. **sitemap.xml parsing** - Parse sitemap and sitemap index files
3. **HTML link extraction** - Find links in page content
4. **Recursive crawling** - Follow internal links (with depth limit)
5. **URL validation** - Filter out 404s and non-HTML pages
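The first three strategies can be sketched with the standard library alone. This is an illustration of the techniques, not the mapper's actual code: the function names are hypothetical, and the sketch works on already-fetched text so it makes no network calls.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named in 'Sitemap:' directives of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def locs_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> value from a sitemap or sitemap index, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


class _LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html: str, base_url: str) -> List[str]:
    """Return de-duplicated same-host http(s) links, with fragments removed."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urldefrag(urljoin(base_url, href)).url
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

A recursive crawl would call `internal_links` on each fetched page until the depth limit is reached, and the validation step would then drop URLs that return 404 or non-HTML content.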

### Step 2: Firecrawl Batch Scrape (Paid)

`POST /v2/batch/scrape` - Scrapes all discovered URLs in parallel and returns markdown content for each page.

<Info>
  This approach is cost-efficient: URL discovery is free (our mapper), and only content extraction uses Firecrawl credits.
</Info>
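As a rough sketch of what the batch request might look like, the snippet below only builds the HTTP request and does not send it. The base URL, the `formats` field, and the bearer-token header are assumptions based on Firecrawl's public API and should be checked against its current reference. `build_batch_scrape_request` is a hypothetical helper, not the wrapper in `batch_scrape.py`.

```python
import json
import urllib.request
from typing import List

# Assumed public base URL; an internal deployment may differ.
FIRECRAWL_BASE_URL = "https://api.firecrawl.dev"


def build_batch_scrape_request(urls: List[str], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/batch/scrape request for the given URLs."""
    # Assumed body shape: the URLs to scrape plus the desired output format.
    body = json.dumps({"urls": urls, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        f"{FIRECRAWL_BASE_URL}/v2/batch/scrape",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_batch_scrape_request(
    ["https://example.com/", "https://example.com/about"],
    api_key="fc-placeholder-key",  # placeholder, not a real key
)
```

Sending the request with `urllib.request.urlopen(request, timeout=120)` would submit the job. A batch endpoint like this typically answers with a job identifier that is polled for results rather than returning page content directly.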

## Used By

| Service                          | Why                                           |
| -------------------------------- | --------------------------------------------- |
| Create AI Website (GROUP 2a)     | Needs page content for markdown replica pages |
| Generate Product LLMs (GROUP 2c) | Uses page content for product context         |

<Info>
  **Note**: Discover Products (GROUP 1d) now uses Shopify's `products.json` API directly and does NOT use scraped pages.
</Info>

## Code Location

```
src/app/shared/
├── mapping/              # Custom Website Mapper (URL discovery)
│   ├── __init__.py
│   ├── mapper.py         # Main map_website function
│   ├── sitemap_parser.py # Sitemap parsing
│   ├── robots_parser.py  # robots.txt parsing
│   ├── html_extractor.py # HTML link extraction
│   └── url_validator.py  # URL validation (filter 404s)
│
└── scraping/             # Content extraction
    ├── __init__.py       # Exports scrape_website
    ├── service.py        # Main scrape_website function
    ├── batch_scrape.py   # Firecrawl Batch Scrape wrapper
    └── config.py         # PAGE_LIMIT = 5000
```

## Configuration

```python theme={null}
# src/app/shared/scraping/config.py
PAGE_LIMIT = 5000       # Production limit
TEST_PAGE_LIMIT = 20    # Used by the pytest suite
```
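One plausible way these constants feed into `max_pages` is a small resolver like the one below. The helper name and the fallback rule are assumptions made for illustration; the service may resolve the limit differently.

```python
from typing import Optional

PAGE_LIMIT = 5000       # Production limit (value from config.py)
TEST_PAGE_LIMIT = 20    # Limit used in tests (value from config.py)


def resolve_page_limit(max_pages: Optional[int] = None) -> int:
    """Hypothetical helper: use the caller's limit, else the production default."""
    if max_pages is None:
        return PAGE_LIMIT
    if max_pages < 1:
        raise ValueError("max_pages must be a positive integer")
    return max_pages
```

A test would then pass `max_pages=TEST_PAGE_LIMIT` explicitly, while production callers omit the argument.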

## Error Handling

```json theme={null}
{
  "status": "error",
  "error": "Firecrawl API timeout after 120s"
}
```

Common errors:

* **Timeout** - Large sites may exceed the 120s timeout
* **Rate limit** - Firecrawl has rate limits per API key
* **Invalid URL** - URL must be accessible and not blocked
* **No URLs found** - Website may block crawlers or have no discoverable pages
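Since failures come back as a result dict and not as a raised exception, callers need to check `status` before touching `pages`. A minimal sketch of that check follows; `ScrapeError` and `pages_or_raise` are hypothetical names introduced for this example.

```python
from typing import List


class ScrapeError(RuntimeError):
    """Hypothetical exception for a scrape result with status 'error'."""


def pages_or_raise(result: dict) -> List[dict]:
    """Return the scraped pages, or raise ScrapeError carrying the reported message."""
    if result.get("status") == "error":
        raise ScrapeError(result.get("error", "unknown scrape error"))
    return result.get("pages", [])


try:
    pages_or_raise({"status": "error", "error": "Firecrawl API timeout after 120s"})
except ScrapeError as exc:
    message = str(exc)  # the "error" string from the result dict
```

Converting the error dict into an exception at the call site keeps downstream code from silently iterating over a missing `pages` key.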