# Scrape Website

> Internal service that scrapes the source website for markdown replicas

# Internal Service: scrape\_website

Scrapes all pages from the source website using a custom mapper to discover URLs, followed by a Firecrawl batch scrape.

<Info>
  **Important**: In the new architecture, scraped pages are used ONLY for **markdown replica generation** and page hashing (change detection). LLM content (llms.txt, Q\&A, data.json) is generated from `business_info` via the Firecrawl agent.
</Info>

## Function Signature

```python theme={null}
from typing import Optional

async def scrape_website(
    url: str,
    is_root: bool = True,
    max_pages: Optional[int] = None
) -> dict
```

## Parameters

| Parameter   | Type | Default | Description                          |
| ----------- | ---- | ------- | ------------------------------------ |
| `url`       | str  | -       | Source website URL                   |
| `is_root`   | bool | True    | Whether this is a root domain scrape |
| `max_pages` | int  | 5000    | Maximum pages to scrape              |

## Returns

```python theme={null}
{
    "status": "success",
    "pages": [
        {
            "url": "https://example.com/page",
            "markdown": "# Page Title\n\nContent...",
            "metadata": {
                "title": "Page Title",
                "description": "...",
                "ogImage": "..."
            }
        }
    ],
    "site_map": ["https://example.com/", "https://example.com/about", ...],
    "error": None
}
```
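
As a minimal sketch of consuming this structure, the helper below reduces a result to a short summary. `summarize_scrape_result` is a hypothetical name, not part of the service, and the assumption that any status other than `"success"` carries a message in `error` is not confirmed by the source.

```python
from typing import Any


def summarize_scrape_result(result: dict[str, Any]) -> dict[str, Any]:
    """Reduce a scrape_website-style result to a small summary dict."""
    if result.get("status") != "success":
        # Assumption: a non-success result carries a message in "error".
        return {
            "ok": False,
            "error": result.get("error"),
            "page_count": 0,
            "titles": [],
        }

    pages = result.get("pages", [])
    # Metadata keys may be missing on some pages, so fall back to "".
    titles = [page.get("metadata", {}).get("title", "") for page in pages]
    return {"ok": True, "error": None, "page_count": len(pages), "titles": titles}


# Sample shaped like the documented return value.
sample = {
    "status": "success",
    "pages": [
        {
            "url": "https://example.com/page",
            "markdown": "# Page Title\n\nContent...",
            "metadata": {"title": "Page Title", "description": "...", "ogImage": "..."},
        }
    ],
    "site_map": ["https://example.com/", "https://example.com/about"],
    "error": None,
}

summary = summarize_scrape_result(sample)
```

Checking `status` before touching `pages` keeps callers from iterating over a partial or empty result.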

## How Pages Are Used

| Component             | Uses Pages? | Purpose                       |
| --------------------- | ----------- | ----------------------------- |
| **llms.txt**          | ❌ No        | Generated from business\_info |
| **Q\&A Pages**        | ❌ No        | Generated from business\_info |
| **data.json**         | ❌ No        | Generated from business\_info |
| **Markdown Replicas** | ✅ Yes       | 1:1 copy of original content  |
| **Page Hashes**       | ✅ Yes       | For change detection          |
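
One way the page-hash row could work is sketched below, assuming SHA-256 over each page's markdown. The actual hash function and storage format are not stated here, and `hash_pages` and `changed_urls` are hypothetical names.

```python
import hashlib


def hash_pages(pages: list[dict]) -> dict[str, str]:
    """Map each page URL to the SHA-256 hex digest of its markdown."""
    return {
        page["url"]: hashlib.sha256(page["markdown"].encode("utf-8")).hexdigest()
        for page in pages
    }


def changed_urls(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose digest differs from the previous run."""
    return sorted(url for url, digest in new.items() if old.get(url) != digest)


# Two scrapes of the same site; only the /about page changed.
previous = hash_pages([
    {"url": "https://example.com/", "markdown": "# Home"},
    {"url": "https://example.com/about", "markdown": "# About"},
])
current = hash_pages([
    {"url": "https://example.com/", "markdown": "# Home"},
    {"url": "https://example.com/about", "markdown": "# About us"},
])

changed = changed_urls(previous, current)
```

Comparing digests rather than full markdown means only a small fixed-size value per page needs to be stored between runs.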

## Parallel Execution

In the regenerate-fresh-website endpoint, scraping runs **in parallel** with business info discovery:

```mermaid theme={null}
flowchart LR
    Request --> BI["Discover Business Info"]
    Request --> Scrape["Scrape Website"]
    
    BI -->|business_info| LLM["LLM Organize"]
    Scrape -->|pages| Replicas["Markdown Replicas"]
    Scrape -->|pages| Hash["Hash Pages"]
```
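
The two branches in the diagram could be run concurrently with `asyncio.gather`, as in the sketch below. Both coroutines are hypothetical stand-ins (`discover_business_info` and `scrape_website_stub` are illustrative names), and the endpoint's real orchestration may differ.

```python
import asyncio


async def discover_business_info(url: str) -> dict:
    # Hypothetical stand-in for business info discovery.
    await asyncio.sleep(0.01)
    return {"name": "Example Co", "url": url}


async def scrape_website_stub(url: str) -> dict:
    # Hypothetical stand-in that returns the documented result shape.
    await asyncio.sleep(0.01)
    return {
        "status": "success",
        "pages": [{"url": url, "markdown": "# Home", "metadata": {}}],
        "site_map": [url],
        "error": None,
    }


async def regenerate(url: str) -> tuple[dict, dict]:
    # gather() schedules both coroutines at once and returns their
    # results in the order the coroutines were passed in.
    business_info, scrape_result = await asyncio.gather(
        discover_business_info(url),
        scrape_website_stub(url),
    )
    return business_info, scrape_result


business_info, scrape_result = asyncio.run(regenerate("https://example.com/"))
```

Because neither branch depends on the other's output, total latency is roughly that of the slower branch rather than the sum of both.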

## Code Location

```
src/app/shared/scraping/service.py
```

### Pipeline

1. **Custom Mapper** - Discovers all URLs on the site
2. **Firecrawl Batch** - Scrapes each URL for markdown content
3. **Returns** - List of pages with markdown content, plus the discovered site map
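
The steps above can be sketched as follows. `map_site` and `batch_scrape` are hypothetical stand-ins for the custom mapper and the Firecrawl batch call, not the real implementations in `service.py`. The `"error"` status value and the truncation of the URL list by `max_pages` are also assumptions.

```python
import asyncio
from typing import Optional


async def map_site(url: str) -> list[str]:
    # Hypothetical stand-in for the custom mapper: returns fixed URLs.
    base = url.rstrip("/")
    return [f"{base}/", f"{base}/about", f"{base}/contact"]


async def batch_scrape(urls: list[str]) -> list[dict]:
    # Hypothetical stand-in for the Firecrawl batch scrape.
    return [{"url": u, "markdown": f"# {u}", "metadata": {"title": u}} for u in urls]


async def scrape_pipeline(url: str, max_pages: Optional[int] = None) -> dict:
    try:
        # Step 1: discover every URL on the site.
        site_map = await map_site(url)
        # Assumption: max_pages caps how many discovered URLs are scraped.
        targets = site_map if max_pages is None else site_map[:max_pages]
        # Step 2: scrape each target URL for markdown.
        pages = await batch_scrape(targets)
        # Step 3: return pages in the documented result shape.
        return {"status": "success", "pages": pages, "site_map": site_map, "error": None}
    except Exception as exc:
        # Assumption: failures are reported in the result, not raised.
        return {"status": "error", "pages": [], "site_map": [], "error": str(exc)}


result = asyncio.run(scrape_pipeline("https://example.com", max_pages=2))
```

In this sketch `site_map` keeps every discovered URL even when `max_pages` limits what is scraped, so callers can see what was skipped.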
