# Step 2: Scrape AI Site

> Internal service that fetches llms.txt from the deployed AI site

# Internal Service: scrape\_ai\_site

Fetches the `llms.txt` file from our deployed AI-optimized site. For product-specific AI articles, also fetches the dedicated product llms file.

## Function Signature

```python
async def scrape_ai_site(deployment_url: str, product_name: str | None = None) -> dict
```

## Parameters

| Parameter        | Type        | Description                                                        |
| ---------------- | ----------- | ------------------------------------------------------------------ |
| `deployment_url` | str         | The Vercel deployment URL of the AI site                           |
| `product_name`   | str or None | Optional (default `None`). Product name for product-specific pages |

## Returns

```python
# Success
{
    "status": "success",
    "llms_content": "...",  # Root llms.txt (business context)
    "product_llms_content": "..." or None,  # Product-specific llms.txt
    "ai_site_url": "https://..."  # Base URL (cleaned)
}

# Error
{
    "status": "error",
    "error": "Could not fetch llms.txt from AI site"
}
```
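
A caller might consume this return value roughly as follows. This is a minimal sketch, not the service's actual code: `unpack_scrape_result` is a hypothetical helper name, and raising `RuntimeError` on failure is an assumption about how the caller wants to handle errors.

```python
from typing import Optional, Tuple


def unpack_scrape_result(result: dict) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split a scrape_ai_site result into its two texts.

    Returns (llms_content, product_llms_content). The second item is None
    when no product-specific file was fetched.
    """
    if result.get("status") != "success":
        # The error shape documented above carries a human-readable message.
        raise RuntimeError(result.get("error", "scrape_ai_site failed"))
    return result["llms_content"], result.get("product_llms_content")
```

Checking `status` first keeps the rest of the pipeline from ever reading `llms_content` out of an error payload, where that key is absent.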

## Scalable Product LLMs Architecture

For sites with many products (5,000 or more), we use a scalable file structure:

| File          | Path                       | Purpose                                  |
| ------------- | -------------------------- | ---------------------------------------- |
| Root llms.txt | `/llms.txt`                | Business overview, key details, FAQs     |
| Product llms  | `/llms/{product-slug}.txt` | Detailed product info, features, pricing |

### For Business AI Articles

* Fetches only `/llms.txt`
* Uses general business context

### For Product AI Articles

* Fetches `/llms.txt` (business context)
* Also fetches `/llms/{product-slug}.txt` (product details)
* Falls back to the root `llms.txt` alone if the product file doesn't exist
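
The two cases above differ only in which paths are requested. The sketch below illustrates that mapping. Both `slugify` and `llms_paths` are hypothetical names, and the slug rule (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) is an assumption; this page does not specify how `{product-slug}` is derived.

```python
import re
from typing import List, Optional


def slugify(product_name: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", product_name.lower())
    return slug.strip("-")


def llms_paths(product_name: Optional[str] = None) -> List[str]:
    """Return the paths to fetch for a business or a product article."""
    paths = ["/llms.txt"]  # always fetched: business context
    if product_name:
        # Product articles additionally request the per-product file.
        paths.append(f"/llms/{slugify(product_name)}.txt")
    return paths


print(llms_paths())               # business article
print(llms_paths("Acme Widget"))  # product article
```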

## Behavior

1. Strips the trailing slash from the deployment URL
2. Adds the Vercel bypass protection header if `VERCEL_BYPASS_PROTECTION_SECRET` is set
3. Fetches `/llms.txt` with up to 5 retries (8-second delay between retries)
4. If `product_name` is provided, also fetches `/llms/{product-slug}.txt`
5. Returns the content for use in Step 3
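
The URL cleanup and retry loop in steps 1 and 3 can be sketched as below. This is an illustration under stated assumptions, not the code in the service: `fetch_root_llms` and the injectable `fetch` callable are hypothetical (the HTTP client the service uses is not named on this page), and "up to 5 retries" is read here as five attempts in total.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical fetcher type: returns the body text, or None on any failure.
Fetcher = Callable[[str], Awaitable[Optional[str]]]


async def fetch_root_llms(
    deployment_url: str,
    fetch: Fetcher,
    max_attempts: int = 5,      # assumption: five attempts in total
    delay_seconds: float = 8.0,
) -> dict:
    base_url = deployment_url.rstrip("/")  # step 1: drop trailing slash
    for attempt in range(1, max_attempts + 1):
        content = await fetch(f"{base_url}/llms.txt")  # step 3
        if content:
            return {
                "status": "success",
                "llms_content": content,
                "product_llms_content": None,
                "ai_site_url": base_url,
            }
        if attempt < max_attempts:
            # Wait before the next attempt, but not after the last one.
            await asyncio.sleep(delay_seconds)
    return {"status": "error", "error": "Could not fetch llms.txt from AI site"}
```

Passing the fetcher in as an argument keeps the retry logic testable without network access; a real caller would wrap whichever async HTTP client the project already uses.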

## Why This Architecture?

| Approach           | Root llms.txt Size  | Consequence                   |
| ------------------ | ------------------- | ----------------------------- |
| Everything in root | Grows with products | 5,000 products = massive file |
| Separate files     | Stays small (\~3KB) | Each product file \~1-2KB     |

Benefits:

* **Constant memory** - Each AI article loads \~5KB total, regardless of how many products the site has
* **Within token limits** - The loaded content stays far below Gemini's context window
* **Fast** - Small files = fast fetches
* **Independent** - Update one product without touching others
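
A quick back-of-the-envelope calculation shows the difference between the two approaches. The figures come from the table above (root about 3 KB, product files about 1-2 KB); the 1.5 KB midpoint and the function names are assumptions made for illustration.

```python
ROOT_KB = 3.0      # approximate root llms.txt size, from the table above
PRODUCT_KB = 1.5   # assumed midpoint of the ~1-2 KB per-product range


def single_file_kb(product_count: int) -> float:
    """Size of one root file that also contains every product."""
    return ROOT_KB + product_count * PRODUCT_KB


def per_article_kb() -> float:
    """Content loaded per product article with separate files."""
    return ROOT_KB + PRODUCT_KB


print(single_file_kb(5000))  # everything in root, 5,000 products
print(per_article_kb())      # separate files, any number of products
```

Under these assumptions a single combined file for 5,000 products would be roughly 7.5 MB, while the per-article load with separate files stays at a few kilobytes.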

## Environment Variables

| Variable                          | Description                                     |
| --------------------------------- | ----------------------------------------------- |
| `VERCEL_BYPASS_PROTECTION_SECRET` | Optional. Bypasses Vercel deployment protection |
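
Building the request headers from this variable might look like the sketch below. `build_headers` is a hypothetical helper. The header name `x-vercel-protection-bypass` is the one Vercel documents for Protection Bypass for Automation; that the service sends exactly this header is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def build_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return request headers, adding the bypass header only when configured."""
    env = os.environ if env is None else env
    headers: Dict[str, str] = {}
    secret = env.get("VERCEL_BYPASS_PROTECTION_SECRET")
    if secret:
        # Assumed header name, per Vercel's Protection Bypass for Automation.
        headers["x-vercel-protection-bypass"] = secret
    return headers
```

Because the variable is optional, an unset or empty value simply yields no extra header, and requests go out unmodified.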

## Code Location

```
src/app/apis/cron/create_boosted_page/children/step_2_scrape_ai_site.py
```
