Internal Service: scrape_ai_site

Fetches the llms.txt file from our deployed AI-optimized site. For product-specific boosted pages, also fetches the dedicated product llms file.

Function Signature

async def scrape_ai_site(deployment_url: str, product_name: Optional[str] = None) -> dict

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_url | str | The Vercel deployment URL of the AI site |
| product_name | str | Optional. Product name for product-specific pages |

Returns

# Success
{
    "status": "success",
    "llms_content": "...",  # Root llms.txt (business context)
    "product_llms_content": "..." or None,  # Product-specific llms.txt
    "ai_site_url": "https://..."  # Base URL (cleaned)
}

# Error
{
    "status": "error",
    "error": "Could not fetch llms.txt from AI site"
}
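
A caller would typically branch on the status key before using the content. The helper below is a minimal sketch of that check, assuming only the two result shapes shown above; the name unpack_scrape_result is hypothetical and not part of the service.

```python
from typing import Optional, Tuple


def unpack_scrape_result(result: dict) -> Tuple[str, Optional[str]]:
    """Return (llms_content, product_llms_content), or raise on an error result.

    Hypothetical helper: it relies only on the keys documented above.
    """
    if result.get("status") != "success":
        # Error results carry a human-readable message under "error".
        raise RuntimeError(result.get("error", "unknown scrape error"))
    return result["llms_content"], result.get("product_llms_content")
```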

Scalable Product LLMs Architecture

For sites with many products (5,000 or more), we use a scalable file structure:

| File | Path | Purpose |
| --- | --- | --- |
| Root llms.txt | /llms.txt | Business overview, key details, FAQs |
| Product llms | /llms/{product-slug}.txt | Detailed product info, features, pricing |

For Business Boosted Pages:

  • Fetches only /llms.txt
  • Uses general business context

For Product Boosted Pages:

  • Fetches /llms.txt (business context)
  • Also fetches /llms/{product-slug}.txt (product details)
  • Falls back to root llms.txt if product file doesn’t exist
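
The two page types above differ only in which paths are requested and in what happens when the product file is missing. The sketch below illustrates that logic. The slugify rule is an assumption (this page does not say how product_name maps to {product-slug}), and the helper names are hypothetical.

```python
import re
from typing import List, Optional


def slugify(product_name: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", product_name.lower()).strip("-")


def llms_paths(product_name: Optional[str] = None) -> List[str]:
    """Paths to request: root only for business pages, root plus product file otherwise."""
    paths = ["/llms.txt"]
    if product_name:
        paths.append(f"/llms/{slugify(product_name)}.txt")
    return paths


def product_context(llms_content: str, product_llms_content: Optional[str]) -> str:
    """Use the product file when it was found, else fall back to the root llms.txt."""
    return product_llms_content if product_llms_content is not None else llms_content
```

For example, llms_paths("Acme Widget 2000") would request /llms.txt and /llms/acme-widget-2000.txt under this slug rule.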

Behavior

  1. Strips trailing slash from deployment URL
  2. Adds Vercel bypass protection header if VERCEL_BYPASS_PROTECTION_SECRET is set
  3. Fetches /llms.txt with up to 5 retries (8-second delay between retries)
  4. If product_name is provided, also fetches /llms/{product-slug}.txt
  5. Returns the content for use in Step 3
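
The steps above can be sketched roughly as follows. This is not the code in step_2_scrape_ai_site.py: the HTTP call is abstracted behind a hypothetical fetch coroutine (which would also carry the bypass header from step 2), "up to 5 retries" is read as five attempts in total, and the product file is fetched with a single attempt. All of these are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical fetcher type: returns the body text, or None on any failure.
Fetcher = Callable[[str], Awaitable[Optional[str]]]


async def fetch_with_retries(fetch: Fetcher, url: str, attempts: int = 5,
                             delay_seconds: float = 8.0) -> Optional[str]:
    """Try fetch(url) up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(1, attempts + 1):
        content = await fetch(url)
        if content is not None:
            return content
        if attempt < attempts:
            await asyncio.sleep(delay_seconds)
    return None


async def scrape_sketch(fetch: Fetcher, deployment_url: str,
                        product_slug: Optional[str] = None,
                        delay_seconds: float = 8.0) -> dict:
    base_url = deployment_url.rstrip("/")  # step 1 (removes any trailing slashes)
    llms_content = await fetch_with_retries(
        fetch, f"{base_url}/llms.txt", delay_seconds=delay_seconds)  # step 3
    if llms_content is None:
        return {"status": "error", "error": "Could not fetch llms.txt from AI site"}
    product_llms_content = None
    if product_slug:  # step 4; a single attempt here is an assumption
        product_llms_content = await fetch(f"{base_url}/llms/{product_slug}.txt")
    return {
        "status": "success",
        "llms_content": llms_content,
        "product_llms_content": product_llms_content,
        "ai_site_url": base_url,
    }
```

Passing the fetcher in as an argument keeps the retry logic testable without network access; a real fetcher would wrap whichever HTTP client the service uses.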

Why This Architecture?

| Approach | Root llms.txt Size | Consequence |
| --- | --- | --- |
| Everything in root | Grows with products | 5,000 products = massive file |
| Separate files | Stays small (~3KB) | Each product file ~1-2KB |

Benefits:

  • Constant memory - Each boosted page loads ~5KB total, regardless of catalog size
  • No token-limit issues - Per-page content stays far below Gemini’s context window
  • Fast - Small files = fast fetches
  • Independent - Update one product without touching others
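
A quick back-of-the-envelope check of the sizes quoted above, assuming a product entry would take about as much space inside a single root file as it does in its own file (1.5KB is the midpoint of the ~1-2KB range):

```python
ROOT_KB = 3.0       # approximate root llms.txt size quoted above
PRODUCT_KB = 1.5    # assumed midpoint of the ~1-2KB per-product range
PRODUCTS = 5000

# Everything in root: the file grows linearly with the catalog.
single_file_kb = ROOT_KB + PRODUCTS * PRODUCT_KB

# Separate files: each boosted page loads the root file plus one product file.
per_page_kb = ROOT_KB + PRODUCT_KB

print(f"single file: {single_file_kb:.0f}KB, per page: {per_page_kb:.1f}KB")
```

Under these assumptions the single-file approach reaches roughly 7.5MB at 5,000 products, while the per-page load stays at about 4.5KB, consistent with the ~5KB figure above.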

Environment Variables

| Variable | Description |
| --- | --- |
| VERCEL_BYPASS_PROTECTION_SECRET | Optional. Bypasses Vercel deployment protection |
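
A minimal sketch of how the optional secret could be turned into a request header. x-vercel-protection-bypass is the header Vercel documents for protection bypass; that this service sends exactly this header is an assumption, and build_headers is a hypothetical name.

```python
import os
from typing import Dict, Mapping


def build_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return request headers, adding the bypass header only when the secret is set."""
    headers: Dict[str, str] = {}
    secret = env.get("VERCEL_BYPASS_PROTECTION_SECRET")
    if secret:
        # Assumed header name, taken from Vercel's protection bypass documentation.
        headers["x-vercel-protection-bypass"] = secret
    return headers
```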

Code Location

src/app/apis/cron/create_boosted_page/children/step_2_scrape_ai_site.py