Internal Service — This is not an HTTP endpoint. It’s called directly by the generate-all orchestrator.

Purpose

Creates an AI-optimized website with llms.txt, robots.txt, sitemap.xml, structured data, and markdown replica pages. Deploys to Vercel and assigns a *.searchcompany.dev subdomain. Runs in GROUP 2a (parallel with other tasks after scrape completes).

Function Signature

async def create_ai_website(
    url: str,
    business_id: str,
    force_refresh: bool = False,
    pages: Optional[List[dict]] = None,
    max_pages: Optional[int] = None
) -> dict

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | `str` | required | The business website URL |
| `business_id` | `str` | required | The Clerk organization slug |
| `force_refresh` | `bool` | `False` | Force regeneration even if the site already exists |
| `pages` | `Optional[List[dict]]` | `None` | Pre-scraped pages from GROUP 1a |
| `max_pages` | `Optional[int]` | `None` (falls back to 5000) | Max pages to scrape if `pages` is not provided |

Returns

{
  "status": "success",
  "ai_site_url": "https://my-business-abc123.searchcompany.dev",
  "entity_id": "uuid-...",
  "deployment_url": "https://my-business-abc123-xyz.vercel.app",
  "qa_slugs": ["what-is-business-name", "how-does-business-work"],
  "replica_paths": ["/about", "/pricing", "/contact"]
}

File Generation: Deterministic vs LLM-Generated

The AI website consists of two types of files:

Deterministic Files (Generated by Code)

These files are created programmatically without LLM involvement:
| File | Source | Description |
| --- | --- | --- |
| `robots.txt` | `static_templates.py` | Standard robots.txt with AI bot allowances |
| `sitemap.xml` | `static_templates.py` | Generated from Q&A slugs + replica paths |
| `pages/*.js` | `generate_files.py` | Markdown replica pages (1:1 from scraped content) |
| `_app.js` | `html_generators.py` | Next.js app wrapper |
| `_document.js` | `html_generators.py` | Next.js document with meta tags |
| `package.json` | `generate_files.py` | Next.js dependencies |
| `next.config.js` | `generate_files.py` | Next.js configuration |
| `vercel.json` | `generate_files.py` | Vercel deployment config |
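The deterministic files are straightforward templating. As an illustration, sitemap.xml can be assembled directly from the Q&A slugs and replica paths; this is a hedged sketch, not the actual `static_templates.py` API (the function name and signature here are assumptions):

```python
# Hypothetical sketch of deterministic sitemap generation from Q&A slugs and
# replica paths. Names are illustrative, not the real static_templates.py API.
from typing import List

def build_sitemap_xml(base_url: str, qa_slugs: List[str], replica_paths: List[str]) -> str:
    """Emit a minimal sitemap.xml covering the homepage, Q&A pages, and replicas."""
    paths = [""] + [f"/{slug}" for slug in qa_slugs] + replica_paths
    urls = "\n".join(f"  <url><loc>{base_url}{p}</loc></url>" for p in paths)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n"
        "</urlset>"
    )

sitemap = build_sitemap_xml(
    "https://my-business-abc123.searchcompany.dev",
    ["what-is-business-name"],
    ["/about", "/pricing"],
)
```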

LLM-Generated Files (Created by Gemini)

These files are generated by three parallel Gemini calls:
| File | Gemini Call | Description |
| --- | --- | --- |
| `llms.txt` | Call 1 | AI-readable summary of the business (Markdown) |
| `pages/index.js` | Call 2 | Homepage with Q&A navigation |
| `pages/[slug].js` | Call 2 | Individual Q&A pages (8-15 pages) |
| `data.json` | Call 3 | Schema.org JSON-LD structured data |

Pipeline

The Three Gemini Calls

All three calls run in parallel using asyncio.gather():

Call 1: llms.txt Generation

  • Input: All scraped pages (markdown content)
  • Output: Comprehensive AI-readable summary (500-1500 words)
  • Prompt: build_llms_txt_prompt()

Call 2: Homepage + Q&A Pages

  • Input: All scraped pages + AI site URL
  • Output: JSON with homepage structure + 8-15 Q&A pages
  • Prompt: build_index_html_prompt()

Call 3: Schema.org data.json

  • Input: All scraped pages + source URL
  • Output: JSON-LD structured data
  • Prompt: build_data_json_prompt()
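The parallel fan-out can be sketched as follows. This is an illustrative shape only: `call_gemini` is a stand-in for the real `llm_organize.py` client, and the prompt strings are placeholders for the three prompt builders named above:

```python
# Illustrative sketch of the three parallel Gemini calls via asyncio.gather().
# call_gemini is a stub standing in for the real Gemini client.
import asyncio

async def call_gemini(prompt: str) -> str:
    await asyncio.sleep(0)  # placeholder for the real API round-trip
    return f"response for: {prompt.splitlines()[0]}"

async def generate_site_content(pages_md: str, site_url: str, source_url: str):
    # The three calls share the scraped pages but differ in prompt and extras.
    llms_txt, homepage, data_json = await asyncio.gather(
        call_gemini(f"llms.txt prompt\n{pages_md}"),                 # Call 1
        call_gemini(f"homepage prompt\n{pages_md}\n{site_url}"),     # Call 2
        call_gemini(f"data.json prompt\n{pages_md}\n{source_url}"),  # Call 3
    )
    return llms_txt, homepage, data_json

results = asyncio.run(
    generate_site_content("# About...", "https://x.searchcompany.dev", "https://example.com")
)
```

Because the calls are independent, total latency is roughly the slowest single call rather than the sum of all three.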

LLMs.txt Structure

# Business Name

> One-line description of the business

## About
Detailed description of what the business does...

## Products & Services
- Product A: Description
- Product B: Description

## Key Features
- Feature 1
- Feature 2

## Frequently Asked Questions - [Business Name] - About
- What is [Business Name]? [Link to Q&A page]
- How does [Business Name] work? [Link to Q&A page]
...

## Contact
- Website: https://example.com
- Email: [email protected]

Markdown Replica Pages

For each scraped page, a markdown replica is created at `/{slug}`:

  • Source: https://example.com/about
  • Replica: https://my-business.searchcompany.dev/about

These replicas:
  • Preserve the original content in markdown format
  • Are optimized for AI crawlers
  • Include structured metadata
  • Have collision detection (adds 4-char suffix if slug conflicts with Q&A page)
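The collision rule can be sketched as below. This is a hedged illustration: the exact slug derivation and suffix scheme in `generate_files.py` may differ (here the suffix is the first 4 hex chars of a SHA-1 of the source URL):

```python
# Hedged sketch of replica slug derivation with collision handling: if a
# replica slug collides with an existing Q&A slug, append a 4-char suffix.
# The real suffix scheme in generate_files.py may differ.
import hashlib
from urllib.parse import urlparse

def replica_slug(source_url: str, taken: set) -> str:
    path = urlparse(source_url).path.strip("/") or "home"
    slug = path.replace("/", "-")
    if slug in taken:  # collides with a Q&A page (or earlier replica)
        suffix = hashlib.sha1(source_url.encode()).hexdigest()[:4]
        slug = f"{slug}-{suffix}"
    taken.add(slug)
    return slug

taken = {"about"}  # e.g. an existing Q&A page slug
colliding = replica_slug("https://example.com/about", taken)   # e.g. "about-1a2b"
clean = replica_slug("https://example.com/pricing", taken)
```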

Product LLMs Architecture (Scalable)

Product-specific llms files are generated by GROUP 3b (Generate Product LLMs) which runs in parallel with GROUP 3a (Product Prompts) after products are discovered.

File Structure

| File | When Created | Purpose |
| --- | --- | --- |
| `/llms.txt` | GROUP 2a (Create AI Website) | Business overview, key details, FAQs |
| `/llms/{product-slug}.txt` | GROUP 3b (Generate Product LLMs) | Detailed product info, features, pricing |

Why Separate Files?

For sites with many products (up to 5,000+):

| Approach | Root llms.txt Size | Trade-off |
| --- | --- | --- |
| Everything in root | Grows with products | 5,000 products = massive file, hits token limits |
| Separate files | Stays small (~3KB) | Each product file is ~1-2KB, fetched on demand |
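Under this scheme the root file only needs a short index that points at the per-product files. A minimal sketch, assuming a "## Products" section format (the function name and section layout are illustrative, not the actual `product_llms.py` output):

```python
# Illustrative sketch: the root llms.txt stays small by linking out to
# per-product files under /llms/. Section format is an assumption.
from typing import List

def product_index_section(base_url: str, product_slugs: List[str]) -> str:
    lines = ["## Products"]
    lines += [f"- {base_url}/llms/{slug}.txt" for slug in product_slugs]
    return "\n".join(lines)

section = product_index_section(
    "https://x.searchcompany.dev", ["widget-a", "widget-b"]
)
```

Even with thousands of products, this index grows by one short line per product, while the detailed content lives in the individual `/llms/{product-slug}.txt` files fetched on demand.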

Flow

Vercel Deployment

Code Location

src/app/shared/ai_website/
├── __init__.py
├── service.py           # Main create_ai_website function
├── check_url.py         # URL validation
├── llm_organize.py      # 3 parallel Gemini calls
├── generate_files.py    # File generation (deterministic + from LLM output)
├── deploy.py            # Vercel deployment
├── assign_domain.py     # Subdomain assignment
├── html_generators.py   # HTML/JS page generation
├── static_templates.py  # robots.txt, sitemap.xml templates
└── product_llms.py      # Product llms.txt generation (used by discover_products)

src/app/shared/prompts/templates/ai_website/
├── llms_txt_generation.py           # Prompt for Call 1 (root llms.txt)
├── product_llms_txt_generation.py   # Prompt for product-specific llms.txt
├── index_js_homepage_generation.py  # Prompt for Call 2
└── data_json_generation.py          # Prompt for Call 3

Database Updates

Updates the ai_sites table:
INSERT INTO ai_sites (
  entity_id,
  ai_site_url,
  vercel_deployment_url,
  page_hashes,
  site_map,
  deployed_at
) VALUES (...)
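The write should use parameterized SQL. The following is a self-contained sketch using `sqlite3` purely for illustration (production presumably targets a different database, and `page_hashes`/`site_map` would likely be JSON/JSONB columns rather than text):

```python
# Hedged sketch of the ai_sites insert with parameterized SQL, shown with
# sqlite3 so it runs standalone; the production database and column types differ.
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ai_sites (
  entity_id TEXT PRIMARY KEY,
  ai_site_url TEXT,
  vercel_deployment_url TEXT,
  page_hashes TEXT,
  site_map TEXT,
  deployed_at TEXT
)""")
conn.execute(
    "INSERT INTO ai_sites VALUES (?, ?, ?, ?, ?, ?)",
    (
        "uuid-123",
        "https://my-business.searchcompany.dev",
        "https://my-business-xyz.vercel.app",
        json.dumps({"/about": "abc123"}),          # content hash per page
        json.dumps(["/about", "/pricing"]),        # deployed paths
        datetime.now(timezone.utc).isoformat(),
    ),
)
row = conn.execute(
    "SELECT ai_site_url FROM ai_sites WHERE entity_id = ?", ("uuid-123",)
).fetchone()
```

Storing `page_hashes` lets a later `force_refresh=False` run skip regeneration when the scraped content is unchanged.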

Error Handling

{
  "status": "error",
  "error": "Vercel deployment failed: rate limit exceeded"
}
If deployment fails, the error is logged but onboarding continues. The site can be regenerated later via the manual trigger endpoint.
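The error contract above can be sketched as a wrapper around the deploy step. `deploy_to_vercel` here is a stand-in for the real `deploy.py` call, rigged to fail so the error path is visible:

```python
# Sketch of the error contract described above: deployment failures are
# caught, logged, and returned as a status dict so onboarding can continue.
# deploy_to_vercel is a stand-in for the real deploy.py call.
import logging

logger = logging.getLogger("ai_website")

def deploy_to_vercel(files: dict) -> str:
    raise RuntimeError("rate limit exceeded")  # simulate a Vercel failure

def safe_deploy(files: dict) -> dict:
    try:
        url = deploy_to_vercel(files)
        return {"status": "success", "deployment_url": url}
    except Exception as exc:
        logger.error("Vercel deployment failed: %s", exc)
        return {"status": "error", "error": f"Vercel deployment failed: {exc}"}

result = safe_deploy({})
```

Because `safe_deploy` never raises, the orchestrator can record the error and move on to the other GROUP 2a tasks, leaving regeneration to the manual trigger endpoint.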