Internal Service — This is not an HTTP endpoint. It’s called directly by the generate-all orchestrator.
Purpose
Creates an AI-optimized website with llms.txt, robots.txt, sitemap.xml, structured data, and markdown replica pages. Deploys to Vercel and assigns a *.searchcompany.dev subdomain.
Runs in GROUP 2a (parallel with other tasks after scrape completes).
Function Signature
```python
async def create_ai_website(
    url: str,
    business_id: str,
    force_refresh: bool = False,
    pages: Optional[List[dict]] = None,
    max_pages: Optional[int] = None
) -> dict
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | The business website URL |
| business_id | str | required | The Clerk organization slug |
| force_refresh | bool | False | Force regeneration even if the site already exists |
| pages | List[dict] | None | Pre-scraped pages from GROUP 1a |
| max_pages | int | None | Max pages to scrape if pages is not provided; the service falls back to 5000 |
Returns
```json
{
  "status": "success",
  "ai_site_url": "https://my-business-abc123.searchcompany.dev",
  "entity_id": "uuid-...",
  "deployment_url": "https://my-business-abc123-xyz.vercel.app",
  "qa_slugs": ["what-is-business-name", "how-does-business-work"],
  "replica_paths": ["/about", "/pricing", "/contact"]
}
```
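A sketch of how the generate-all orchestrator might invoke this service. The stub body below only mimics the return shape documented above; the real implementation lives in `ai_website/service.py`, and the orchestrator wiring shown here is illustrative, not the actual generate-all code.

```python
import asyncio
from typing import List, Optional

# Hypothetical stub standing in for the real service function, so the call
# shape can be shown end to end. It returns only the fields asserted below.
async def create_ai_website(
    url: str,
    business_id: str,
    force_refresh: bool = False,
    pages: Optional[List[dict]] = None,
    max_pages: Optional[int] = None,
) -> dict:
    return {
        "status": "success",
        "ai_site_url": f"https://{business_id}.searchcompany.dev",
    }

# The orchestrator passes the GROUP 1a scrape results via `pages` so the
# service skips re-scraping.
result = asyncio.run(
    create_ai_website(
        url="https://example.com",
        business_id="my-business-abc123",
        pages=[{"url": "https://example.com/about", "markdown": "# About\n..."}],
    )
)
```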
File Generation: Deterministic vs LLM-Generated
The AI website consists of two types of files:
Deterministic Files (Generated by Code)
These files are created programmatically without LLM involvement:
| File | Source | Description |
|---|---|---|
| robots.txt | static_templates.py | Standard robots.txt with AI bot allowances |
| sitemap.xml | static_templates.py | Generated from Q&A slugs + replica paths |
| pages/*.js | generate_files.py | Markdown replica pages (1:1 from scraped content) |
| _app.js | html_generators.py | Next.js app wrapper |
| _document.js | html_generators.py | Next.js document with meta tags |
| package.json | generate_files.py | Next.js dependencies |
| next.config.js | generate_files.py | Next.js configuration |
| vercel.json | generate_files.py | Vercel deployment config |
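Since sitemap.xml is built purely from the Q&A slugs and replica paths, it needs no LLM call. A minimal sketch of what static_templates.py could do; the helper name `build_sitemap_xml` and the URL ordering are assumptions, not the actual implementation:

```python
from xml.sax.saxutils import escape

def build_sitemap_xml(site_url: str, qa_slugs: list, replica_paths: list) -> str:
    # Deterministic ordering: homepage first, then Q&A pages, then the
    # markdown replica pages scraped from the source site.
    paths = [""] + [f"/{slug}" for slug in qa_slugs] + replica_paths
    urls = "\n".join(
        f"  <url><loc>{escape(site_url + path)}</loc></url>" for path in paths
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n"
        "</urlset>"
    )

xml = build_sitemap_xml(
    "https://my-business.searchcompany.dev",
    ["what-is-business-name"],
    ["/about", "/pricing"],
)
```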
LLM-Generated Files (Created by Gemini)
These files are generated by three parallel Gemini calls:
| File | Gemini Call | Description |
|---|---|---|
| llms.txt | Call 1 | AI-readable summary of the business (Markdown) |
| pages/index.js | Call 2 | Homepage with Q&A navigation |
| pages/[slug].js | Call 2 | Individual Q&A pages (8-15 pages) |
| data.json | Call 3 | Schema.org JSON-LD structured data |
Pipeline
The Three Gemini Calls
All three calls run in parallel using asyncio.gather():
Call 1: llms.txt Generation
- Input: All scraped pages (markdown content)
- Output: Comprehensive AI-readable summary (500-1500 words)
- Prompt: build_llms_txt_prompt()
Call 2: Homepage + Q&A Pages
- Input: All scraped pages + AI site URL
- Output: JSON with homepage structure + 8-15 Q&A pages
- Prompt: build_index_html_prompt()
Call 3: Schema.org data.json
- Input: All scraped pages + source URL
- Output: JSON-LD structured data
- Prompt: build_data_json_prompt()
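The fan-out described above can be sketched as follows. `gemini_call()` is a hypothetical stand-in for the real Gemini client in llm_organize.py; only the three-way concurrent structure and the prompt-builder names come from this document:

```python
import asyncio

# Hypothetical stand-in for the Gemini client; the real call would send the
# prompt over the network and return the model's response.
async def gemini_call(prompt: str) -> str:
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return f"response to: {prompt}"

async def organize_with_llm(pages: list) -> list:
    # The three generations are independent, so they run concurrently and
    # the total latency is roughly the slowest single call.
    return await asyncio.gather(
        gemini_call("llms.txt prompt"),    # Call 1: build_llms_txt_prompt()
        gemini_call("index prompt"),       # Call 2: build_index_html_prompt()
        gemini_call("data.json prompt"),   # Call 3: build_data_json_prompt()
    )

results = asyncio.run(organize_with_llm([]))
```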
LLMs.txt Structure
```markdown
# Business Name
> One-line description of the business

## About
Detailed description of what the business does...

## Products & Services
- Product A: Description
- Product B: Description

## Key Features
- Feature 1
- Feature 2

## Frequently Asked Questions - [Business Name] - About
- What is [Business Name]? [Link to Q&A page]
- How does [Business Name] work? [Link to Q&A page]
...

## Contact
- Website: https://example.com
- Email: [email protected]
```
Markdown Replica Pages
For each scraped page, the service creates a markdown replica at /{slug}:
Source: https://example.com/about
Replica: https://my-business.searchcompany.dev/about
These replicas:
- Preserve the original content in markdown format
- Are optimized for AI crawlers
- Include structured metadata
- Have collision detection (adds 4-char suffix if slug conflicts with Q&A page)
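The collision rule in the last bullet can be sketched as below, assuming a deterministic hash-derived suffix; the real code may generate the 4-char suffix differently:

```python
import hashlib

def resolve_replica_slug(slug: str, taken: set) -> str:
    # When a replica slug collides with an existing Q&A page slug, append a
    # 4-character suffix. Hashing the slug keeps the result deterministic
    # across regenerations (an assumption, not the confirmed implementation).
    if slug not in taken:
        return slug
    suffix = hashlib.sha1(slug.encode()).hexdigest()[:4]
    return f"{slug}-{suffix}"

taken = {"about"}  # e.g. a Q&A page already owns /about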
Product LLMs Architecture (Scalable)
Product-specific llms files are generated by GROUP 3b (Generate Product LLMs) which runs in parallel with GROUP 3a (Product Prompts) after products are discovered.
File Structure
| File | When Created | Purpose |
|---|---|---|
| /llms.txt | GROUP 2a (Create AI Website) | Business overview, key details, FAQs |
| /llms/{product-slug}.txt | GROUP 3b (Generate Product LLMs) | Detailed product info, features, pricing |
Why Separate Files?
For sites with many products (up to 5,000+):
| Approach | Root llms.txt Size | Tradeoff |
|---|---|---|
| Everything in root | Grows with products | 5,000 products = massive file, hits token limits |
| Separate files | Stays small (~3KB) | Each product file ~1-2KB, fetched on demand |
Flow
Vercel Deployment
Code Location
```
src/app/shared/ai_website/
├── __init__.py
├── service.py           # Main create_ai_website function
├── check_url.py         # URL validation
├── llm_organize.py      # 3 parallel Gemini calls
├── generate_files.py    # File generation (deterministic + from LLM output)
├── deploy.py            # Vercel deployment
├── assign_domain.py     # Subdomain assignment
├── html_generators.py   # HTML/JS page generation
├── static_templates.py  # robots.txt, sitemap.xml templates
└── product_llms.py      # Product llms.txt generation (used by discover_products)

src/app/shared/prompts/templates/ai_website/
├── llms_txt_generation.py          # Prompt for Call 1 (root llms.txt)
├── product_llms_txt_generation.py  # Prompt for product-specific llms.txt
├── index_js_homepage_generation.py # Prompt for Call 2
└── data_json_generation.py         # Prompt for Call 3
```
Database Updates
Updates the ai_sites table:
```sql
INSERT INTO ai_sites (
  entity_id,
  ai_site_url,
  vercel_deployment_url,
  page_hashes,
  site_map,
  deployed_at
) VALUES (...)
```
Error Handling
```json
{
  "status": "error",
  "error": "Vercel deployment failed: rate limit exceeded"
}
```
If deployment fails, the error is logged but onboarding continues. The site can be regenerated later via the manual trigger endpoint.
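The fail-soft behavior can be sketched as below: a deployment failure is logged and reported in the result dict rather than raised, so the orchestrator continues with the remaining onboarding tasks. `deploy_to_vercel()` is a hypothetical stand-in for the real deploy step in deploy.py (here it always fails to demonstrate the error path):

```python
import logging

logger = logging.getLogger("ai_website")

# Hypothetical deploy step; hard-coded to fail so the error path is visible.
def deploy_to_vercel(files: dict) -> str:
    raise RuntimeError("rate limit exceeded")

def safe_deploy(files: dict) -> dict:
    try:
        url = deploy_to_vercel(files)
        return {"status": "success", "deployment_url": url}
    except Exception as exc:
        # Log and return an error dict instead of raising, so the caller
        # (the generate-all orchestrator) can keep going.
        logger.error("Vercel deployment failed: %s", exc)
        return {"status": "error", "error": f"Vercel deployment failed: {exc}"}

result = safe_deploy({})
```

The site can then be regenerated later via the manual trigger endpoint, as noted above.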