Detect content changes on a website using an efficient 2-stage approach:
  1. Hashing Service (free) - Fetch raw HTML and hash to find changes
  2. Firecrawl Batch Scrape (paid) - Only scrape pages that actually changed
This is the first step in the 3-API update flow:
  1. detect-changes (this endpoint) - Find what changed
  2. update-ai-site - Update the AI website
  3. discover-products-from-changes - Find new products

How It Works

1. Custom Mapper → Get current URL list (sitemap + robots.txt + HTML links) [FREE]
2. Compare URLs vs stored site_map
   - New URLs = new pages
   - Missing URLs = removed pages
3. Hashing Service → Fetch raw HTML + hash for existing pages [FREE]
4. Compare hashes vs stored hashes
   - Hash mismatch = content changed
5. Batch Scrape ONLY new + changed URLs → Get markdown [PAID - only what's needed]
6. Return all data for downstream APIs
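The URL-set diff and hash comparison in steps 2–4 can be sketched as below. The function name and the fetch_html callable (url → raw HTML string, standing in for the Hashing Service's plain HTTP GET) are hypothetical:

```python
import hashlib

def detect_page_changes(current_urls, stored_site_map, stored_hashes, fetch_html):
    """Diff URL sets, then hash surviving pages to spot content changes.

    fetch_html is a hypothetical callable (url -> raw HTML string) standing in
    for the free Hashing Service's plain HTTP GET.
    """
    current, stored = set(current_urls), set(stored_site_map)
    new_urls = current - stored        # pages added since the last run
    removed_urls = stored - current    # pages no longer on the site

    changed_urls, updated_hashes = [], {}
    for url in current & stored:       # existing pages: compare content hashes
        page_hash = hashlib.sha256(fetch_html(url).encode()).hexdigest()
        updated_hashes[url] = page_hash
        if page_hash != stored_hashes.get(url):
            changed_urls.append(url)

    return new_urls, removed_urls, changed_urls, updated_hashes
```

Only the URLs in new_urls and changed_urls would then be passed to the paid batch scrape.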

Cost Efficiency

| Step | Cost | Description |
|---|---|---|
| Custom Mapper | Free | Our internal service |
| Hashing Service | Free | Raw HTTP GET + SHA-256 |
| Batch Scrape | ~$0.01/page | Only for new + changed pages |
Before optimization: Batch scrape ALL pages every time.
After optimization: Batch scrape only pages that actually changed.
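As a rough worked example using the ~$0.01/page figure above (the site size and change count are hypothetical):

```python
PAGE_COST = 0.01                 # approximate Firecrawl batch-scrape cost per page
total_pages, changed = 500, 5    # hypothetical site size and changed-page count

cost_before = total_pages * PAGE_COST   # old approach: scrape every page each run
cost_after = changed * PAGE_COST        # new approach: scrape only new + changed
print(f"before: ${cost_before:.2f}, after: ${cost_after:.2f}")
```

For a mostly-static 500-page site, that is roughly $5.00 per run down to about $0.05.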

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| business_id | string | Yes | Clerk org ID |
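A minimal request sketch; the base URL, endpoint path, and helper name are assumptions to adapt to your deployment:

```python
import json
import urllib.request

def post_detect_changes(base_url: str, business_id: str) -> dict:
    """POST the Clerk org ID to the detect-changes endpoint (path assumed)."""
    req = urllib.request.Request(
        f"{base_url}/detect-changes",
        data=json.dumps({"business_id": business_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Batch scraping changed pages can take a while; allow a generous timeout.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```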

Response

{
  "status": "success",  // or "unchanged" or "error"
  "new_pages": [
    {"url": "https://example.com/new-page", "markdown": "...", "hash": "abc123"}
  ],
  "changed_pages": [
    {"url": "https://example.com/about", "markdown": "...", "old_hash": "def456", "new_hash": "ghi789"}
  ],
  "removed_urls": ["https://example.com/old-page"],
  "unchanged_pages": [
    {"url": "https://example.com/", "hash": "..."}
  ],
  "updated_site_map": ["https://example.com/", "https://example.com/about", ...],
  "updated_hashes": {"https://example.com/": "abc123", ...},
  "business_info": {
    "entity_id": "uuid",
    "url": "https://example.com",
    "name": "Example Company",
    "clerk_org_id": "org_xxx",
    "ai_site_id": "uuid",
    "deployment_url": "https://example.searchcompany.dev",
    "project_name": "example-searchcompany-dev"
  }
}
Key difference: unchanged_pages entries do NOT include markdown, because those pages were never batch scraped. Only new_pages and changed_pages carry markdown content.
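A small helper (hypothetical, mirroring the response shape above) that collects only the pages carrying markdown for downstream processing:

```python
def pages_to_reprocess(changes: dict) -> dict:
    """Map url -> markdown for pages that need downstream processing.

    unchanged_pages are deliberately excluded: they were never batch
    scraped, so they carry no markdown.
    """
    if changes.get("status") != "success":
        return {}
    return {
        page["url"]: page["markdown"]
        for page in changes.get("new_pages", []) + changes.get("changed_pages", [])
    }
```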

Status Values

| Status | Meaning |
|---|---|
| success | Changes detected, proceed with update |
| unchanged | No changes detected, skip update |
| error | Something went wrong |

Usage in Cron

import asyncio

async def run_update_flow(business_id: str) -> None:
    # Step 1: Detect changes
    changes = await detect_changes(business_id)

    if changes["status"] == "unchanged":
        return  # Nothing to do

    # Steps 2 & 3: Run in parallel
    await asyncio.gather(
        update_ai_site(business_id, changes),
        discover_products_from_changes(business_id, changes),
    )

Database Reads

  • entities table - Get business entity info
  • ai_sites table - Get site_map and page_hashes

External API Calls

  • Custom Website Mapper (src/app/shared/mapping) - Get URL list [FREE]
  • Hashing Service (src/app/shared/hashing) - Fetch raw HTML + hash [FREE]
  • Firecrawl Batch Scrape API (/v2/batch/scrape) - Get markdown [PAID, only for changed pages]
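The free per-page check done by the Hashing Service amounts to a plain GET plus SHA-256 over the raw bytes; a sketch of that assumed behavior (function name is hypothetical):

```python
import hashlib
import urllib.request

def fetch_and_hash(url: str) -> str:
    """Plain HTTP GET + SHA-256 over raw response bytes.

    No rendering and no scraping fees -- just enough to tell
    whether a page's raw HTML differs from the stored hash.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()
```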