Purpose

Detects content changes on a website using a staged approach that minimizes Firecrawl costs:
  1. Custom Mapper - Enumerate current URLs from the sitemap, robots.txt, and HTML links (free)
  2. Hashing Service - Fetch raw HTML directly (free) and compare content hashes
  3. Batch Scrape - Only scrape pages that actually changed (paid)

Architecture

Detection Logic

  1. Get stored data - Retrieve site_map (URL list) and page_hashes from ai_sites
  2. Map current URLs - Use custom mapper to get current URL list [FREE]
  3. Compare URL lists - Find new URLs, removed URLs, existing URLs
  4. Fetch + Hash existing pages - Use hashing service to get raw HTML and hash [FREE]
  5. Compare hashes - Find which existing pages changed
  6. Batch scrape - Only scrape NEW + CHANGED pages with Firecrawl [PAID]
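The six steps above can be sketched end to end as follows. This is a simplified illustration, not the actual mini_orchestrator.py code: the function name, parameters, and the `batch_scrape` callable signature are assumptions, and the current hashes are passed in rather than fetched.

```python
def detect_changes(stored_site_map, stored_hashes, current_urls, current_hashes, batch_scrape):
    """Sketch of the 2-stage detection flow (signatures assumed, not from the real code)."""
    old_urls = set(stored_site_map)

    # Step 3: compare URL lists
    new_urls = set(current_urls) - old_urls
    removed_urls = old_urls - set(current_urls)
    existing = old_urls & set(current_urls)

    # Step 5: compare hashes for pages present in both lists
    changed = {u for u in existing if current_hashes.get(u) != stored_hashes.get(u)}
    unchanged = existing - changed

    # Step 6: only pay for pages that are new or changed
    scraped = batch_scrape(sorted(new_urls | changed))
    return {
        "new_pages": [scraped[u] for u in sorted(new_urls)],
        "changed_pages": [scraped[u] for u in sorted(changed)],
        "removed_urls": sorted(removed_urls),
        "unchanged_pages": sorted(unchanged),
    }
```

Note that `unchanged_pages` never reaches `batch_scrape`, which is where the cost saving comes from.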

Cost Optimization

Stage | Service         | Cost        | What it does
------|-----------------|-------------|-----------------------------------
1     | Custom Mapper   | Free        | Sitemap + robots.txt + HTML links
2     | Hashing Service | Free        | Raw HTTP GET + SHA-256 hash
3     | Firecrawl Batch | ~$0.01/page | Only for new + changed pages
Example: Site with 100 pages, 3 changed
  • Old approach: 100 × $0.01 = $1.00
  • New approach: 3 × $0.01 = $0.03
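The arithmetic above, as a quick sketch (the ~$0.01/page rate is the estimate from the cost table, not a quoted Firecrawl price):

```python
COST_PER_PAGE = 0.01  # assumed flat per-page batch-scrape rate

def scrape_cost(pages_scraped: int) -> float:
    """Cost of one detection run, given how many pages were actually scraped."""
    return round(pages_scraped * COST_PER_PAGE, 2)

old_cost = scrape_cost(100)  # scrape everything: $1.00
new_cost = scrape_cost(3)    # scrape only new + changed: $0.03
```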

Response Format

{
  "status": "success",
  "new_pages": [...],        // Have markdown (batch scraped)
  "changed_pages": [...],    // Have markdown (batch scraped)
  "removed_urls": [...],
  "unchanged_pages": [...],  // NO markdown (not scraped)
  "updated_site_map": [...],
  "updated_hashes": {...},
  "business_info": {...}
}

Change Categories

Category        | Has Markdown? | Description
----------------|---------------|---------------------------------------
new_pages       | Yes           | URLs that didn't exist before
changed_pages   | Yes           | URLs with a different content hash
unchanged_pages | No            | URLs with the same hash (not scraped)
removed_urls    | N/A           | URLs that no longer exist
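The `updated_site_map` and `updated_hashes` response fields can be derived from the same comparison. A minimal sketch, with the merge rule assumed (keep the stored hash when a page failed to hash this run):

```python
def build_updates(stored_hashes, current_urls, current_hashes):
    """Derive updated_site_map and updated_hashes after a detection run (assumed logic)."""
    # Removed URLs drop out because the new map is just the current crawl
    updated_site_map = list(current_urls)
    # Prefer this run's hash; fall back to the stored one if the fetch failed
    updated_hashes = {
        u: current_hashes.get(u, stored_hashes.get(u)) for u in current_urls
    }
    return updated_site_map, updated_hashes
```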

Code Location

src/app/apis/cron/detect_changes/
├── routes.py           # HTTP endpoint
└── mini_orchestrator.py # Detection logic

src/app/shared/hashing/
├── __init__.py         # Exports
└── service.py          # fetch_and_hash, fetch_and_hash_batch
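The hashing service can be approximated with the standard library alone. This is a synchronous sketch under assumed signatures, not the actual service.py (which may add retries, headers, or async fetching):

```python
import hashlib
import urllib.request

def sha256_hex(data: bytes) -> str:
    """SHA-256 hex digest of raw page bytes."""
    return hashlib.sha256(data).hexdigest()

def fetch_and_hash(url: str, timeout: float = 10.0) -> str:
    """Plain HTTP GET (no rendering, no Firecrawl) followed by hashing -- the free stage."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return sha256_hex(resp.read())

def fetch_and_hash_batch(urls):
    """Hash many pages; a failed fetch maps to None instead of aborting the batch."""
    hashes = {}
    for url in urls:
        try:
            hashes[url] = fetch_and_hash(url)
        except OSError:
            hashes[url] = None
    return hashes
```

Because the hash covers the raw HTML bytes, any markup churn (timestamps, nonces, ad slots) will register as a change; normalizing the body before hashing is one way to reduce false positives.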

src/app/shared/mapping/
├── __init__.py         # Exports
└── mapper.py           # map_website
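`map_website` merges three free sources (sitemap, robots.txt, in-page links). A sitemap-only simplification is shown below; the `/sitemap.xml` path convention and the function shape are assumptions, not the real mapper.py:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Extract <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

def map_website(base_url: str) -> list[str]:
    """Fetch the site's sitemap and return its URL list (sitemap-only sketch)."""
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        return parse_sitemap(resp.read().decode("utf-8"))
```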

Used By

  • Daily cron job for site freshness
  • Triggers update-ai-site when changes found
  • Triggers discover-products-from-changes for new pages