Skip to main content

Purpose

Discovers NEW products from a Shopify store using hash-based change detection. This endpoint handles the complete product discovery pipeline internally - fetching, saving, prompt generation, and LLMs file generation.
Batch 1b: This endpoint runs in parallel with Batch 1a (update-ai-site). It’s completely decoupled - no scraped content needed.

Architecture

Why Shopify products.json?

Shopify stores expose a public products.json API endpoint that returns all product data:
  1. Faster - Direct API call vs scraping/AI extraction
  2. More reliable - Structured JSON data, no parsing errors
  3. Complete data - Title, description, variants, images, pricing
  4. No AI costs - No Gemini/OpenAI API calls needed for product discovery

Hash-Based Change Detection

Instead of re-processing all products every time, we use efficient change detection:
ColumnTypePurpose
products_hashTEXTMD5 hash of sorted product handles for quick comparison
products_snapshotJSONBFull product list from last sync
Hash comparison:
- If hash unchanged β†’ skip entirely (no work needed)
- If hash changed β†’ compare snapshots to find NEW products only

Product Data Extracted

Each product from Shopify includes:
{
  "name": "Product Title",
  "description": "Product description (cleaned from HTML)",
  "url": "https://store.com/products/handle",
  "handle": "product-handle",
  "product_type": "Category",
  "vendor": "Brand Name",
  "price": "29.99",
  "image_url": "https://cdn.shopify.com/...",
  "meta_title": "Product Title | Store Name",
  "meta_description": "SEO description for this product",
  "variants": [
    {"id": 123, "title": "Default", "price": "29.99", "sku": "ABC123"}
  ],
  "images": ["https://cdn.shopify.com/..."],
  "tags": "tag1, tag2",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-15T00:00:00Z"
}
Meta title and description are always fetched from each product page (with rate limiting to be respectful).

What This Endpoint Does (All-in-One)

This endpoint handles the complete product pipeline internally:
discover-products (this endpoint)
β”œβ”€β”€ Fetch /products.json from Shopify
β”œβ”€β”€ Compute hash & compare with stored products_hash
β”œβ”€β”€ If changed:
β”‚   β”œβ”€β”€ Compare snapshots to find NEW products
β”‚   β”œβ”€β”€ Save new products to entities table
β”‚   β”œβ”€β”€ Generate 10 prompts per new product (internal call)
β”‚   β”œβ”€β”€ Generate /llms/{slug}.txt for new products (internal call)
β”‚   β”œβ”€β”€ Return files for combined deploy (if skip_deploy=True)
β”‚   └── Update products_hash & products_snapshot
└── If unchanged:
    └── Return "unchanged" status (skip)
Standalone endpoints exist for manual use: The separate endpoints (generate-product-prompts, generate-product-llms-txt) exist for manual triggering and debugging, but are NOT called by the cron job.

Code Location

src/app/apis/cron/discover_products/routes.py
src/app/shared/products/discover.py  # Core Shopify product fetching logic
src/app/shared/products/generate_llms_txt.py  # LLMs generation