Extract products from page content and save them to the database. This is the first step in the product pipeline:
  1. discover-products (this endpoint) - Extract products from content
  2. generate-product-prompts - Generate visibility prompts for products
  3. generate-product-llms-txt - Generate llms files for products

How It Works

Architecture

For large websites, content is processed in five stages:
  1. Chunking - Content split into 15K char chunks
  2. Parallel extraction - All chunks processed simultaneously
  3. Simple prompts - Gemini only extracts name + description (fast)
  4. Deduplication - Final Gemini pass merges similar products
  5. URL mapping - Python string matching finds which pages mention each product
This architecture avoids timeout issues by keeping each Gemini call focused and fast.
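The five stages above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: `extract_chunk` and `dedupe` stand in for the Gemini calls (steps 3 and 4), which are not shown in this document, and all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 15_000  # step 1: 15K char chunks


def chunk_content(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Step 1: split content into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def map_source_urls(products: list[dict], pages: list[dict]) -> list[dict]:
    """Step 5: plain string matching to find which pages mention each product."""
    for product in products:
        product["source_urls"] = [
            page["url"] for page in pages
            if product["name"].lower() in page["markdown"].lower()
        ]
    return products


def discover(pages: list[dict], extract_chunk, dedupe) -> list[dict]:
    """Run the pipeline: chunk, extract in parallel, dedupe, map URLs."""
    combined = "\n".join(page["markdown"] for page in pages)
    chunks = chunk_content(combined)
    with ThreadPoolExecutor() as pool:      # step 2: all chunks in parallel
        per_chunk = list(pool.map(extract_chunk, chunks))
    candidates = [p for chunk in per_chunk for p in chunk]
    products = dedupe(candidates)           # step 4: merge similar products
    return map_source_urls(products, pages)
```

Because each `extract_chunk` call sees at most 15K characters and returns only name + description, no single model call has to hold the whole site in context, which is what keeps the calls fast enough to avoid timeouts.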

Request Body

Field          Type    Required  Description
business_id    string  Yes       Clerk org ID
pages          array   Yes       Pages with {url, markdown}
business_info  object  Yes       Business entity info

business_info Object

Field         Type    Description
entity_id     string  Business entity UUID
url           string  Business website URL
name          string  Business name
clerk_org_id  string  Clerk org ID
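A request matching the two tables above could be assembled and sent like this. The endpoint URL is hypothetical (the actual host and path are not given in this document), and the field values are placeholders.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the real deployment host.
ENDPOINT = "https://api.example.com/apis/cron/discover-products"

payload = {
    "business_id": "org_abc123",            # Clerk org ID
    "pages": [
        {
            "url": "https://example.com/products/a",
            "markdown": "# Product A\nOur flagship widget.",
        }
    ],
    "business_info": {
        "entity_id": "uuid-123",            # business entity UUID
        "url": "https://example.com",
        "name": "Example Co",
        "clerk_org_id": "org_abc123",
    },
}

req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment against a live server
```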

Response

{
  "status": "success",
  "existing_products": 3,
  "new_products": [
    {
      "name": "Product A",
      "entity_id": "uuid-123",
      "source_urls": ["https://example.com/products/a"],
      "url": "https://example.com/products/a"
    }
  ]
}

What This Endpoint Does NOT Do

  • ❌ Generate prompts (use generate-product-prompts)
  • ❌ Generate llms files (use generate-product-llms-txt)
  • ❌ Deploy to Vercel
This separation allows for:
  1. Better debugging - Run each step independently
  2. Easier retries - If prompts fail, don’t re-discover products
  3. Parallel execution - Run prompts and llms generation in parallel
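One way to exploit point 3 is to kick off the two follow-up endpoints concurrently once discovery has finished. A minimal sketch with stand-in coroutines: `generate_prompts` and `generate_llms_txt` are illustrative placeholders for the real HTTP calls to generate-product-prompts and generate-product-llms-txt.

```python
import asyncio


async def generate_prompts(products: list[dict]) -> dict:
    """Stand-in for calling generate-product-prompts."""
    await asyncio.sleep(0)  # placeholder for the HTTP call
    return {"step": "prompts", "count": len(products)}


async def generate_llms_txt(products: list[dict]) -> dict:
    """Stand-in for calling generate-product-llms-txt."""
    await asyncio.sleep(0)  # placeholder for the HTTP call
    return {"step": "llms_txt", "count": len(products)}


async def run_followups(products: list[dict]) -> list[dict]:
    # Steps 2 and 3 of the pipeline are independent, so they can run
    # in parallel; if one fails, the other can be retried on its own
    # without re-running discovery.
    return await asyncio.gather(
        generate_prompts(products),
        generate_llms_txt(products),
    )


results = asyncio.run(run_followups([{"name": "Product A"}]))
```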

Database Updates

Table     Action
entities  Insert new products with type: "product" and product_source_urls
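The write can be sketched as below. The schema is illustrative only, inferred from this document (an `entities` table with `type` and `product_source_urls` columns); the real table, column types, and database engine are assumptions, and SQLite is used here purely so the sketch is self-contained.

```python
import json
import sqlite3

# Illustrative schema, not the real migration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entities (
        id TEXT PRIMARY KEY,
        name TEXT,
        type TEXT,
        product_source_urls TEXT  -- JSON array of pages mentioning the product
    )
""")


def insert_product(conn: sqlite3.Connection, product: dict) -> None:
    """Insert a discovered product with type='product' and its source URLs."""
    conn.execute(
        "INSERT INTO entities (id, name, type, product_source_urls)"
        " VALUES (?, ?, ?, ?)",
        (
            product["entity_id"],
            product["name"],
            "product",
            json.dumps(product["source_urls"]),
        ),
    )


insert_product(conn, {
    "entity_id": "uuid-123",
    "name": "Product A",
    "source_urls": ["https://example.com/products/a"],
})
row = conn.execute("SELECT type, product_source_urls FROM entities").fetchone()
```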

Code Location

src/app/apis/cron/discover_products/routes.py  # Route handler
src/app/shared/products/discover.py            # Core extraction logic