Purpose

Extracts products from page content using Gemini and saves them to the database. This is the first step in the product pipeline.

Architecture

Why Chunking?

Large websites can yield 50K+ characters of content, and asking Gemini to process all of it in one call causes timeouts. Splitting the content into 15K-character chunks brings three benefits (see the sketch after this list):
  1. Faster responses - each chunk completes in ~10-30s
  2. Parallel processing - all chunks run simultaneously
  3. Better reliability - if one chunk fails, the others still succeed
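
A minimal sketch of this chunk-and-gather pattern (split_into_chunks and extract_all are illustrative names, not the real API; extract_chunk is an async callable like the one sketched in the next section):

import asyncio

CHUNK_SIZE = 15_000  # characters per Gemini call

def split_into_chunks(content: str, size: int = CHUNK_SIZE) -> list[str]:
    # Naive fixed-width split; the real splitter may respect page boundaries.
    return [content[i:i + size] for i in range(0, len(content), size)]

async def extract_all(content: str, extract_chunk) -> list[dict]:
    chunks = split_into_chunks(content)
    # return_exceptions=True lets the surviving chunks succeed if one fails.
    results = await asyncio.gather(
        *(extract_chunk(chunk) for chunk in chunks), return_exceptions=True
    )
    # Flatten successful results; failed chunks are simply dropped.
    return [p for r in results if not isinstance(r, Exception) for p in r]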

Simplified Extraction

Each Gemini call extracts only the product name and description, using a prompt of this shape:
BUSINESS: {business_name}
EXISTING PRODUCTS (DO NOT INCLUDE): [list]
CONTENT: [chunk content]

Return JSON:
{"products": [{"name": "...", "description": "..."}]}
Source URLs are mapped after extraction using simple Python string matching - much faster than asking Gemini to track URLs.

Source URL Mapping

After extraction, we find which pages mention each product:
for product in products:
    product["source_urls"] = []  # start fresh for each product
    for page in pages:
        # Case-insensitive substring match of the product name in the page body
        if product["name"].lower() in page["markdown"].lower():
            product["source_urls"].append(page["url"])
This in-process scan is effectively instant compared to routing the same question through Gemini.

Pipeline Flow

1. discover-products (this endpoint)
   ├── Split content into chunks
   ├── Extract names + descriptions in parallel
   ├── Deduplicate similar products (sketched below)
   ├── Map source URLs
   └── Save to entities table

2. generate-product-prompts
   ├── Takes entity_ids from step 1
   └── Generates 10 prompts per product

3. generate-product-llms-txt
   ├── Takes entity_ids from step 1
   └── Generates /llms/{slug}.txt files
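
The deduplication step in stage 1 is not spelled out above; a minimal sketch, assuming products count as duplicates when their normalized names match (the real logic may use fuzzier similarity):

def dedupe_products(products: list[dict]) -> list[dict]:
    # Keep the first occurrence of each normalized name.
    seen: set[str] = set()
    unique = []
    for product in products:
        key = product["name"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(product)
    return unique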

Code Location

src/app/apis/cron/discover_products/routes.py  # Endpoint route handler
src/app/shared/products/discover.py            # Core extraction logic