
GEO Agent

The GEO (Generative Engine Optimization) Agent transforms real visitor questions from the Website Agent into published FAQ pages that get cited by AI search engines (ChatGPT, Perplexity, Google AI Overviews) and improve the Website Agent's own answers.

Pipeline

  1. Collect visitor questions from Supabase messages × conversations (last 30 days, production only).
  2. Filter noise (support requests, navigation queries, bots, spam).
  3. Group by page_url and pick top-N pages by question volume.
  4. Cluster questions per page (DBSCAN, cosine distance) + a site-wide global pool.
  5. Generate FAQ pairs grounded in the client's knowledge base (LightRAG hybrid retrieval).
  6. Store in geo.faqs (separate from config.knowledge_faqs).
  7. Render static HTML + JSON-LD (FAQPage + Article + BreadcrumbList).
  8. Publish HTML + sitemap + metadata to Cloudflare R2.
  9. Notify search engines of the change.

See docs/plans/2026-04-10-ix-2453-geo-content-agent-design.md for the full architecture.
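Steps 2 and 3 of the pipeline (filter noise, then rank pages by question volume) can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the noise patterns, the minimum-length threshold, and the `is_noise` / `top_pages` helpers are all hypothetical, and the real filter is likely more involved.

```python
import re
from collections import Counter

# Hypothetical noise patterns; the real filter rules are not documented here.
NOISE_PATTERNS = [
    re.compile(r"\b(reset|forgot) (my )?password\b", re.I),  # support request
    re.compile(r"\bwhere (is|can i find)\b", re.I),          # navigation query
]


def is_noise(question: str) -> bool:
    """Return True for messages that should not feed FAQ generation."""
    text = question.strip()
    if len(text) < 5:  # assumed threshold for greetings and fragments
        return True
    return any(p.search(text) for p in NOISE_PATTERNS)


def top_pages(rows, n):
    """rows: iterable of (page_url, question) pairs.

    Returns the n page URLs with the most non-noise questions,
    highest volume first.
    """
    counts = Counter(url for url, question in rows if not is_noise(question))
    return [url for url, _ in counts.most_common(n)]


rows = [
    ("/pricing", "How much does the pro plan cost?"),
    ("/pricing", "Is there a free trial?"),
    ("/docs", "Does it support SSO?"),
    ("/docs", "I forgot my password"),
    ("/home", "hi"),
]
selected = top_pages(rows, 2)
```

Filtering before counting keeps a page from ranking highly only because it attracts support or navigation traffic.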

Publishing & Delivery

Rendered FAQ HTML lives in Cloudflare R2 (rose-geo-pages bucket) and is served by the cloudflare-geo-pages Worker at faq.userose.ai. Clients can optionally point a CNAME faq.client.com → faq.userose.ai; the Worker resolves the host to a tenant via per-tenant meta.json stored alongside the HTML.

Canonical URL strategy

Each tenant has exactly one canonical URL to avoid SEO duplicate-content penalties. The canonical is baked into the static HTML at render time:

| Tenant config | Canonical URL |
| --- | --- |
| No custom domain | https://faq.userose.ai/{site_domain}/frequently-asked-questions |
| Custom CNAME configured (publishing.public_host) | https://{public_host}/frequently-asked-questions |

The Worker enforces canonicalization: any request whose Host does not match the tenant's canonical host (public_host when set, otherwise faq.userose.ai) is 301-redirected to the canonical URL. The HTML body itself is identical at both hosts (same R2 object), but only one URL is indexable.
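The canonical URL rule and the redirect decision can be expressed as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, the Worker itself is not written in Python, and the comparison looks only at the host (path handling is left out).

```python
from typing import Optional

DEFAULT_HOST = "faq.userose.ai"
FAQ_PATH = "/frequently-asked-questions"


def canonical_url(site_domain: str, public_host: Optional[str] = None) -> str:
    """Build the single canonical FAQ URL for a tenant."""
    if public_host:
        return f"https://{public_host}{FAQ_PATH}"
    return f"https://{DEFAULT_HOST}/{site_domain}{FAQ_PATH}"


def redirect_target(
    request_host: str, site_domain: str, public_host: Optional[str] = None
) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if the request
    already arrived on the tenant's canonical host."""
    canonical_host = public_host or DEFAULT_HOST
    if request_host.lower() == canonical_host.lower():
        return None
    return canonical_url(site_domain, public_host)


# A tenant with a custom CNAME: requests on the shared host get redirected.
target = redirect_target("faq.userose.ai", "client.com", "faq.client.com")
```

Because the canonical is computed from tenant config alone, the same function can be used at render time (for the `<link rel="canonical">` value) and at request time (for the redirect).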

Search Engine Discoverability

After each pipeline run, the Worker and the pipeline's IndexNow ping together make the new content discoverable across the major engines used by AI search.

| Engine | Mechanism | Indexing latency | Why it matters |
| --- | --- | --- | --- |
| Google | sitemap.xml with <lastmod> + JSON-LD dateModified | Hours → days | Largest crawl footprint. Powers Google AI Overviews. |
| Bing | IndexNow protocol (real-time push) | Minutes | Powers ChatGPT search: a direct path into AI citations. |
| Yandex | IndexNow protocol | Minutes | Same protocol as Bing. |
| Perplexity | Inherits Bing + Google indexes | Inherits both | No dedicated submission API. |
| All bots | robots.txt + per-tenant sitemap.xml | Continuous | Allows discovery without prior knowledge of the URL. |

Sitemap

Each tenant gets a single-URL sitemap.xml containing the canonical FAQ URL and <lastmod> = pipeline run timestamp. The Worker serves it at:

  • https://faq.userose.ai/{site_domain}/sitemap.xml
  • https://faq.client.com/sitemap.xml (custom domains)

The <lastmod> value is the only freshness signal Google reliably honors since it removed the sitemap ping endpoint in January 2024.
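A single-URL sitemap of this shape is small enough to render with string formatting. The sketch below assumes a hypothetical `render_sitemap` helper and a timezone-aware run timestamp; it is not the pipeline's actual renderer.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def render_sitemap(canonical_url: str, run_at: datetime) -> str:
    """Render a one-entry sitemap.xml.

    run_at must be timezone-aware; it is normalised to UTC and written
    in W3C datetime format, which the sitemap protocol expects.
    """
    lastmod = run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "  <url>\n"
        f"    <loc>{escape(canonical_url)}</loc>\n"
        f"    <lastmod>{lastmod}</lastmod>\n"
        "  </url>\n"
        "</urlset>\n"
    )


sitemap_xml = render_sitemap(
    "https://faq.userose.ai/client.com/frequently-asked-questions",
    datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc),
)
```

Passing the pipeline run timestamp in explicitly (rather than calling `datetime.now()` inside the renderer) keeps `<lastmod>` aligned with the `dateModified` value written into the page's JSON-LD.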

IndexNow (Bing / Yandex / ChatGPT)

IndexNow is a push protocol supported by Bing, Yandex, Naver, Seznam, and Yep. A single POST to https://api.indexnow.org/indexnow notifies all participating engines that a URL changed; they recrawl within minutes instead of days.

Why it's worth wiring up: Bing's index is the primary source for ChatGPT search results. Faster Bing indexing → faster AI citations for newly generated FAQs.

Domain ownership verification: IndexNow requires a static key file (e.g. /{indexnow_key}.txt) served at the same domain as the submitted URL, containing the key string as its body. The Worker serves this file from configuration.

Failure mode: The pipeline's IndexNow ping is best-effort — failures are logged but do not fail the pipeline. The sitemap remains the canonical fallback.
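The ping and its best-effort failure handling might look like the sketch below. The JSON fields (`host`, `key`, `keyLocation`, `urlList`) follow the public IndexNow protocol; the function names, the logger name, the root-level key file location, and the injectable `send` parameter are assumptions made for illustration.

```python
import json
import logging
import urllib.request
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
log = logging.getLogger("geo.indexnow")  # hypothetical logger name


def build_payload(url: str, key: str) -> dict:
    """Build the IndexNow JSON body for one changed URL."""
    host = urlsplit(url).netloc
    return {
        "host": host,
        "key": key,
        # Assumes the key file is served at the host root, as /{key}.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": [url],
    }


def _post_json(endpoint: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload and return the HTTP status code."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def ping_indexnow(url: str, key: str, send=_post_json) -> bool:
    """Best-effort ping: log failures and return False, never raise."""
    try:
        status = send(INDEXNOW_ENDPOINT, build_payload(url, key))
    except Exception as exc:  # network errors, HTTP 4xx/5xx, timeouts
        log.warning("IndexNow ping failed for %s: %s", url, exc)
        return False
    if status not in (200, 202):
        log.warning("IndexNow returned unexpected status %s for %s", status, url)
        return False
    return True
```

A caller can ignore the boolean result entirely; it exists so the pipeline can record whether the push succeeded without changing its exit status.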

Google: no equivalent push API

Google deprecated google.com/ping?sitemap=... in January 2024 and offers no general-purpose real-time indexing API:

  • Indexing API: documented as JobPosting and BroadcastEvent only. Using it for other content is undocumented and may be ignored or penalized — not used here.
  • Search Console URL Inspection API: ~2000 requests/day per property, and it only reports index status rather than requesting a crawl; too restrictive for automated use.

The supported path for Google is sitemap + accurate <lastmod>, plus a one-time Search Console verification per domain after Worker deployment.

Response Headers

The Worker sets the following headers on FAQ HTML responses:

Content-Type: text/html; charset=utf-8
Cache-Control: public, max-age=86400
X-Robots-Tag: index, follow

Edge KV caches the R2 object for 60 seconds, so a fresh pipeline run reaches the Worker quickly; the Cache-Control header sets the longer 24-hour TTL for browsers and downstream CDNs.

Schema Stacking

Every rendered page contains three JSON-LD blocks for maximum AI citation coverage:

  • FAQPage — primary structured-data schema for Q&A content.
  • Article — wraps the page as a publishable article with datePublished and dateModified set to the pipeline run timestamp.
  • BreadcrumbList — navigation context.

Research consistently shows that stacked schemas roughly double AI citation rates versus a single FAQPage block. dateModified is set at pipeline render time (not at every HTML construction) so cached HTML stays consistent with what search engines see.
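As an illustration of the three stacked blocks, the sketch below builds them from a list of question/answer pairs and a single run timestamp. The helper names and the exact set of properties are assumptions; the real renderer may emit more fields (author, publisher, deeper breadcrumbs).

```python
import json
from datetime import datetime, timezone


def build_jsonld(faqs, canonical_url: str, title: str, run_at: datetime) -> list:
    """Return [FAQPage, Article, BreadcrumbList] as plain dicts.

    faqs is an iterable of (question, answer) string pairs.
    """
    stamp = run_at.isoformat()  # one timestamp shared by both date fields
    faq_page = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }
    article = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": title,
        "mainEntityOfPage": canonical_url,
        "datePublished": stamp,
        "dateModified": stamp,
    }
    breadcrumbs = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": 1, "name": title, "item": canonical_url},
        ],
    }
    return [faq_page, article, breadcrumbs]


def render_script_tags(blocks) -> str:
    """Serialise each block into its own JSON-LD script tag."""
    return "\n".join(
        '<script type="application/ld+json">'
        # Escape "</" so answer text cannot close the script element early.
        + json.dumps(block, ensure_ascii=False).replace("</", "<\\/")
        + "</script>"
        for block in blocks
    )


blocks = build_jsonld(
    [("Is there a free trial?", "Yes, for 14 days.")],
    "https://faq.client.com/frequently-asked-questions",
    "Frequently Asked Questions",
    datetime(2026, 4, 10, tzinfo=timezone.utc),
)
html_tags = render_script_tags(blocks)
```

Computing the timestamp once and passing it in is what keeps `dateModified` stable across repeated HTML constructions of the same pipeline run.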

Configuration

The publishing block of the geo config slug controls page styling and delivery:

| Field | Type | Purpose |
| --- | --- | --- |
| public_host | str \| None | Custom delivery host. When set, canonical URL is https://{public_host}/frequently-asked-questions. When unset, defaults to https://faq.userose.ai/{site_domain}/frequently-asked-questions. |
| title, subtitle, logo_url, custom_css, theme_preset, color fields, font fields | various | Visual styling; see GeoPublishing model. |