GEO Agent

The GEO (Generative Engine Optimization) Agent turns real visitor questions collected by the Website Agent into published FAQ pages. The pages are built to be cited by AI search engines (ChatGPT, Perplexity, Google AI Overviews) and to improve the Website Agent's own answers.

Pipeline

1. Collect visitor questions from the Supabase messages table joined with conversations (last 30 days, production only).
2. Filter noise (support requests, navigation queries, bots, spam).
3. Group by page_url and pick the top-N pages by question volume.
4. Cluster questions per page (DBSCAN, cosine distance), plus a site-wide global pool.
5. Generate FAQ pairs grounded in the client's knowledge base (LightRAG hybrid retrieval).
6. Store the pairs in geo.faqs (separate from config.knowledge_faqs).
7. Render static HTML + JSON-LD (FAQPage + Article + BreadcrumbList).
8. Publish HTML + sitemap + metadata to Cloudflare R2.
9. Notify search engines of the change.
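The filtering and grouping steps can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the row shape (`question`, `page_url`), the `NOISE_MARKERS` list, the length threshold, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical noise markers; the real filter rules are not specified here.
NOISE_MARKERS = ("reset my password", "where is the login", "unsubscribe")


def is_noise(question: str) -> bool:
    """Return True for questions this sketch treats as noise."""
    text = question.strip().lower()
    # Assumption: very short messages rarely make useful FAQ questions.
    if len(text) < 10:
        return True
    return any(marker in text for marker in NOISE_MARKERS)


def top_pages(rows: list[dict], top_n: int) -> dict[str, list[str]]:
    """Group non-noise questions by page_url and keep the top_n busiest pages."""
    by_page: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        if not is_noise(row["question"]):
            by_page[row["page_url"]].append(row["question"])
    # Rank pages by how many usable questions they received.
    counts = Counter({url: len(questions) for url, questions in by_page.items()})
    return {url: by_page[url] for url, _ in counts.most_common(top_n)}
```

Each page's question list would then go to the clustering step; clustering itself is left out of this sketch.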

See docs/plans/2026-04-10-ix-2453-geo-content-agent-design.md for the full architecture.

Publishing & Delivery

Rendered FAQ HTML lives in Cloudflare R2 (rose-geo-pages bucket) and is served by the cloudflare-geo-pages Worker at faq.userose.ai. Clients can optionally point a CNAME faq.client.com → faq.userose.ai; the Worker resolves the host to a tenant via per-tenant meta.json stored alongside the HTML.

Canonical URL strategy

Each tenant has exactly one canonical URL to avoid SEO duplicate-content penalties. The canonical is baked into the static HTML at render time:

| Tenant config | Canonical URL |
| --- | --- |
| No custom domain | `https://faq.userose.ai/{site_domain}/frequently-asked-questions` |
| Custom CNAME configured (`publishing.public_host`) | `https://{public_host}/frequently-asked-questions` |

The Worker enforces canonicalization: any request whose Host does not match the tenant's canonical host (public_host when set, otherwise faq.userose.ai) is 301-redirected to the canonical URL. The HTML body is identical at both hosts (same R2 object), but only one URL is indexable.
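The canonical-URL rule and the redirect decision can be expressed in a few lines. The sketch below is written in Python to show the logic only; it is not the Worker's code. The function names are hypothetical, and it assumes that tenants without a custom domain treat faq.userose.ai as their canonical host.

```python
from typing import Optional

DEFAULT_HOST = "faq.userose.ai"
FAQ_PATH = "/frequently-asked-questions"


def canonical_url(site_domain: str, public_host: Optional[str]) -> str:
    """Build the tenant's single canonical FAQ URL."""
    if public_host:
        return f"https://{public_host}{FAQ_PATH}"
    return f"https://{DEFAULT_HOST}/{site_domain}{FAQ_PATH}"


def redirect_target(
    request_host: str, site_domain: str, public_host: Optional[str]
) -> Optional[str]:
    """Return the URL to 301 to, or None if the request is already canonical."""
    # Assumption: without a custom domain, the default host is canonical.
    canonical_host = public_host or DEFAULT_HOST
    if request_host.lower() == canonical_host.lower():
        return None
    return canonical_url(site_domain, public_host)
```

For a tenant with `public_host="faq.client.com"`, a request arriving on faq.userose.ai yields a redirect target, while a request on faq.client.com yields `None` and is served directly.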

Search Engine Discoverability

After each pipeline run, the Worker and the pipeline's IndexNow ping make the new content discoverable to the major engines behind AI search.

| Engine | Mechanism | Indexing latency | Why it matters |
| --- | --- | --- | --- |
| Google | `sitemap.xml` with `<lastmod>` + JSON-LD `dateModified` | Hours → days | Largest crawl footprint. Powers Google AI Overviews. |
| Bing | IndexNow protocol (real-time push) | Minutes | Powers ChatGPT search: a direct path into AI citations. |
| Yandex | IndexNow protocol | Minutes | Same protocol as Bing. |
| Perplexity | Inherits Bing + Google indexes | Inherits both | No dedicated submission API. |
| All bots | `robots.txt` + per-tenant `sitemap.xml` | Continuous | Allows discovery without prior knowledge of the URL. |

Sitemap

Each tenant gets a single-URL sitemap.xml containing the canonical FAQ URL and a <lastmod> set to the pipeline run timestamp. The Worker serves it at:

https://faq.userose.ai/{site_domain}/sitemap.xml
https://faq.client.com/sitemap.xml (custom domains)

The <lastmod> value is the only freshness signal Google reliably honors now that its sitemap ping endpoint is gone (retired by January 2024).
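A single-URL sitemap of this kind is small enough to build as a string. The following is a sketch under assumptions: `build_sitemap` is a hypothetical name, the timestamp is expected to be timezone-aware, and the `<lastmod>` is written in W3C datetime format in UTC.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def build_sitemap(canonical_url: str, run_at: datetime) -> str:
    """Render a one-entry sitemap.xml; run_at must be timezone-aware."""
    # Normalize to UTC so <lastmod> is unambiguous for crawlers.
    lastmod = run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "  <url>\n"
        f"    <loc>{escape(canonical_url)}</loc>\n"
        f"    <lastmod>{lastmod}</lastmod>\n"
        "  </url>\n"
        "</urlset>\n"
    )
```

Passing the same pipeline run timestamp here and to the JSON-LD `dateModified` keeps the two freshness signals in agreement.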

IndexNow (Bing / Yandex / ChatGPT)

IndexNow is a push protocol supported by Bing, Yandex, Naver, Seznam, and Yep. Participating engines share submissions, so a single POST to https://api.indexnow.org/indexnow notifies all of them that a URL changed; they can then recrawl within minutes instead of days.

Why it's worth wiring up: Bing's index is the primary source for ChatGPT search results. Faster Bing indexing → faster AI citations for newly generated FAQs.

Domain ownership verification: IndexNow requires a static key file (e.g. /{indexnow_key}.txt) served at the same domain as the submitted URL, containing the key string as its body. The Worker serves this file from configuration.

Failure mode: The pipeline's IndexNow ping is best-effort — failures are logged but do not fail the pipeline. The sitemap remains the canonical fallback.
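The ping, the key-file location, and the best-effort failure handling can be sketched together. The JSON fields (`host`, `key`, `keyLocation`, `urlList`) follow the IndexNow protocol; the function names, the logger name, the timeout, and the injectable `send` parameter are assumptions made for this example, not the pipeline's actual interface.

```python
import json
import logging
import urllib.request
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
logger = logging.getLogger("geo.indexnow")  # hypothetical logger name


def build_payload(url: str, key: str) -> dict:
    """Build the IndexNow JSON body for a single changed URL."""
    host = urlsplit(url).netloc
    return {
        "host": host,
        "key": key,
        # The key file must be served from the same host as the URL.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": [url],
    }


def _post(payload: dict) -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def ping_indexnow(url: str, key: str, send=_post) -> bool:
    """Best-effort ping: log failures and return False instead of raising."""
    try:
        status = send(build_payload(url, key))
    except Exception as exc:  # network errors and HTTP 4xx/5xx land here
        logger.warning("IndexNow ping failed for %s: %s", url, exc)
        return False
    if status not in (200, 202):
        logger.warning("IndexNow returned status %s for %s", status, url)
        return False
    return True
```

Because `ping_indexnow` never raises, a caller can invoke it as the last pipeline step and ignore the result; the sitemap still covers discovery when the ping fails.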

Google: no equivalent push API

Google retired google.com/ping?sitemap=... by January 2024 (the deprecation was announced in June 2023) and offers no general-purpose real-time indexing API:

Indexing API: documented as JobPosting and BroadcastEvent only. Using it for other content is undocumented and may be ignored or penalized — not used here.
Search Console URL Inspection API: read-only (it reports index status but cannot request indexing) and capped at roughly 2,000 requests/day per property, so it cannot drive automated submission.

The supported path for Google is sitemap + accurate <lastmod>, plus a one-time Search Console verification per domain after Worker deployment.

Response Headers

The Worker sets the following headers on FAQ HTML responses:

Content-Type: text/html; charset=utf-8
Cache-Control: public, max-age=86400
X-Robots-Tag: index, follow

Edge KV caches the R2 object for 60 s so that a fresh pipeline run propagates quickly. The Cache-Control header sent downstream (max-age=86400) carries the longer 24 h TTL for browsers and CDNs.
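The 60-second edge cache behaves like a plain time-to-live cache in front of the origin read. The sketch below models that behavior in Python under assumptions: `TtlCache` and its `fetch` and `clock` parameters are hypothetical names, and the real Worker presumably relies on the platform's own storage rather than an in-memory dict.

```python
import time
from typing import Callable


class TtlCache:
    """Serve a cached body until it is older than ttl_seconds, then refetch."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # stands in for the R2 read
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the origin read
        body = self._fetch(key)
        self._entries[key] = (now, body)
        return body
```

With a 60-second TTL, a newly published page becomes visible at the edge at most about a minute after the pipeline writes it, independent of the 24-hour TTL sent to downstream caches.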

Schema Stacking

Every rendered page contains three JSON-LD blocks for maximum AI citation coverage:

FAQPage — primary structured-data schema for Q&A content.
Article — wraps the page as a publishable article with datePublished and dateModified set to the pipeline run timestamp.
BreadcrumbList — navigation context.

Research consistently shows that stacked schemas roughly double AI citation rates versus a single FAQPage block. dateModified is set at pipeline render time (not at every HTML construction) so cached HTML stays consistent with what search engines see.
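The three stacked blocks can be built as plain dictionaries and serialized into `<script type="application/ld+json">` tags. This is a sketch assuming hypothetical function names and a single-item breadcrumb trail; the schema.org types and properties are standard, but the renderer's actual fields may differ.

```python
import json
from datetime import datetime, timezone


def build_jsonld(
    url: str, title: str, faqs: list[tuple[str, str]], run_at: datetime
) -> list[dict]:
    """Return FAQPage, Article, and BreadcrumbList blocks for one page."""
    stamp = run_at.isoformat()  # one timestamp shared by both date fields
    faq_page = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }
    article = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": title,
        "mainEntityOfPage": url,
        "datePublished": stamp,
        "dateModified": stamp,
    }
    breadcrumbs = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": 1, "name": title, "item": url}
        ],
    }
    return [faq_page, article, breadcrumbs]


def render_script_tags(blocks: list[dict]) -> str:
    """Serialize each block into its own JSON-LD script tag."""
    return "\n".join(
        '<script type="application/ld+json">'
        # Escape "</" so answer text cannot close the script element early.
        + json.dumps(block, ensure_ascii=False).replace("</", "<\\/")
        + "</script>"
        for block in blocks
    )
```

Computing `stamp` once per render, rather than on each request, matches the rule that `dateModified` reflects the pipeline run and not the moment the HTML is served.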

Configuration

The publishing block of the geo config slug controls page styling and delivery:

| Field | Type | Purpose |
| --- | --- | --- |
| `public_host` | `str \| None` | Custom delivery host. When set, canonical URL is `https://{public_host}/frequently-asked-questions`. When unset, defaults to `https://faq.userose.ai/{site_domain}/frequently-asked-questions`. |
| `title`, `subtitle`, `logo_url`, `custom_css`, `theme_preset`, color fields, font fields | various | Visual styling; see the `GeoPublishing` model. |