GEO Agent
The GEO (Generative Engine Optimization) Agent transforms real visitor questions from the Website Agent into published FAQ pages that get cited by AI search engines (ChatGPT, Perplexity, Google AI Overviews) and improve the Website Agent's own answers.
Pipeline
- Collect visitor questions from Supabase messages × conversations (last 30 days, production only).
- Filter noise (support requests, navigation queries, bots, spam).
- Group by page_url and pick top-N pages by question volume.
- Cluster questions per page (DBSCAN, cosine distance) + a site-wide global pool.
- Generate FAQ pairs grounded in the client's knowledge base (LightRAG hybrid retrieval).
- Store in geo.faqs (separate from config.knowledge_faqs).
- Render static HTML + JSON-LD (FAQPage + Article + BreadcrumbList).
- Publish HTML + sitemap + metadata to Cloudflare R2.
- Notify search engines of the change.
See docs/plans/2026-04-10-ix-2453-geo-content-agent-design.md for the full architecture.
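The filter and group steps of the pipeline can be sketched as follows. This is a minimal illustration rather than the agent's actual code: the Question record, the NOISE_MARKERS keyword list, and the top_pages helper are hypothetical names, and the real noise filter presumably relies on richer signals than keyword matching.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical noise markers; the real filter is not specified in this document.
NOISE_MARKERS = ("reset my password", "where is the login", "unsubscribe")


@dataclass(frozen=True)
class Question:
    page_url: str
    text: str


def is_noise(question: Question) -> bool:
    """Return True for questions that look like support or navigation requests."""
    lowered = question.text.lower()
    return any(marker in lowered for marker in NOISE_MARKERS)


def top_pages(questions: list[Question], n: int) -> dict[str, list[str]]:
    """Group non-noise questions by page_url and keep the n busiest pages."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for question in questions:
        if not is_noise(question):
            grouped[question.page_url].append(question.text)
    # Rank pages by how many usable questions they received.
    volume = Counter({url: len(texts) for url, texts in grouped.items()})
    return {url: grouped[url] for url, _ in volume.most_common(n)}
```

Each value in the returned mapping is the list of question texts that would then be handed to the per-page clustering step.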
Publishing & Delivery
Rendered FAQ HTML lives in Cloudflare R2 (rose-geo-pages bucket) and is served by the cloudflare-geo-pages Worker at faq.userose.ai. Clients can optionally point a CNAME faq.client.com → faq.userose.ai; the Worker resolves the host to a tenant via per-tenant meta.json stored alongside the HTML.
Canonical URL strategy
Each tenant has exactly one canonical URL to avoid SEO duplicate-content penalties. The canonical is baked into the static HTML at render time:
| Tenant config | Canonical URL |
|---|---|
| No custom domain | https://faq.userose.ai/{site_domain}/frequently-asked-questions |
| Custom CNAME configured (publishing.public_host) | https://{public_host}/frequently-asked-questions |
The Worker enforces canonicalization: any request whose Host does not match the tenant's public_host is 301-redirected to the canonical URL. The HTML body itself is identical at both hosts (same R2 object), but only one URL is indexable.
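The canonical URL rule and the redirect decision can be sketched as below. The function names are hypothetical, the Worker itself is not written in Python, and the case-insensitive host comparison is an assumption; the sketch only mirrors the table above, treating faq.userose.ai as the canonical host when no public_host is configured.

```python
from typing import Optional

DEFAULT_HOST = "faq.userose.ai"
FAQ_PATH = "frequently-asked-questions"


def canonical_url(site_domain: str, public_host: Optional[str]) -> str:
    """Build the single canonical FAQ URL for a tenant."""
    if public_host:
        return f"https://{public_host}/{FAQ_PATH}"
    return f"https://{DEFAULT_HOST}/{site_domain}/{FAQ_PATH}"


def redirect_target(
    request_host: str, site_domain: str, public_host: Optional[str]
) -> Optional[str]:
    """Return the URL to 301 to, or None when the request host is already canonical."""
    canonical_host = public_host or DEFAULT_HOST
    # Assumption: hosts are compared case-insensitively.
    if request_host.lower() == canonical_host.lower():
        return None
    return canonical_url(site_domain, public_host)
```

For a tenant with public_host set to faq.acme.com, a request arriving on faq.userose.ai yields the custom-domain URL as the redirect target, while a request on faq.acme.com yields None and is served directly.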
Search Engine Discoverability
After each pipeline run, the Worker and the pipeline's IndexNow ping together make the new content discoverable to the major search engines that feed AI search.
| Engine | Mechanism | Indexing latency | Why it matters |
|---|---|---|---|
| Google | sitemap.xml with <lastmod> + JSON-LD dateModified | Hours → days | Largest crawl footprint. Powers Google AI Overviews. |
| Bing | IndexNow protocol (real-time push) | Minutes | Powers ChatGPT search — direct path into AI citations. |
| Yandex | IndexNow protocol | Minutes | Same protocol as Bing. |
| Perplexity | Inherits Bing + Google indexes | Inherits both | No dedicated submission API. |
| All bots | robots.txt + per-tenant sitemap.xml | Continuous | Allows discovery without prior knowledge of the URL. |
Sitemap
Each tenant gets a single-URL sitemap.xml containing the canonical FAQ URL and <lastmod> = pipeline run timestamp. The Worker serves it at:
- https://faq.userose.ai/{site_domain}/sitemap.xml
- https://faq.client.com/sitemap.xml (custom domains)
The <lastmod> value is the only freshness signal Google reliably honors now that it has removed the sitemap ping endpoint (January 2024).
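A single-URL sitemap of this shape could be generated as in the sketch below. The build_sitemap name is hypothetical and the function assumes a timezone-aware run timestamp; the namespace and element names come from the sitemaps.org protocol.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical: str, run_at: datetime) -> str:
    """Render a one-entry sitemap whose <lastmod> is the pipeline run time in UTC."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = canonical
    # W3C datetime format with an explicit UTC designator.
    lastmod = run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Passing the same timestamp that is written into the JSON-LD dateModified keeps the two freshness signals consistent.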
IndexNow (Bing / Yandex / ChatGPT)
IndexNow is a push protocol supported by Bing, Yandex, Naver, Seznam, and Yep. A single POST to https://api.indexnow.org/indexnow notifies all participating engines that a URL changed; they recrawl within minutes instead of days.
Why it's worth wiring up: Bing's index is the primary source for ChatGPT search results. Faster Bing indexing → faster AI citations for newly generated FAQs.
Domain ownership verification: IndexNow requires a static key file (e.g. /{indexnow_key}.txt) served at the same domain as the submitted URL, containing the key string as its body. The Worker serves this file from configuration.
Failure mode: The pipeline's IndexNow ping is best-effort — failures are logged but do not fail the pipeline. The sitemap remains the canonical fallback.
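The best-effort ping can be sketched as follows. The payload fields (host, key, keyLocation, urlList) follow the public IndexNow protocol; the function names, the logger name, and the injectable send parameter are hypothetical choices made so the sketch can be exercised without network access.

```python
import json
import logging
import urllib.request
from typing import Callable
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
logger = logging.getLogger("geo.indexnow")  # hypothetical logger name


def build_payload(url: str, key: str) -> dict:
    """Build the IndexNow JSON body for one changed URL."""
    host = urlsplit(url).netloc
    return {
        "host": host,
        "key": key,
        # The key file must be served from the same host as the submitted URL.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": [url],
    }


def post_json(endpoint: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def ping_indexnow(
    url: str, key: str, send: Callable[[str, dict], int] = post_json
) -> bool:
    """Notify IndexNow; log and swallow any failure so the pipeline keeps going."""
    try:
        status = send(INDEXNOW_ENDPOINT, build_payload(url, key))
    except Exception as exc:  # best-effort: never fail the pipeline on a ping error
        logger.warning("IndexNow ping failed for %s: %s", url, exc)
        return False
    return status in (200, 202)
```

Returning a boolean instead of raising lets the caller record the outcome while the sitemap continues to act as the fallback.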
Google: no equivalent push API
Google deprecated google.com/ping?sitemap=... in January 2024 and offers no general-purpose real-time indexing API:
- Indexing API: documented for JobPosting and BroadcastEvent pages only. Using it for other content is undocumented and may be ignored or penalized, so it is not used here.
- Search Console URL Inspection API: ~2000 requests/day per property, too restrictive for automated use.
The supported path for Google is sitemap + accurate <lastmod>, plus a one-time Search Console verification per domain after Worker deployment.
Response Headers
The Worker sets the following headers on FAQ HTML responses:
Content-Type: text/html; charset=utf-8
Cache-Control: public, max-age=86400
X-Robots-Tag: index, follow
Edge KV caches the R2 object for 60s so a fresh pipeline run propagates quickly; the downstream Cache-Control is the longer 24h browser/CDN TTL.
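The two cache tiers can be illustrated with a small in-memory stand-in. This is only a sketch of the 60-second edge TTL described above: the EdgeCache class, its fetch callback, and the injectable clock are hypothetical, and the real Worker uses Cloudflare's storage rather than a Python dictionary.

```python
import time
from typing import Callable, Dict, Tuple

EDGE_TTL_SECONDS = 60

# Headers sent downstream with every FAQ HTML response.
RESPONSE_HEADERS = {
    "Content-Type": "text/html; charset=utf-8",
    "Cache-Control": "public, max-age=86400",
    "X-Robots-Tag": "index, follow",
}


class EdgeCache:
    """Serve a cached copy of an object until it is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl: float = EDGE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Entry is missing or stale: re-read from the origin store.
        body = self._fetch(key)
        self._entries[key] = (now, body)
        return body
```

With this shape, a page republished by the pipeline is picked up by the edge at most 60 seconds later, independently of the 24h max-age advertised to browsers.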
Schema Stacking
Every rendered page contains three JSON-LD blocks for maximum AI citation coverage:
- FAQPage: primary structured-data schema for Q&A content.
- Article: wraps the page as a publishable article with datePublished and dateModified set to the pipeline run timestamp.
- BreadcrumbList: navigation context.
Research consistently shows that stacked schemas roughly double AI citation rates versus a single FAQPage block. dateModified is set at pipeline render time (not at every HTML construction) so cached HTML stays consistent with what search engines see.
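The three stacked blocks could be assembled as in the sketch below. The schema.org types and properties are standard; the function names and the exact set of Article properties are assumptions, since the renderer's real output is not shown here.

```python
import json
from datetime import datetime, timezone


def build_jsonld_blocks(
    title: str,
    canonical: str,
    faqs: list[tuple[str, str]],
    run_at: datetime,
    breadcrumbs: list[tuple[str, str]],
) -> list[dict]:
    """Return FAQPage, Article and BreadcrumbList blocks sharing one timestamp."""
    stamp = run_at.isoformat()  # computed once per pipeline run
    faq_page = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }
    article = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": title,
        "mainEntityOfPage": canonical,
        "datePublished": stamp,
        "dateModified": stamp,
    }
    breadcrumb_list = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": index, "name": name, "item": url}
            for index, (name, url) in enumerate(breadcrumbs, start=1)
        ],
    }
    return [faq_page, article, breadcrumb_list]


def render_script_tags(blocks: list[dict]) -> str:
    """Serialize each block into its own JSON-LD script tag."""
    return "\n".join(
        # Escape "</" so answer text cannot close the script element early.
        '<script type="application/ld+json">'
        + json.dumps(block).replace("</", "<\\/")
        + "</script>"
        for block in blocks
    )
```

Because the timestamp is passed in rather than read from the clock, re-rendering the same run produces byte-identical markup.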
Configuration
The publishing block of the geo config slug controls page styling and delivery:
| Field | Type | Purpose |
|---|---|---|
| public_host | str \| None | Custom delivery host. When set, canonical URL is https://{public_host}/frequently-asked-questions. When unset, defaults to https://faq.userose.ai/{site_domain}/frequently-asked-questions. |
| title, subtitle, logo_url, custom_css, theme_preset, color fields, font fields | various | Visual styling; see the GeoPublishing model. |