ADR: Backoffice Analytics — Layered Architecture in Postgres¶
Status¶
Draft
Date¶
2026-05-08
Context¶
The backoffice consumes analytics from Supabase via three parallel paths today:
- Materialized views + RPCs —
mv_widget_visitors,mv_conversation_visitors,mv_client_stats_30d, plus thevisitor_session_metricsview, exposed through ~12get_*RPC functions. - Direct table queries from TypeScript —
frontend/client-backoffice/src/services/conversations.tsandaccounts.tsre-implement aggregations in JavaScript by issuingcount: 'exact'queries and pulling rows to aggregate client-side. pg_cronschedules owned independently per MV.
This produces three concrete pains:
1. Logic lives in many places¶
The filter WHERE environment = 'production' AND person_id IS NOT NULL AND deployment_target = 'widget' and the helper extract_root_domain(lower(site_domain)) appear in:
mv_widget_visitors,mv_conversation_visitors,mv_client_stats_30dget_visitor_stats,get_visitor_stats_with_intent,get_widget_display_stats,get_engagement_rate,get_conversation_funnel,get_conversations_over_time,get_account_stats_with_intent- TypeScript:
fetchConversationStats,fetchCTAClickedStats,fetchDemoBookedStats,fetchEmailCapturedStats,fetchTopicStats,fetchAccountStats
When the definition of "production conversation" or "active client" changes, ~19 places need to be edited. This caused several recent bugs (fix_client_stats_30d_per_person, fix_page_stats_ambiguous_page_url, safe_cast_stats_valid_from_in_mv_client_stats_30d).
2. Materialized view churn¶
mv_client_stats_30d has been re-created (full DROP + CREATE) seven times between 2026-04-01 and 2026-05-08 because Postgres has no CREATE OR REPLACE MATERIALIZED VIEW. Each fix is a ~100-line migration body re-pasted with a small change. Migration history is dominated by analytics churn (~55 of 236 migrations to date are analytics-related).
3. TypeScript leak¶
Four backoffice service functions issue analytics aggregations directly against public.conversations instead of going through an RPC:
| File | Function | Behavior |
|---|---|---|
services/conversations.ts:501 |
fetchConversationStats |
count(*) two periods |
services/conversations.ts:972 |
fetchCTAClickedStats |
same + cta_clicked |
services/conversations.ts:1019 |
fetchDemoBookedStats |
same + demo_booked |
services/conversations.ts:1066 |
fetchEmailCapturedStats |
same + email_captured |
services/conversations.ts:1188 |
fetchTopicStats |
pulls full topic arrays, aggregates in JS |
services/conversations.ts:1370 |
fetchDistinctTopics |
pulls full topic arrays, dedups in JS |
services/accounts.ts:363 |
fetchAccountStats |
paginated visitor → account dedup in JS |
These four service files contain ~12 round-trips per dashboard load that should be one RPC call.
4. Refresh ordering is accidental¶
mv_widget_visitors and mv_conversation_visitors refresh on '0 */4 * * *'. mv_client_stats_30d reads from both and refreshes on '0 2 * * *'. For most of the day, the rollup is built on stale parents.
5. Non-incremental rebuilds and per-request recomputation¶
Three compounding patterns make analytics cost grow with retained event history rather than with the time window users actually look at.
Per-request recomputation on arbitrary date ranges. The per-client AnalyticsPage offers 7/14/30/90 days plus custom ranges. None of the served RPCs read from a mart — they raw-scan public.conversations and public.session_events on every range change:
get_conversations_over_time(supabase/migrations/20260225201610_add_analytics_rpc_functions.sql:9) — scan ofconversationsfiltered bystarted_at,GROUP BY date_trunc('day'|'week', started_at). Runs on every range change.get_conversation_funnel(same file) — fourCOUNT(*) FILTERpasses over the same scan: widget displayed, conversation started, email captured, demo booked.get_engagement_rate(20260226170759_…sql) —COUNT(DISTINCT person_id)four times (current + previous period × widget + conversation) over rawsession_events.
mv_client_stats_30d is consumed only by the all-clients listing page; the per-client dashboard never reads it.
Visitor timeline rebuilt in JavaScript per open. frontend/client-backoffice/src/hooks/useTimelineData.ts:51-313 issues three raw scans (visitor_page_views, session_events with an event-name allowlist, messages), then merges, sorts chronologically, and dedupes-by-minute client-side (lines 233–310). Every panel open repeats the full pipeline.
Marts themselves are full-rebuild. Per Decision §3, every mart refresh re-aggregates the full source history. mv_client_stats_30d re-scans 30 days of events nightly; mv_widget_visitors and mv_conversation_visitors re-aggregate the entire session_events table every 4 hours. pg_stat_user_tables shows mv_widget_visitors at 1.68M live rows against 37.9M cumulative inserts — ≈22 full rebuilds.
At today's volumes a nightly full rebuild completes in seconds, which justifies the original "Rebuild, not incremental" choice. But the per-request RPCs above scale linearly with retained event history forever — a user asking for "last 7 days" still triggers a scan of all conversation history. The 2026-05-12 velocity audit recorded session_events ingesting ~204k rows/day, ≈3× its lifetime average, with the table at 8.6 GB (2 GB heap + 6.6 GB of indexes whose primary purpose is making these scans tolerable). The 60-second rebuild threshold mentioned in §3 is months away, not years.
Constraints¶
- One Supabase database, shared across
test/staging/production. Per-environment isolation exists for Neo4j/Mongo/LightRAG only. Any analytics infrastructure runs against production data. - The backoffice consumes via PostgREST
.rpc(), gated bySECURITY DEFINERfunctions checkingis_backoffice_admin(). - The team is two backend engineers. We cannot afford a separate analytics warehouse, dbt Cloud, or a third orchestrator to maintain.
Decision¶
Adopt a layered analytics architecture inside the existing Supabase Postgres, with a hard rule that all backoffice metrics flow through SQL — no aggregation in TypeScript.
1. Three new schemas¶
| Schema | Purpose | Mutability |
|---|---|---|
public |
OLTP source tables (unchanged) | Owned by app |
analytics_helpers |
Pure functions (extract_root_domain, percentiles, etc.) |
Stable, rarely changes |
analytics_stg |
Staging views — filter + canonicalize one source table at a time | Recomputed on every read |
analytics |
Marts — plain tables maintained by one of three patterns: (a) full rebuild via refresh_*() function + atomic swap, (b) incremental daily-grain upsert via refresh_*(p_date) function, (c) trigger-maintained insert-only stream from public.*. Plus thin views on top of marts (e.g. client_stats_30d). |
Patterns (a) and (b) replaced by scheduled jobs; (c) by triggers |
Dependencies flow strictly downward: public → analytics_stg → analytics → public.get_* RPCs → backoffice TS.
The one deliberate exception is the trigger-maintained mart analytics.visitor_timeline_events (§9b): AFTER INSERT triggers on public.* write directly into analytics.* and apply the equivalent of the staging-view filter inline (event-name allowlist, person_id non-null, etc.) because triggers can't fire through a view. Read consumers still go through the standard analytics → RPC → TS path; only the write side bypasses analytics_stg.
2. Staging views deduplicate filter logic¶
Each analytics_stg.stg_* view is a thin CREATE VIEW over one source table that:
- Filters out rows analytics never wants (test environments, null
person_id, non-widget deployments). - Adds canonical columns (
root_domain, normalized timestamps). - Performs no aggregation, no joins to other source tables, no business logic.
Initial set:
- analytics_stg.stg_conversations (production filter, root_domain, person_id non-null)
- analytics_stg.stg_widget_impressions (from session_events where event_name = 'widget_impression')
- analytics_stg.stg_visitor_sessions
- analytics_stg.stg_visitors
The contract: "any code that wants to count conversations should FROM analytics_stg.stg_conversations, never FROM public.conversations."
3. Marts are plain tables rebuilt by functions¶
Replace materialized views with plain tables built by CREATE OR REPLACE FUNCTION bodies. Two refresh shapes are used, chosen per mart by whether the mart decomposes into daily slices:
- Full-rebuild marts — snapshot or row-per-entity semantics that don't sum across days. Each refresh writes a
__newtable and atomically swaps it in. - Incremental daily-grain marts — fact tables keyed by
(date, ...)where yesterday's row never changes again. Each refresh upserts one date's rows. See §9.
Full-rebuild pattern¶
CREATE OR REPLACE FUNCTION analytics.refresh_visitor_session_metrics()
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
CREATE TABLE analytics.visitor_session_metrics__new AS
WITH ... -- references analytics_stg.*
SELECT ...;
CREATE UNIQUE INDEX ON analytics.visitor_session_metrics__new (session_id);
-- atomic swap (millisecond ACCESS EXCLUSIVE)
ALTER TABLE IF EXISTS analytics.visitor_session_metrics RENAME TO visitor_session_metrics__old;
ALTER TABLE analytics.visitor_session_metrics__new RENAME TO visitor_session_metrics;
DROP TABLE IF EXISTS analytics.visitor_session_metrics__old;
END $$;
Initial marts:
| Mart | Pattern | Replaces | Notes |
|---|---|---|---|
analytics.visitor_session_metrics |
Full rebuild | visitor_session_metrics view |
Materialized as a table for query performance. |
analytics.widget_visitors |
Full rebuild (row-per-person lookup) | mv_widget_visitors |
Kept while non-aggregate consumers exist; deprecation candidate once §9 covers all aggregate uses. |
analytics.conversation_visitors |
Full rebuild (row-per-person lookup) | mv_conversation_visitors |
Same disposition as above; may become a view if dependents are simple. |
analytics.daily_client_stats |
Incremental daily-grain (§9) | mv_client_stats_30d (via a thin client_stats_30d view on top) |
One row per (date, client_id). Yesterday immutable; today refreshed hourly. |
analytics.visitor_timeline_events |
Trigger-maintained insert-only (§9) | three-way scan + JS merge in useTimelineData.ts |
Not refreshed; populated by AFTER INSERT triggers on the source tables. |
client_stats_30d is a view, not a mart. Once daily_client_stats exists, the 30-day rollup is a plain view (SUM(...) WHERE date >= current_date - 30 GROUP BY client_id). No separate refresh, no drift possible by construction. The all-clients listing page repoints to this view.
Why two patterns. Full rebuild buys idempotency, self-healing on logic changes, and zero drift; it is fine for marts that complete in seconds at 2 AM (current data volumes: a few thousand active clients, a few million session_events). Incremental daily-grain buys O(time-window) refresh cost — required for marts where (a) the cost scales with retained event history rather than with the user-visible window, or (b) the same mart must be refreshed sub-hourly to keep "today" current. Range-driven dashboards (§9) and the visitor timeline (§9) are both in that category.
The trade-off of full-rebuild vs REFRESH MATERIALIZED VIEW CONCURRENTLY: the swap takes a millisecond-scale ACCESS EXCLUSIVE lock instead of running concurrently. Acceptable at 2 AM UTC; not acceptable for hot-path tables. Incremental upsert avoids the lock entirely.
4. Single orchestrator with explicit dependency order¶
Two crons — one nightly that closes yesterday and rebuilds non-incremental marts, one hourly that keeps today's row of every daily-grain mart fresh:
CREATE OR REPLACE FUNCTION analytics.refresh_all_nightly()
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
-- Full-rebuild marts (§3)
PERFORM analytics.refresh_widget_visitors();
PERFORM analytics.refresh_conversation_visitors();
PERFORM analytics.refresh_visitor_session_metrics();
-- Incremental daily-grain marts (§9). Idempotent per date.
PERFORM analytics.refresh_daily_client_stats(current_date - 1);
-- analytics.visitor_timeline_events is trigger-maintained; nothing to refresh here.
PERFORM analytics.assert_invariants(); -- raises on bad data
END $$;
CREATE OR REPLACE FUNCTION analytics.refresh_all_hourly()
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
-- Keep today's row of every daily-grain mart current.
PERFORM analytics.refresh_daily_client_stats(current_date);
END $$;
SELECT cron.schedule('analytics-refresh-all-nightly', '0 2 * * *',
$$SELECT analytics.refresh_all_nightly()$$);
SELECT cron.schedule('analytics-refresh-all-hourly', '7 * * * *',
$$SELECT analytics.refresh_all_hourly()$$);
Replaces the three independent pg_cron schedules from today's setup. Refresh order is correct by construction: full-rebuild marts feed nothing downstream (the 30d rollup is a plain view over daily_client_stats, not a mart), so they can run in any order before the daily upsert. If assert_invariants() raises, the nightly run fails atomically and Sentry catches the pg_cron error. The hourly does not run invariants — it touches only today's row and a partial-day assertion would be noisy.
5. In-database tests via assert_invariants()¶
A PL/pgSQL function called at the end of refresh_all_nightly() checks structural invariants and raises on violation:
Full-rebuild marts:
- engagement_rate and conversion_rate ∈ [0, 100] for every row in client_stats_30d (the view) and visitor_session_metrics
- total_conversations >= demo_booked_count + email_only_count
- every active client has a row in client_stats_30d
- no row in visitor_session_metrics older than 25 hours
Incremental daily-grain marts (§9):
- analytics.daily_client_stats has a row for every active client for current_date - 1 (yesterday must be closed)
- no daily_client_stats.date > current_date (no rows from the future)
- per-day non-negativity: every count column ≥ 0
- monotonicity: conversations_started >= conversation_visitors per row (you can't have more distinct visitors than conversation starts on the same day per client)
- backward-compat sanity is not asserted here — it lives in §10's comparison harness, which is the authoritative gate for cutover. assert_invariants() only checks structural properties that must hold regardless of legacy parity.
Each invariant is one block. The function grows when a fix migration adds a new invariant. This is the in-DB equivalent of dbt tests, sufficient at our scale.
6. RPCs become thin wrappers¶
After the marts exist, RPCs are rewritten to:
- Check
is_backoffice_admin(). SELECT FROM analytics.<mart>(or a view on top, e.g.analytics.client_stats_30d).- Apply presentation rounding.
No CTEs, no aggregation, no extract_root_domain calls in RPC bodies. RPC files shrink from ~200 lines to ~20.
The range-driven RPCs identified in Context §5 specifically map onto the daily-grain mart from §9:
| RPC | Reads from | Body shape |
|---|---|---|
get_conversations_over_time(domain, start, end, group_by) |
analytics.daily_client_stats |
SELECT date_trunc(group_by, date), SUM(...) WHERE root_domain = ? AND date BETWEEN ? AND ? GROUP BY 1 |
get_conversation_funnel(domain, start, end) |
analytics.daily_client_stats |
SELECT SUM(widget_visitors), SUM(conversation_visitors), SUM(email_only_persons), SUM(demo_persons) WHERE root_domain = ? AND date BETWEEN ? AND ? |
get_engagement_rate(domain, start, end, compare_start, compare_end) |
analytics.daily_client_stats (+ analytics.person_day if true-unique semantics are required) |
Two SUMs with FILTER (WHERE date BETWEEN ...) clauses, one query |
Visitor timeline (no RPC today — direct .from()) |
analytics.visitor_timeline_events |
SELECT ... WHERE visitor_id = ? ORDER BY occurred_at DESC LIMIT ? |
All four bodies become single-digit-line SELECTs with no CTEs.
7. Hard boundary against TypeScript aggregation¶
Add a project rule (in frontend/client-backoffice/CLAUDE.md and a pre-commit grep) prohibiting analytics aggregation in TS service files:
Backoffice services may call
supabase.rpc(...). They may not use.from(...).select('*', { count: 'exact' }),.from(...).select(<aggregatable columns>)followed by client-side reduce, or any pagination-then-deduplicate pattern. Direct.from(...)queries are reserved for fetching specific rows by id (single conversation detail, page views for one session, sessions for one visitor).
The leaked TS aggregations are replaced with new RPCs:
| New RPC | Replaces |
|---|---|
get_conversation_counts(domain, date_start, date_end, compare_start, compare_end, excluded_visitor_ids[]) returning (metric, current, previous) for total, cta_clicked, demo_booked, email_captured. Body reads from analytics.daily_client_stats (§9). |
fetchConversationStats, fetchCTAClickedStats, fetchDemoBookedStats, fetchEmailCapturedStats (4 round-trips → 1) |
get_topic_stats(...) |
fetchTopicStats |
get_distinct_topics(domain) |
fetchDistinctTopics |
Extended get_account_stats(..., excluded_visitor_ids[], excluded_account_ids[]) |
fetchAccountStats (eliminates JS pagination dance) |
get_visitor_timeline(visitor_id, limit, before) reading from analytics.visitor_timeline_events (§9). |
The three-way scan + JS merge / sort / dedupe-by-minute block in frontend/client-backoffice/src/hooks/useTimelineData.ts:51-313. The hook collapses to a single supabase.rpc('get_visitor_timeline', …) call with cursor pagination. |
8. Naming and convention¶
- Schema names encode the layer's contract —
_stg= staging, no aggregation;analytics= marts, do not re-aggregate. - Migration filenames for analytics use the pattern
..._analytics_<verb>_<object>.sqlso analytics-only changes are scannable in the migration list. - Marts are documented with a
COMMENT ON TABLElisting their source staging views. Lineage is greppable.
9. Daily-grain incremental marts for range-driven dashboards¶
The layered architecture in §1–§8 deduplicates filter logic and fixes refresh ordering, but every mart in §3 is still a full rebuild and every range-driven RPC in Context §5 still scans raw events per request. To make refresh cost O(time window) instead of O(history), add one additional mart shape — a daily-grain fact table — for the two read patterns that dominate analytics traffic.
A daily fact is closed at midnight: yesterday's row never changes again. Refresh becomes O(1 day of events) regardless of history depth. All time-windowed aggregations (7/14/30/90/custom) collapse to SUM(...) WHERE date BETWEEN ? over a tiny table. This is the standard OLAP daily-grain pattern; it slots into analytics like any other mart without altering the layers in §1.
9a. analytics.daily_client_stats¶
One row per (date, client_id). Size budget: ~1,000 clients × 365 days ≈ 365k rows/year, <50 MB with indexes.
CREATE TABLE analytics.daily_client_stats (
date date NOT NULL,
client_id uuid NOT NULL,
root_domain text NOT NULL,
conversations_started int NOT NULL DEFAULT 0,
messages_sent int NOT NULL DEFAULT 0,
widget_visitors int NOT NULL DEFAULT 0, -- daily-distinct persons
conversation_visitors int NOT NULL DEFAULT 0,
email_only_persons int NOT NULL DEFAULT 0,
demo_persons int NOT NULL DEFAULT 0,
converted_persons int NOT NULL DEFAULT 0,
cta_clicked int NOT NULL DEFAULT 0,
form_submitted int NOT NULL DEFAULT 0,
computed_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (date, client_id)
);
Refresh is incremental and re-entrant per day. The function takes a target date and upserts every (date, client_id) row for it from analytics_stg.*:
CREATE OR REPLACE FUNCTION analytics.refresh_daily_client_stats(p_date date)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO analytics.daily_client_stats AS d
(date, client_id, root_domain,
conversations_started, messages_sent,
widget_visitors, conversation_visitors, ...)
SELECT p_date, client_id, root_domain,
COUNT(DISTINCT conversation_id),
COUNT(message_id),
COUNT(DISTINCT widget_person_id),
COUNT(DISTINCT conversation_person_id),
...
FROM analytics_stg.stg_conversations c
LEFT JOIN analytics_stg.stg_widget_impressions w USING (...)
WHERE c.started_at >= p_date
AND c.started_at < p_date + 1
GROUP BY client_id, root_domain
ON CONFLICT (date, client_id) DO UPDATE SET
conversations_started = EXCLUDED.conversations_started,
messages_sent = EXCLUDED.messages_sent,
widget_visitors = EXCLUDED.widget_visitors,
conversation_visitors = EXCLUDED.conversation_visitors,
...
computed_at = now();
END $$;
Orchestration:
- Nightly at 02:00 UTC (inside
analytics.refresh_all_nightly()from §4):PERFORM analytics.refresh_daily_client_stats(current_date - 1). Yesterday is closed. - Hourly (separate cron):
PERFORM analytics.refresh_daily_client_stats(current_date). Today's row stays fresh. Bounded scan — one day of events regardless of how much history exists. - Backfill runs the same function in a loop over the historical date range, idempotent.
The range-driven RPCs in Context §5 are rewritten against this table (same signatures, same outputs):
-- replaces get_conversations_over_time
SELECT date_trunc(p_group_by, date) AS bucket,
SUM(conversations_started) AS conversations,
SUM(conversation_visitors) AS unique_visitors,
SUM(messages_sent) AS messages
FROM analytics.daily_client_stats
WHERE root_domain = p_domain
AND date BETWEEN p_start AND p_end
GROUP BY 1 ORDER BY 1;
For 7d / 14d / 30d / 90d / arbitrary YTD: same query, only the predicate changes. Scan size bounded by (window_days × clients_in_scope) — typically <400 rows.
Cross-day uniques caveat. SUM(daily_unique_persons) over an N-day range double-counts persons who returned on multiple days. Two options depending on which metric the dashboard intends:
- Daily-active uniques (the metric most BI tools call "DAU-like"): the SUM is correct as-is. This matches what the current 30d MV exposes since it never deduped across days at finer grain anyway.
- Unique persons over a window: add a thin
analytics.person_day(date, client_id, person_id)companion, refreshed the same way.COUNT(DISTINCT person_id) WHERE date BETWEEN ?from that table. ~10× larger than the fact, still cheap.
9b. analytics.visitor_timeline_events¶
Denormalized event stream for the per-visitor timeline. Replaces the three-way raw scan + JS merge in useTimelineData.ts.
CREATE TABLE analytics.visitor_timeline_events (
id bigint GENERATED BY DEFAULT AS IDENTITY,
visitor_id uuid NOT NULL,
session_id uuid,
occurred_at timestamptz NOT NULL,
event_kind text NOT NULL, -- 'page_view' | 'session_event' | 'message' | 'cta_clicked' | ...
payload jsonb,
PRIMARY KEY (visitor_id, occurred_at, id)
);
Maintained by AFTER INSERT triggers on public.visitor_page_views, public.session_events (filtered to the same event-name allowlist the frontend already uses), and public.messages. Insert-only — no updates, no full rebuild. The dedupe-by-minute logic that lives in JS today moves into the trigger (or into a view on top of the table — but applied once, not per request).
The timeline hook collapses to:
SELECT id, session_id, occurred_at, event_kind, payload
FROM analytics.visitor_timeline_events
WHERE visitor_id = $1
ORDER BY occurred_at DESC
LIMIT 200; -- cursor-paginated
Single indexed range scan. The merge / sort / dedupe block in useTimelineData.ts:233-310 is deleted.
Backfill runs once over the union of source tables; afterwards the triggers keep it in sync.
9c. Replacing mv_client_stats_30d with a view¶
The 30-day rollup that today lives in mv_client_stats_30d is, after §9a, a plain view:
CREATE VIEW analytics.client_stats_30d AS
SELECT client_id,
root_domain,
SUM(widget_visitors) AS widget_visitors,
SUM(conversation_visitors) AS conversation_visitors,
SUM(email_only_persons) AS email_only_persons,
SUM(demo_persons) AS demo_persons,
SUM(converted_persons) AS converted_persons,
SUM(conversion_persons)::numeric
/ NULLIF(SUM(widget_visitors), 0) AS conversion_rate,
...
FROM analytics.daily_client_stats
WHERE date >= current_date - 30
GROUP BY client_id, root_domain;
No separate refresh function. No drift possible by construction. The all-clients listing page (ClientsPage) repoints to this view; its consumers see the same column set as today's mart. Per-client medians (currently computed in get_client_stats_30d via PERCENTILE_CONT) stay in the RPC body, computed over the view's output — same code, smaller input.
The same view shape extends trivially to client_stats_7d, client_stats_90d, etc. if those become useful — each is one WHERE predicate.
9d. Downstream unlock¶
Once dashboards no longer read raw session_events directly (daily_client_stats for aggregates, visitor_timeline_events for drill-down), retention rules on raw events become safe. The 2026-05-12 velocity audit found that only 0.33% of visitor_sessions ever produce a conversation (15,126 of 4.58M). A retention rule of "delete session_events and visitor_page_views for sessions with no conversation after 30 days" would shrink session_events from 8.6 GB to under 100 MB without any dashboard impact. Implementation of that rule is out of scope for this ADR but is structurally enabled by §9.
10. Parallel run and programmatic comparison¶
Because the new pipeline lives in a separate schema (analytics_stg, analytics), the legacy MVs (mv_widget_visitors, mv_conversation_visitors, mv_client_stats_30d) and legacy RPC bodies can keep running untouched while the new marts populate. Cutover is gated on a comparison harness, not on subjective inspection. This replaces the single "within 1%" invariant that earlier drafts of §5 carried.
The rollout is five phases:
Phase A — Shadow. Land schemas, helpers, staging views, full-rebuild marts (§3), daily-grain mart (§9a), and the visitor-timeline mart + triggers (§9b). Cron jobs from §4 populate them on schedule. Backoffice still reads legacy MVs and legacy RPCs. The new RPCs from §6/§7 exist with _v2 suffixes so PostgREST exposes both surfaces.
Phase B — Compare. A analytics.compare_*() family of functions diffs legacy vs new outputs for every RPC signature and every supported time window. Results are persisted to:
CREATE TABLE analytics.rollout_deltas (
run_at timestamptz NOT NULL DEFAULT now(),
rpc_name text NOT NULL,
domain text,
period text NOT NULL, -- '7d' | '30d' | '90d' | 'ytd'
metric text NOT NULL,
legacy_value numeric,
new_value numeric,
abs_delta numeric,
rel_delta numeric, -- abs_delta / NULLIF(legacy_value, 0)
PRIMARY KEY (run_at, rpc_name, domain, period, metric)
);
One row per (rpc, domain, window, metric). Scale: ~1,000 active domains × ~10 RPCs × 4 windows × ~5 metrics ≈ 200k rows/run, partitioned-or-trimmed to last 30 runs (~6M rows, <500 MB). The comparison runs inside analytics.refresh_all_nightly() after both legacy and new marts have refreshed, so the same nightly window also produces the diff record.
-- shape of one compare_* function (per RPC)
INSERT INTO analytics.rollout_deltas (rpc_name, domain, period, metric, legacy_value, new_value, abs_delta, rel_delta)
SELECT 'get_conversation_funnel', d.domain, '30d', m.metric,
l.v AS legacy_value, n.v AS new_value,
abs(l.v - n.v),
abs(l.v - n.v) / NULLIF(l.v, 0)
FROM active_domains d
CROSS JOIN unnest(ARRAY['widget_displayed','conversation_started','email_captured','demo_booked']) AS m(metric)
LEFT JOIN LATERAL (SELECT … FROM get_conversation_funnel(d.domain, now()-'30d'::interval, now())) l(v) ON true
LEFT JOIN LATERAL (SELECT … FROM get_conversation_funnel_v2(d.domain, now()-'30d'::interval, now())) n(v) ON true;
Comparisons are scoped to the supported call shape only — i.e. the default values of p_excluded_person_ids, p_environment='production', p_min_session_duration=10. Non-default calls of those parameters are explicitly out of scope for the daily-grain wrappers (see HIGH-1 in the review notes); if a caller needs them, they keep reading the legacy path or get a separate raw-event RPC. The harness documents which call shapes it covers.
Phase C — Gate. Cutover is blocked until rollout_deltas shows:
- Coverage: every active domain has at least 14 consecutive nightly comparison rows for every (rpc, period, metric) triple.
- Parity: ≥99% of triples have
abs(rel_delta) < 0.01(1%). - No outliers: no triple has
abs(rel_delta) > 0.05(5%) in the last 7 runs. - No NULL drift: counts of legacy=NULL/new=non-NULL (and vice versa) per metric are within tolerance documented per RPC (some metrics are NULL by design when sample size is zero).
The gate is one SQL query, runnable on demand and as a CI check before the cutover migration is merged.
Phase D — Cutover. Backoffice service files repoint from rpc('get_*') to rpc('get_*_v2') (or, equivalently, swap the function bodies behind the original names and drop the _v2 aliases). Legacy MVs and their cron jobs stay running for one more week. The comparison harness keeps writing to rollout_deltas, so a post-cutover regression surfaces as a fresh row, not as a customer complaint.
Phase E — Drop. One migration removes the legacy MVs, their cron jobs, the _v2 aliases (if used), and the compare_* functions. rollout_deltas is retained as a historical artifact (small, append-only) or dropped — either is fine.
Cost during overlap. The new daily-grain refresh runs in seconds against current volumes. The legacy MV refresh stays on its existing 4-hour cadence. Running both for ~2 weeks doubles analytics compute on the nightly window — bounded and acceptable. After Phase E the steady state is strictly cheaper than today (one orchestrator, incremental daily upsert, no MV churn).
Why this section, not just an invariant. The earlier draft folded backward-compat into assert_invariants(). That conflates two distinct concerns: (a) structural correctness of the new marts, which must hold forever, and (b) parity with legacy, which only matters during rollout and would be deleted once the legacy is gone. Splitting them keeps assert_invariants() stable and makes the cutover gate inspectable as data (SELECT … FROM rollout_deltas) rather than as a binary refresh-failed signal.
Consequences¶
Positive¶
- Filter logic lives in one place per concept. Editing
analytics_stg.stg_conversationspropagates to every downstream consumer on the next read. - Mart logic changes are clean diffs. A fix to a full-rebuild mart is one
CREATE OR REPLACE FUNCTION analytics.refresh_<mart>()migration (e.g.refresh_visitor_session_metrics), not a 100-line MV re-paste. A fix to a daily-grain mart is oneCREATE OR REPLACE FUNCTION analytics.refresh_<mart>(p_date date)migration; backfill replays the function over historical dates. - Refresh order is correct by construction. No more "rollup built on stale parents" windows.
- Bad data fails the refresh loudly via
assert_invariants()instead of silently poisoning dashboards. - TypeScript shrinks measurably. ~12 leaked aggregation paths in
services/collapse to RPC wrappers.fetchConversationStats+ 3 siblings drop from 4 round-trips to 1. fetchTopicStatsstops shipping the entiretopicscolumn to the browser for busy domains.- No new infrastructure. Same Supabase, same
pg_cron, same RPC mechanism. No external scheduler, no warehouse, no dbt. - Cheap path forward. When freshness needs increase, swap the orchestrator to Cloud Scheduler invoking a dbt-like job — the staging/marts shape already maps onto dbt models if we ever adopt it.
- Range-driven dashboards stop scanning history per request. §9 collapses
get_conversations_over_time,get_conversation_funnel,get_engagement_ratetoSUM(...) WHERE date BETWEEN ?over ≤400 rows. Cost is bounded by the user-visible window, not by retained event volume. - Visitor timeline goes from three raw scans + JS merge to one indexed range scan. The JS aggregation in
useTimelineData.ts:233-310is deleted. - Raw event retention becomes safe to enable. Dashboards no longer depend on
session_events/visitor_page_views, so a retention rule keyed on "session had no conversation" can shrink the largest tables by >99% (see §9d) without breaking analytics.
Negative¶
- Loss of
REFRESH MATERIALIZED VIEW CONCURRENTLY. The atomic swap takes a millisecondACCESS EXCLUSIVElock. Acceptable at 2 AM UTC against backoffice-only consumers; would not be acceptable for tables read by hot-path widget traffic. - One-time migration cost. ~8 working days of focused refactoring (one engineer) plus the ergonomic cost of co-existing old and new paths during steps 3–5.
- Two-schema mental model for new contributors.
analytics_stgvsanalyticsis clearer than the current sprawl, but it is one more concept to teach. Mitigated by asupabase/CLAUDE.mdsection and the schema-name convention. - Incremental refresh applies only to daily-grain marts. §9 introduces
daily_client_statsandvisitor_timeline_eventsas incremental marts. Other marts in §3 (client_stats_30dsnapshot,widget_visitors,conversation_visitorsif kept as row-per-entity) retain the full-rebuild pattern. A future non-daily mart whose semantics don't decompose into daily slices (e.g. a 90-day rolling distinct-persons cube) would need a bespokeappend_*_since(ts)function and watermark tracking — out of scope here. - Single Supabase remains the analytics compute target. A botched
refresh_all_nightly()could lock backoffice tables briefly. Risk is bounded by the 2 AM schedule and atomic swap; not eliminated. The hourlyrefresh_all_hourly()only upserts today's row ofdaily_client_stats(no swap, no exclusive lock), so a botched run there has narrower blast radius.
Neutral¶
- Migration count temporarily increases during the rollout: schemas, helpers, staging views, full-rebuild marts (§3), daily-grain mart + backfill (§9a), visitor-timeline mart + triggers + backfill (§9b), RPC repoint, MV cleanup. ~7–8 separate PRs. After cleanup, analytics-related migrations should drop to roughly one per logic change instead of one per logic change plus MV recreation overhead.
pg_cronremains the scheduler. This ADR does not introduce Cloud Scheduler or any external orchestrator. If we later need sub-hourly freshness, a follow-up ADR will move the schedule outside Postgres.- Generated TypeScript types change.
cd frontend && just gen-typesafter each step;mv_*types disappear in the cleanup step.
Alternatives Considered¶
Alternative A — Adopt dbt against the same Supabase¶
Run dbt nightly via a Cloud Run Job, materializing models in an analytics schema, with dbt tests replacing in-DB asserts.
Rejected for now: dbt earns its complexity at ≥20 models, with a separate warehouse, or with non-engineer SQL authors. We have ~5 marts (3 full-rebuild + daily_client_stats + visitor_timeline_events) plus a small set of views on top, one shared Supabase, and only engineers in the SQL. Adopting dbt now means two ownership models for SQL (migrations vs dbt), two schedulers, a new deploy surface, and a new dependency to keep current — for benefits we can replicate with PL/pgSQL functions and a 40-line assert_invariants(). Re-evaluate when any of: model count > 20, a non-engineer needs to write SQL, we move analytics off Supabase to BigQuery/Snowflake, or stats bugs reach customers because pure-SQL tests are insufficient.
Alternative B — Keep MVs, fix only the TypeScript leak¶
Leave the MV/RPC layer as-is and only push the four leaked aggregations into RPCs.
Rejected: it solves the smallest of the three pains. The MV recreation churn and refresh-ordering bugs would persist. Filter logic would still be duplicated across 6+ MV/RPC bodies. The TS leak is a symptom, not the disease.
Alternative C — Adopt the bronze/silver/gold (medallion) pattern¶
Rename layers to bronze / silver / gold.
Rejected: the medallion pattern earns its complexity in lakehouse environments with raw immutable ingest, schema-on-read, replay requirements, and decoupled storage/compute. Our "bronze" would be our own OLTP tables — they are not raw ingest, not replayable, and not append-only. Adopting the vocabulary would mislead future readers about what each tier is. Two layers (stg + marts) is the right fidelity for our scale; renaming them with a third tier we will not actually populate is cargo cult.
Alternative D — Replicate analytics to a separate Postgres / BigQuery¶
CDC from Supabase to a dedicated analytics warehouse where dbt or rebuild functions can run freely.
Rejected: at our data volumes the OLTP DB easily handles a 2 AM rebuild plus the hourly daily-grain upserts. Replication adds CDC infrastructure, freshness lag, an additional credential surface, and ongoing operational overhead. The backoffice still requires the marts in Postgres for RPC consumption, so a warehouse would be a third hop, not a replacement. Revisit when (a) refresh_all_nightly() materially impacts daytime workload, (b) the hourly refresh_all_hourly() runs longer than its period, or (c) we want session_events-grain analytics that no longer fit in OLTP storage even after the §9d retention rule.
Alternative E — Pure-view marts (no materialization)¶
Replace MVs with regular views, recomputed on every read directly over public.* tables.
Rejected: mv_client_stats_30d aggregates across millions of session events and joins three CTE chains. Computing it per page-load would add seconds to every backoffice dashboard render and put the cost on user clicks instead of a single nightly run. The materialization-as-table pattern preserves view-like edit ergonomics while keeping reads fast.
Note: this is not in conflict with §9c, where analytics.client_stats_30d becomes a view. The §9c view aggregates over the daily-grain mart (~30 rows per client), not over raw events. The Alternative-E rejection applies specifically to views over public.* source tables.