Observability¶
Rose currently uses several observability systems:
- Google Cloud Logging for backend application logs
- Sentry for error tracking and uptime monitoring (see Sentry Uptime Monitoring)
- PostHog for product analytics
- Langfuse for LLM tracing and prompt observability
- Superlog for an experimental OpenTelemetry test
Sentry Uptime Monitoring¶
Sentry polls the backend deep status probe (GET /status on ixsearch_api) to
detect outages in any of the widget's runtime dependencies — MongoDB, Redis,
Neo4j, Supabase, Azure OpenAI, OpenAI, Cohere, Langfuse.
How the endpoint works¶
| Endpoint | Purpose | Auth | Probes |
|---|---|---|---|
| GET /health | Liveness (Cloud Run probe) | none | none |
| GET /ping | Ultra-light readiness | none | none |
| GET /ready | Cloud Run readiness, used by load balancer | none | MongoDB ping, Redis ping |
| GET /status | Deep status probe (Sentry uptime) | X-Status-Token header | MongoDB dbStats, Redis DBSIZE, Neo4j RETURN 1, Supabase REST query, Azure OpenAI /openai/models, OpenAI /v1/models, Cohere reach, Langfuse reach |
Response shape (200 all-ok, 503 if any critical service is down):
{
  "status": "ok",
  "version": "1.407.0",
  "environment": "production",
  "public_ip": "34.x.x.x",
  "timestamp": "2026-05-26T17:43:06.796311+00:00",
  "services": {
    "mongodb": {"status": "ok", "latency_ms": 157, "detail": "collections=14"},
    "redis": {"status": "ok", "latency_ms": 70, "detail": "keys=64036"},
    "neo4j": {"status": "ok", "latency_ms": 470, "detail": null},
    "supabase": {"status": "ok", "latency_ms": 90, "detail": null},
    "azure_openai": {"status": "ok", "latency_ms": 200, "detail": null},
    "openai": {"status": "ok", "latency_ms": 1366, "detail": null},
    "cohere": {"status": "ok", "latency_ms": 168, "detail": null},
    "langfuse": {"status": "ok", "latency_ms": 135, "detail": null}
  }
}
If any critical service (mongodb, redis, neo4j, supabase, azure_openai, openai)
is down, the global status flips to down and the endpoint returns HTTP 503.
Failures in non-critical services (cohere, langfuse) still return HTTP 200 but
appear as degraded or down in the body, so Sentry can alert on partial
regressions via body-match rules.
Code: backend/packages/ixweb/ixweb/routes/health.py.
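The rollup rule described above can be sketched as follows. This is a minimal illustration, not the code in health.py: the function name `rollup` is hypothetical, and the top-level value `degraded` for non-critical failures (and for a critical service that is degraded but not down) is an assumption. Only CRITICAL_SERVICES and the 200/503 split come from this page.

```python
from typing import Dict, Tuple

# Service names as listed on this page.
CRITICAL_SERVICES = {"mongodb", "redis", "neo4j", "supabase", "azure_openai", "openai"}


def rollup(services: Dict[str, str]) -> Tuple[str, int]:
    """Map per-service statuses to (global status, HTTP status code)."""
    critical_down = any(
        status == "down"
        for name, status in services.items()
        if name in CRITICAL_SERVICES
    )
    if critical_down:
        return "down", 503
    # Assumption: any other non-ok status is reported as "degraded" with HTTP 200.
    if any(status != "ok" for status in services.values()):
        return "degraded", 200
    return "ok", 200
```

With this rule, a Cohere outage keeps the HTTP status at 200 and changes only the body, which is why a body-match rule is needed to catch it.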
Authentication¶
/status is not public: it is guarded by a shared-secret header so that only
Sentry can call it.
| Header | Form |
|---|---|
| X-Status-Token: <secret> | preferred |
| Authorization: Bearer <secret> | accepted (for Sentry compatibility) |
Missing or wrong token → HTTP 404 (not 401) to avoid leaking endpoint
existence to scrapers. Comparison uses hmac.compare_digest (timing-safe).
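A minimal sketch of the check described above, assuming headers arrive as a plain mapping. The helper names `extract_token` and `gate_status` are hypothetical and do not come from health.py; only the two header forms, the 404 response, and hmac.compare_digest are taken from this page.

```python
import hmac
from typing import Mapping, Optional


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the presented token, preferring X-Status-Token over Bearer."""
    lowered = {key.lower(): value for key, value in headers.items()}
    token = lowered.get("x-status-token")
    if token:
        return token
    auth = lowered.get("authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return None


def gate_status(headers: Mapping[str, str], expected: str) -> int:
    """Return 200 when the token matches, otherwise 404 (never 401)."""
    presented = extract_token(headers)
    # Refuse empty values so an unset secret can never match an absent header.
    if not presented or not expected:
        return 404
    # compare_digest runs in time independent of where the inputs differ.
    if hmac.compare_digest(presented.encode(), expected.encode()):
        return 200
    return 404
```

The explicit empty-value guard matters: without it, a missing secret and a missing header would compare equal.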
The secret is stored in GCP Secret Manager under the name
STATUS_PROBE_TOKEN in project inboundx. The backend fetches it via
ixinfra.utils.secret_manager.get_secret, which checks env first then falls
back to Secret Manager (cached per process via @lru_cache). Cloud Run's
default service account already has roles/secretmanager.secretAccessor at
project level, so no per-secret IAM binding is required.
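The lookup order can be illustrated with a small sketch. It is not the implementation of ixinfra.utils.secret_manager.get_secret: `_fetch_from_secret_manager` is a hypothetical stand-in for the real Secret Manager call, and the sketch only shows the env-first order and the per-process cache.

```python
import os
from functools import lru_cache


def _fetch_from_secret_manager(name: str) -> str:
    """Hypothetical stand-in for the real Secret Manager client call."""
    raise LookupError(f"secret {name!r} is not available in this sketch")


@lru_cache(maxsize=None)
def get_secret(name: str) -> str:
    """Check the environment first, then fall back to Secret Manager."""
    value = os.environ.get(name)
    if value:
        return value
    return _fetch_from_secret_manager(name)
```

Because the result is cached per process, a later change to the underlying value is not seen until the cache is cleared or the process is replaced.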
Rotating the token¶
gcloud secrets versions add STATUS_PROBE_TOKEN \
    --data-file=<(openssl rand -hex 32) \
    --project=inboundx
The backend caches the token via @lru_cache for the lifetime of each worker
process. Restart the Cloud Run service (or trigger a new revision) to pick up
the new version. Then update the corresponding Sentry monitor's header value.
Configuring a Sentry Uptime Monitor¶
1. Read the token out of Secret Manager (for example with gcloud secrets versions access latest --secret=STATUS_PROBE_TOKEN --project=inboundx).
2. In Sentry: Alerts → Uptime Monitors → Create Monitor.
3. Fill in:
   - URL: pick per environment
     - production: https://api.userose.ai/status
     - staging: https://api-staging.userose.ai/status
     - test: https://api-test.userose.ai/status
   - Method: GET
   - Interval: 1–5 minutes
   - Timeout: 10 seconds (each backend probe is capped at 2 s and the probes run in parallel, so the total plus network overhead fits comfortably)
   - Headers:
     - Name: X-Status-Token
     - Value: paste the token from step 1
   - Expected status code: 200
4. Body match (optional): "status":"ok" to alert when the endpoint returns 200 but a non-critical dependency is degraded.
5. Alert routing: wire to the same channel as other backend Sentry alerts.
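When checking a monitor by hand, it can help to parse the response body rather than match a substring, since every per-service entry also carries its own status field. The sketch below is an illustration only: `summarize_status_body` is a hypothetical helper, and it assumes the body has the shape shown in the response example on this page.

```python
import json
from typing import List, Tuple


def summarize_status_body(body: str) -> Tuple[str, List[str]]:
    """Return (top-level status, sorted names of services that are not ok)."""
    payload = json.loads(body)
    services = payload.get("services", {})
    not_ok = sorted(
        name for name, entry in services.items() if entry.get("status") != "ok"
    )
    return payload.get("status", "unknown"), not_ok
```

Applied to a saved /status response, this returns the global status together with the list of dependencies to investigate.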
Local testing¶
The local backend respects the same token. Set it in backend/.env.local:
Restart just dev <environment>, then preview the response in preprod-ui at
http://localhost:3001/status — paste the same
token into the page (stored in localStorage under rose:status_probe_token).
Pick the endpoint with the API endpoint selector at the top.
Adding a new probe¶
In backend/packages/ixweb/ixweb/routes/health.py:
- Add _probe_<service>() -> ServiceStatus that wraps the check in asyncio.wait_for(..., timeout=STATUS_PROBE_TIMEOUT_S) and returns ServiceStatus(status, latency_ms, detail).
- Add the name to the names list and the coroutine to probes inside _run_status_probes.
- If the service is widget-critical, add its name to the CRITICAL_SERVICES set so a failure flips HTTP to 503.
- Use _http_auth_probe for endpoints requiring a key, _http_reach_probe for unauthenticated reachability.
Superlog OpenTelemetry Test¶
Superlog is currently wired as a test for native OpenTelemetry traces, logs, and metrics. It is not the primary production observability system and should be treated as removable experiment code until the team explicitly decides to keep it.
The current Superlog test sends OTLP/HTTP data to:
The browser uses an inline sl_public_ ingest token. This token is public and
write-only, similar to a Sentry DSN or PostHog project token.
What It Does¶
Backend ixsearch_api:
- Initializes OpenTelemetry trace, metric, and log providers.
- Exports traces, metrics, and logs to Superlog over OTLP/HTTP.
- Instruments FastAPI, HTTPX, Requests, and Python logging.
- Adds chat route metrics:
  - chat.queries
  - chat.query.duration
- Adds trace spans around chat query handling.
Client backoffice:
- Initializes browser OpenTelemetry providers before PostHog and Sentry.
- Exports browser traces, metrics, and logs to Superlog over OTLP/HTTP.
- Instruments document load and fetch calls.
- Propagates trace headers only to the configured first-party integrations API origin from VITE_INTEGRATIONS_API_BASE_URL.
- Registers the browser logger provider with @opentelemetry/api-logs.
Developer tooling:
- Adds Superlog and OTel style skills under .agents/skills/.
- Links those skills under .claude/skills/.
- Updates skills-lock.json.
How To Remove Superlog¶
Remove the code and dependencies in one PR. Do not remove Langfuse, Sentry, PostHog, or Google Cloud Logging as part of this cleanup; those are separate systems.
Backend¶
Remove the IXSearch API bootstrap:
- Delete backend/apps/api/search/ixsearch_api/ixsearch_api/observability.py.
- Remove from .observability import init_observability from backend/apps/api/search/ixsearch_api/ixsearch_api/app.py.
- Remove the init_observability(app, service_version=API_VERSION) startup call from backend/apps/api/search/ixsearch_api/ixsearch_api/app.py.
Remove route-level OTel usage from
backend/apps/api/search/ixsearch_api/ixsearch_api/routes/chat.py:
- Remove from opentelemetry import metrics, trace.
- Remove from opentelemetry.trace import Status, StatusCode.
- Remove the module-scope _tracer, _meter, _chat_queries, and _chat_query_duration declarations.
- Remove the @_tracer.start_as_current_span(...) wrapper if no other tracing system replaces it.
- Remove _chat_queries.add(...), _chat_query_duration.record(...), span.record_exception(...), and span.set_status(...) calls.
Remove backend dependencies from
backend/apps/api/search/ixsearch_api/pyproject.toml:
- opentelemetry-api
- opentelemetry-sdk
- opentelemetry-exporter-otlp-proto-http
- opentelemetry-instrumentation-fastapi
- opentelemetry-instrumentation-httpx
- opentelemetry-instrumentation-logging
- opentelemetry-instrumentation-requests
- opentelemetry-semantic-conventions
Regenerate backend lockfiles after editing dependencies:
Frontend¶
Remove the client-backoffice browser bootstrap:
- Delete frontend/client-backoffice/src/observability.ts.
- Remove import { initObservability } from './observability'; from frontend/client-backoffice/src/main.tsx.
- Remove the initObservability(); call from frontend/client-backoffice/src/main.tsx.
Remove client-backoffice dependencies from
frontend/client-backoffice/package.json:
- @opentelemetry/api
- @opentelemetry/api-logs
- @opentelemetry/exporter-logs-otlp-http
- @opentelemetry/exporter-metrics-otlp-http
- @opentelemetry/exporter-trace-otlp-http
- @opentelemetry/instrumentation
- @opentelemetry/instrumentation-document-load
- @opentelemetry/instrumentation-fetch
- @opentelemetry/resources
- @opentelemetry/sdk-logs
- @opentelemetry/sdk-metrics
- @opentelemetry/sdk-trace-base
- @opentelemetry/sdk-trace-web
- @opentelemetry/semantic-conventions
Regenerate frontend lockfiles after editing dependencies:
Skills¶
Remove the Superlog-specific skills and symlinks if the experiment is fully removed:
- .agents/skills/superlog-onboard/
- .agents/skills/otel-*-style/
- .agents/skills/otel-instrument-feature/
- .claude/skills/superlog-onboard
- .claude/skills/otel-*-style
- .claude/skills/otel-instrument-feature
- the matching entries in skills-lock.json
Keep unrelated observability skills that predate Superlog.
Verification After Removal¶
Run the checks that correspond to touched files:
cd backend
just mypy apps/api/search/ixsearch_api/ixsearch_api/app.py apps/api/search/ixsearch_api/ixsearch_api/routes/chat.py
For a full confidence check, also run:
Before merging, search for remaining Superlog references:
Expected result after complete removal: only historical docs, changelog entries, or PR notes should remain.
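As one way to run the reference search from a script, the sketch below walks a checkout and reports case-insensitive matches. It is only an illustration: `find_references` and the SKIP_DIRS ignore list are assumptions, not tooling that exists in the repository.

```python
import os
from typing import Iterator, Tuple

# Assumed ignore list; adjust to the repository layout.
SKIP_DIRS = {".git", "node_modules", ".venv", "__pycache__"}


def find_references(
    root: str, needle: str = "superlog"
) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line number, line) for case-insensitive matches under root."""
    needle = needle.lower()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line.lower():
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
```

Calling `list(find_references("."))` from the repository root gives the remaining hits to review against the expected result above.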