Observability

Rose currently uses several observability systems:

  • Google Cloud Logging for backend application logs
  • Sentry for error tracking and uptime monitoring (see Sentry Uptime Monitoring)
  • PostHog for product analytics
  • Langfuse for LLM tracing and prompt observability
  • Superlog for an experimental OpenTelemetry test

Sentry Uptime Monitoring

Sentry polls the backend deep status probe (GET /status on ixsearch_api) to detect outages in any of the widget's runtime dependencies — MongoDB, Redis, Neo4j, Supabase, Azure OpenAI, OpenAI, Cohere, Langfuse.

How the endpoint works

| Endpoint | Purpose | Auth | Probes |
| --- | --- | --- | --- |
| GET /health | Liveness, Cloud Run probe | none | none |
| GET /ping | Ultra-light readiness | none | none |
| GET /ready | Cloud Run readiness, used by load balancer | none | MongoDB ping, Redis ping |
| GET /status | Deep status probe (Sentry uptime) | X-Status-Token header | MongoDB dbStats, Redis DBSIZE, Neo4j RETURN 1, Supabase REST query, Azure OpenAI /openai/models, OpenAI /v1/models, Cohere reach, Langfuse reach |

Response shape (200 all-ok, 503 if any critical service is down):

{
  "status": "ok",
  "version": "1.407.0",
  "environment": "production",
  "public_ip": "34.x.x.x",
  "timestamp": "2026-05-26T17:43:06.796311+00:00",
  "services": {
    "mongodb": {"status": "ok", "latency_ms": 157, "detail": "collections=14"},
    "redis": {"status": "ok", "latency_ms": 70, "detail": "keys=64036"},
    "neo4j": {"status": "ok", "latency_ms": 470, "detail": null},
    "supabase": {"status": "ok", "latency_ms": 90, "detail": null},
    "azure_openai": {"status": "ok", "latency_ms": 200, "detail": null},
    "openai": {"status": "ok", "latency_ms": 1366, "detail": null},
    "cohere": {"status": "ok", "latency_ms": 168, "detail": null},
    "langfuse": {"status": "ok", "latency_ms": 135, "detail": null}
  }
}

A failure in any critical service (mongodb, redis, neo4j, supabase, azure_openai, openai) flips the global status to down and returns HTTP 503. A failure in a non-critical service (cohere, langfuse) leaves the response at HTTP 200, but the service shows up as degraded/down in the body, so Sentry can alert on partial regressions via body-match rules.
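
The critical/non-critical rule above can be sketched as a small pure function. This is an illustration only, not the code in health.py: the function names http_status_for and unhealthy_services are hypothetical, and the sketch assumes each entry under "services" carries a "status" field as in the response shape shown earlier.

```python
from typing import Mapping

# Mirrors the critical set named in the text; the real set lives in health.py.
CRITICAL_SERVICES = frozenset(
    {"mongodb", "redis", "neo4j", "supabase", "azure_openai", "openai"}
)


def http_status_for(services: Mapping[str, Mapping[str, object]]) -> int:
    """Return 503 if any critical service is not "ok", otherwise 200."""
    for name, result in services.items():
        if name in CRITICAL_SERVICES and result.get("status") != "ok":
            return 503
    return 200


def unhealthy_services(services: Mapping[str, Mapping[str, object]]) -> list[str]:
    """List every service (critical or not) whose status is not "ok"."""
    return sorted(
        name for name, result in services.items() if result.get("status") != "ok"
    )
```

With this shape, a langfuse outage keeps the HTTP code at 200 while still appearing in unhealthy_services, which is the case a body-match rule is meant to catch.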

Code: backend/packages/ixweb/ixweb/routes/health.py.

Authentication

/status is not public: it is guarded by a shared-secret header so that only Sentry can reach it.

| Header | Form |
| --- | --- |
| X-Status-Token: <secret> | preferred |
| Authorization: Bearer <secret> | accepted (for Sentry compatibility) |

A missing or wrong token returns HTTP 404 (not 401) to avoid leaking the endpoint's existence to scrapers. The comparison uses hmac.compare_digest (timing-safe).
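
A minimal sketch of the token check described above, using only the standard library. The helper names extract_token and is_authorized are hypothetical and the real handler in health.py may be structured differently; the only detail taken from the text is that both header forms are accepted and that the comparison goes through hmac.compare_digest.

```python
import hmac
from typing import Mapping, Optional


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer X-Status-Token; fall back to Authorization: Bearer <secret>."""
    lowered = {key.lower(): value for key, value in headers.items()}
    token = lowered.get("x-status-token")
    if token:
        return token
    scheme, _, value = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Timing-safe check; a caller would answer 404 whenever this is False."""
    presented = extract_token(headers)
    if presented is None:
        return False
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Returning the same 404 for a missing token and a wrong token keeps the two cases indistinguishable from the outside.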

The secret is stored in GCP Secret Manager under the name STATUS_PROBE_TOKEN in project inboundx. The backend fetches it via ixinfra.utils.secret_manager.get_secret, which checks env first then falls back to Secret Manager (cached per process via @lru_cache). Cloud Run's default service account already has roles/secretmanager.secretAccessor at project level, so no per-secret IAM binding is required.
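
The env-first lookup with per-process caching can be pictured roughly as follows. This is not the ixinfra implementation: get_secret_sketch and _fetch_from_secret_manager are hypothetical names, and the Secret Manager call is replaced by a stub so the example runs without GCP credentials.

```python
import os
from functools import lru_cache


def _fetch_from_secret_manager(name: str) -> str:
    """Hypothetical stand-in for the Secret Manager call; not the real client."""
    raise LookupError(f"secret {name!r} is not available in this sketch")


@lru_cache(maxsize=None)
def get_secret_sketch(name: str) -> str:
    """Check the environment first, then fall back to Secret Manager."""
    value = os.environ.get(name)
    if value:
        return value
    return _fetch_from_secret_manager(name)
```

Because of @lru_cache, the first successful lookup is reused for the lifetime of the process, so a later change to the environment or to the stored secret is not seen until the cache is cleared or the process restarts.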

Rotating the token

gcloud secrets versions add STATUS_PROBE_TOKEN \
  --data-file=<(openssl rand -hex 32) \
  --project=inboundx

The backend caches the token via @lru_cache for the lifetime of each worker process. Restart the Cloud Run service (or trigger a new revision) to pick up the new version. Then update the corresponding Sentry monitor's header value.

Configuring a Sentry Uptime Monitor

  1. Read the token out of Secret Manager:

gcloud secrets versions access latest \
  --secret=STATUS_PROBE_TOKEN --project=inboundx

  2. In Sentry: Alerts → Uptime Monitors → Create Monitor.

  3. Fill in:

    • URL: pick per environment
      • production: https://api.userose.ai/status
      • staging: https://api-staging.userose.ai/status
      • test: https://api-test.userose.ai/status
    • Method: GET
    • Interval: 1–5 minutes
    • Timeout: 10 seconds (each backend probe is capped at 2 s; nine in parallel + network overhead fits comfortably)
    • Headers:
      • Name: X-Status-Token
      • Value: paste the token from step 1
    • Expected status code: 200
    • Body match (optional): "status":"ok" to alert when the endpoint returns 200 but a non-critical dependency is degraded

  4. Alert routing: wire to the same channel as other backend Sentry alerts.
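
To check by hand what the monitor will see, the same request can be reproduced with the Python standard library. This is a sketch under assumptions: build_status_request and fetch_status are hypothetical helpers, the URLs are the ones listed above, and the 10-second timeout simply mirrors the monitor setting.

```python
import json
import urllib.request

# URLs copied from the monitor configuration above.
STATUS_URLS = {
    "production": "https://api.userose.ai/status",
    "staging": "https://api-staging.userose.ai/status",
    "test": "https://api-test.userose.ai/status",
}


def build_status_request(environment: str, token: str) -> urllib.request.Request:
    """Build the GET /status request with the shared-secret header."""
    return urllib.request.Request(
        STATUS_URLS[environment],
        headers={"X-Status-Token": token},
        method="GET",
    )


def fetch_status(environment: str, token: str, timeout_s: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for non-2xx answers, so a 503
    (critical service down) or 404 (bad token) surfaces as an exception.
    """
    request = build_status_request(environment, token)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Calling fetch_status("staging", token) with the value from step 1 should return the same JSON body that Sentry evaluates.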

Local testing

The local backend respects the same token. Set it in backend/.env.local:

echo 'STATUS_PROBE_TOKEN=<value-from-secret-manager>' >> backend/.env.local

Restart the backend with just dev <environment>, then preview the response in preprod-ui at http://localhost:3001/status. Paste the same token into the page (it is stored in localStorage under rose:status_probe_token) and pick the endpoint with the API endpoint selector at the top.

Adding a new probe

In backend/packages/ixweb/ixweb/routes/health.py:

  1. Add _probe_<service>() -> ServiceStatus that wraps the check in asyncio.wait_for(..., timeout=STATUS_PROBE_TIMEOUT_S) and returns ServiceStatus(status, latency_ms, detail).
  2. Add the name to the names list and the coroutine to probes inside _run_status_probes.
  3. If the service is widget-critical, add its name to the CRITICAL_SERVICES set so a failure flips HTTP to 503.
  4. Use _http_auth_probe for endpoints requiring a key, _http_reach_probe for unauthenticated reachability.

Superlog OpenTelemetry Test

Superlog is currently wired as a test for native OpenTelemetry traces, logs, and metrics. It is not the primary production observability system and should be treated as removable experiment code until the team explicitly decides to keep it.

The current Superlog test sends OTLP/HTTP data to:

https://intake.superlog.sh

The browser uses an inline sl_public_ ingest token. This token is public and write-only, similar to a Sentry DSN or PostHog project token.

What It Does

Backend ixsearch_api:

  • Initializes OpenTelemetry trace, metric, and log providers.
  • Exports traces, metrics, and logs to Superlog over OTLP/HTTP.
  • Instruments FastAPI, HTTPX, Requests, and Python logging.
  • Adds chat route metrics:
    • chat.queries
    • chat.query.duration
  • Adds trace spans around chat query handling.

Client backoffice:

  • Initializes browser OpenTelemetry providers before PostHog and Sentry.
  • Exports browser traces, metrics, and logs to Superlog over OTLP/HTTP.
  • Instruments document load and fetch calls.
  • Propagates trace headers only to the configured first-party integrations API origin from VITE_INTEGRATIONS_API_BASE_URL.
  • Registers the browser logger provider with @opentelemetry/api-logs.

Developer tooling:

  • Adds Superlog and OTel style skills under .agents/skills/.
  • Links those skills under .claude/skills/.
  • Updates skills-lock.json.

How To Remove Superlog

Remove the code and dependencies in one PR. Do not remove Langfuse, Sentry, PostHog, or Google Cloud Logging as part of this cleanup; those are separate systems.

Backend

Remove the IXSearch API bootstrap:

  • Delete backend/apps/api/search/ixsearch_api/ixsearch_api/observability.py.
  • Remove from .observability import init_observability from backend/apps/api/search/ixsearch_api/ixsearch_api/app.py.
  • Remove the init_observability(app, service_version=API_VERSION) startup call from backend/apps/api/search/ixsearch_api/ixsearch_api/app.py.

Remove route-level OTel usage from backend/apps/api/search/ixsearch_api/ixsearch_api/routes/chat.py:

  • Remove from opentelemetry import metrics, trace.
  • Remove from opentelemetry.trace import Status, StatusCode.
  • Remove the module-scope _tracer, _meter, _chat_queries, and _chat_query_duration declarations.
  • Remove the @_tracer.start_as_current_span(...) wrapper if no other tracing system replaces it.
  • Remove _chat_queries.add(...), _chat_query_duration.record(...), span.record_exception(...), and span.set_status(...) calls.

Remove backend dependencies from backend/apps/api/search/ixsearch_api/pyproject.toml:

  • opentelemetry-api
  • opentelemetry-sdk
  • opentelemetry-exporter-otlp-proto-http
  • opentelemetry-instrumentation-fastapi
  • opentelemetry-instrumentation-httpx
  • opentelemetry-instrumentation-logging
  • opentelemetry-instrumentation-requests
  • opentelemetry-semantic-conventions

Regenerate backend lockfiles after editing dependencies:

cd backend
poetry lock

Frontend

Remove the client-backoffice browser bootstrap:

  • Delete frontend/client-backoffice/src/observability.ts.
  • Remove import { initObservability } from './observability'; from frontend/client-backoffice/src/main.tsx.
  • Remove the initObservability(); call from frontend/client-backoffice/src/main.tsx.

Remove client-backoffice dependencies from frontend/client-backoffice/package.json:

  • @opentelemetry/api
  • @opentelemetry/api-logs
  • @opentelemetry/exporter-logs-otlp-http
  • @opentelemetry/exporter-metrics-otlp-http
  • @opentelemetry/exporter-trace-otlp-http
  • @opentelemetry/instrumentation
  • @opentelemetry/instrumentation-document-load
  • @opentelemetry/instrumentation-fetch
  • @opentelemetry/resources
  • @opentelemetry/sdk-logs
  • @opentelemetry/sdk-metrics
  • @opentelemetry/sdk-trace-base
  • @opentelemetry/sdk-trace-web
  • @opentelemetry/semantic-conventions

Regenerate frontend lockfiles after editing dependencies:

cd frontend
npm install
cd client-backoffice
npm install --package-lock-only --workspaces=false

Skills

Remove the Superlog-specific skills and symlinks if the experiment is fully removed:

  • .agents/skills/superlog-onboard/
  • .agents/skills/otel-*-style/
  • .agents/skills/otel-instrument-feature/
  • .claude/skills/superlog-onboard
  • .claude/skills/otel-*-style
  • .claude/skills/otel-instrument-feature
  • the matching entries in skills-lock.json

Keep unrelated observability skills that predate Superlog.

Verification After Removal

Run the checks that correspond to touched files:

cd backend
just mypy apps/api/search/ixsearch_api/ixsearch_api/app.py apps/api/search/ixsearch_api/ixsearch_api/routes/chat.py
cd ../frontend
just lint

For a full confidence check, also run:

cd backend
just test
cd ../frontend
just check

Before merging, search for remaining Superlog references:

rg "Superlog|superlog|intake\.superlog\.sh|sl_public_|initObservability|init_observability"

Expected result after complete removal: only historical docs, changelog entries, or PR notes should remain.