# ADR: Agentic Development Pipeline

## Status

Draft

## Date

2026-03-04

## Context

As Rose grows in scope (Website Agent, Nurturing Agent, and future agents), the team needs to scale development velocity without scaling headcount linearly. The current workflow is manual: a developer picks a Linear ticket, writes code, runs tests, creates a PR, and waits for review. This is effective but creates bottlenecks when multiple features are in flight.
Inspired by Sam Hessanower's agentic dev pipeline (AI Tinkerers / OneShot, Aug 2024), we want to explore autonomous AI agents that can take a Linear ticket from spec to PR — writing code, running tests, self-reviewing, and surfacing the result for human judgment.
## Current State

What we already have that supports this:

| Capability | Current State |
|---|---|
| Linear integration | MCP server configured, linearis CLI, Linear webhook Cloud Function |
| Claude Code skills | 30+ skills (solve-linear-ticket, test-backend, test-frontend, mypy, etc.) |
| Claude Code settings | Permissions, hooks, MCP servers (Linear, Playwright, Figma) |
| CI/CD | GitHub Actions for backend/frontend/docs deploy, PR checks, security scans |
| Testing | pytest (unit/integration/smoke/evaluation markers), Vitest, Playwright |
| Infrastructure | GCP Cloud Run, Terraform, Docker multi-stage builds |
| Supabase | DB, auth, storage, migrations, edge functions |
| CLAUDE.md | Comprehensive codebase instructions for AI agents |
| Pre-commit hooks | Ruff, gitleaks, smoke tests on commit |
What's missing:
| Gap | Impact |
|---|---|
| No autonomous agent loop | Agent can't self-correct without human intervention |
| No isolated DB per agent run | Agents can't safely test DB-dependent features |
| No containerized agent runtime | Can't run agents in parallel, safely, reproducibly |
| No self-review rubric | No structured quality gate before human review |
| No Linear → agent trigger | Human must manually start the agent |
## Decision

Build one autonomous agent factory that runs different jobs (coding tasks, client onboarding, chatbot issue triage) on GCE spot VMs with Claude Sonnet 4.6 (cheap, fast, good enough). Orchestration is handled by GCP Cloud Workflows (serverless, native, pennies per run). Triggers are flexible: Linear webhooks, GitHub events, manual CLI, or scheduled cron. Human supervision happens via Slack notifications plus Langfuse observability.

## Architecture Overview

The factory is a single container that dispatches to different jobs based on the trigger payload.

### Design Principles

| Principle | Decision |
|---|---|
| Model | Claude Sonnet 4.6 (--model sonnet) — fast, cheap, good enough for most tasks. Opus only for self-review scoring |
| Auth | Max subscription via ANTHROPIC_AUTH_TOKEN — $0/run, token refreshed from GCP Secret Manager |
| Orchestration | GCP Cloud Workflows — serverless, ~$0.001/execution, native GCP |
| Triggers | Flexible: Linear, GitHub, cron, manual CLI. Not locked to one source |
| Notifications | Slack webhooks for progress/completion/failure |
| Supervision | Langfuse traces for every agent run, cost tracking per run |
| Compute | GCE e2-medium spot — ~$0.01/hr, no time limits, full browser support |
## The Factory

One container, one entrypoint, different jobs. The `JOB` parameter determines which skill to run and with what params.

Entrypoint:

```bash
#!/usr/bin/env bash
# factory/run.sh — single entrypoint, dispatches by job type
set -euo pipefail

JOB="${JOB:?required}"

case "$JOB" in
  solve-linear-ticket)
    TICKET_ID="${TICKET_ID:?required}"
    BRANCH="${BRANCH:?required}"
    claude -p "Use the solve-linear-ticket skill to implement ${TICKET_ID}.
      Follow CLAUDE.md strictly. Run tests after each change.
      Create commits for each sub-issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions

    # Self-review (Opus for better judgment)
    claude -p "Use the requesting-code-review skill to review all changes on this branch.
      Score each criterion 0-5. Output JSON." \
      --model opus --output-format json --max-turns 20 \
      --dangerously-skip-permissions
    ;;

  onboard-client)
    CLIENT_DOMAIN="${CLIENT_DOMAIN:?required}"
    MODE="${MODE:-full}"  # full | eval-only | update-skills
    claude -p "Use the onboard-client skill for ${CLIENT_DOMAIN}.
      Environment: staging. Use --local-prompt for all rose-chat and rose-eval calls.
      Target accuracy: 0.70. Max 3 eval iterations.
      Mode: ${MODE}." \
      --model sonnet --output-format json --max-turns 150 \
      --dangerously-skip-permissions
    ;;

  solve-chatbot-issues)
    claude -p "Use the solve-chatbot-issues skill.
      Check Sentry and Langfuse for recent chatbot failures.
      Triage, fix, and open PRs for each issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions
    ;;

  *)
    echo "Unknown job: $JOB"
    exit 1
    ;;
esac
```
Jobs:

| Job | Skill | Trigger | What It Does |
|---|---|---|---|
| `solve-linear-ticket` | `solve-linear-ticket` | Linear `ai-ready` (non-Onboarding projects), manual CLI | Ticket → code → tests → self-review → PR |
| `onboard-client` | `onboard-client` | Linear `ai-ready` (Onboarding project), manual CLI, KB update | Config + skills + eval dataset + accuracy tuning |
| `solve-chatbot-issues` | `solve-chatbot-issues` | Daily cron | Triage chatbot issues from Sentry/Langfuse, open fix PRs |

Self-review loop (for `solve-linear-ticket`):

```text
repeat (max 3 iterations):
    Opus scores each criterion (0-5)
    if any score < 4:
        Sonnet fixes the issues
        re-run tests
    else:
        create/update PR, tag 'agent-ready'
        notify Slack
        exit
```
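The gating step of that loop can be sketched as a small shell helper for `factory/self-review/review-loop.sh`. This is a hypothetical sketch: the flat score-object format and the `review_passes` name are assumptions, not the implemented rubric.

```shell
# Hypothetical gating helper for factory/self-review/review-loop.sh.
# Assumes Opus outputs a flat JSON object of criterion scores, e.g.
# {"correctness": 5, "tests": 4, "style": 3}.
review_passes() {
  local json="$1" min
  # Pull out every numeric score and take the minimum.
  min=$(printf '%s' "$json" | grep -o ':[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*' | sort -n | head -n 1)
  [ -n "$min" ] && [ "$min" -ge 4 ]
}

if review_passes '{"correctness": 5, "tests": 4, "style": 4}'; then
  echo "PASS: create/update PR, tag agent-ready"
else
  echo "FAIL: send back to Sonnet for fixes"
fi
```

In the real loop `jq` would be a more robust parser; the plain-shell version just keeps the sketch dependency-free.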
Invocation examples:

```bash
# Coding task from Linear ticket
gcloud workflows run factory --data='{"job":"solve-linear-ticket","ticket_id":"abc123","ticket_identifier":"IX-123"}'

# Client onboarding
gcloud workflows run factory --data='{"job":"onboard-client","client_domain":"acme.com","mode":"full"}'

# Daily chatbot triage (triggered by Cloud Scheduler)
gcloud workflows run factory --data='{"job":"solve-chatbot-issues"}'
```
## Local Development

A CLI wrapper (`rose-factory`) makes it easy to run jobs locally in Docker:

```bash
#!/usr/bin/env bash
# rose-factory — run factory jobs locally
set -euo pipefail

usage() {
  echo "Usage: rose-factory <job> [options]"
  echo ""
  echo "Jobs:"
  echo "  solve-linear-ticket --ticket IX-123"
  echo "  onboard-client --domain acme.com [--mode full|eval-only|update-skills]"
  echo "  solve-chatbot-issues"
  echo ""
  echo "Options:"
  echo "  --no-docker    Run directly without Docker (requires Claude CLI installed)"
  exit 1
}

[[ $# -lt 1 ]] && usage
JOB="$1"; shift

NO_DOCKER=false
EXTRA_ENV=()
while [[ $# -gt 0 ]]; do
  case "$1" in
    --ticket)    EXTRA_ENV+=(-e "TICKET_ID=$2" -e "BRANCH=feature/$2"); shift 2 ;;
    --domain)    EXTRA_ENV+=(-e "CLIENT_DOMAIN=$2"); shift 2 ;;
    --mode)      EXTRA_ENV+=(-e "MODE=$2"); shift 2 ;;
    --no-docker) NO_DOCKER=true; shift ;;
    *)           echo "Unknown option: $1"; usage ;;
  esac
done

if [ "$NO_DOCKER" = true ]; then
  # Run directly — extract env vars from the EXTRA_ENV flags
  for e in "${EXTRA_ENV[@]}"; do
    [[ "$e" == "-e" ]] && continue
    export "$e"
  done
  export JOB
  exec ./factory/run.sh
fi

docker compose -f factory/docker-compose.yml run --rm \
  -e "JOB=$JOB" "${EXTRA_ENV[@]}" \
  factory
```
Examples:

```bash
# Run a Linear ticket locally in Docker
rose-factory solve-linear-ticket --ticket IX-123

# Onboard a client without Docker
rose-factory onboard-client --domain acme.com --mode full --no-docker

# Chatbot triage
rose-factory solve-chatbot-issues
```
## Infrastructure

### Container

```dockerfile
# factory/Dockerfile
FROM node:22-bookworm
RUN npm install -g @anthropic-ai/claude-code
RUN npx playwright install --with-deps chromium
RUN npm install -g supabase
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
```
```yaml
# factory/docker-compose.yml
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent
```
### Directory Structure

```text
factory/
├── Dockerfile                # Container image (Claude Code + Playwright + Supabase CLI)
├── docker-compose.yml        # Single factory service
├── run.sh                    # Job dispatcher entrypoint
├── self-review/
│   ├── rubric.md             # Scoring criteria
│   └── review-loop.sh        # Self-review orchestrator
├── orchestration/
│   ├── factory.yaml          # Cloud Workflow definition
│   └── notify.sh             # Slack notification helper
└── seeds/
    ├── base.sql              # Empty schema, minimal seed
    ├── auth-users.sql        # Admin + regular user with tokens
    ├── content-populated.sql # Realistic content volume
    └── multi-tenant.sql      # 2+ orgs with separate data
```
### Authentication

Max subscription auth via `ANTHROPIC_AUTH_TOKEN` — $0/run (included in the subscription).

```bash
# Generate a long-lived token and store it in GCP Secret Manager
./scripts/secrets/refresh-claude-token.sh
```

The script runs `claude setup-token` (opens a browser), then prompts you to paste the token and stores it in Secret Manager. The token is injected into Docker containers at runtime via an env var:

```bash
docker run \
  -e ANTHROPIC_AUTH_TOKEN="$(gcloud secrets versions access latest --secret=CLAUDE_AUTH_TOKEN)" \
  rose-agent claude -p "..."
```

Note: `claude -p` inside Docker is the official CLI binary, so subscription OAuth works. The token expires periodically — refresh it by re-running the script.

All secrets are injected via GCP Secret Manager:

| Secret | Used By |
|---|---|
| `CLAUDE_AUTH_TOKEN` | Max subscription auth ($0/run) |
| `LINEAR_API_KEY` | Linear MCP server |
| `GITHUB_TOKEN` | Git push, PR creation |
| `SLACK_WEBHOOK_URL` | Notifications |
| `LANGFUSE_SECRET_KEY` | LLM observability |

### GCE VM (one-time setup)

```bash
gcloud compute instances create rose-agent-runner \
  --zone=europe-west9-a \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --boot-disk-size=50GB \
  --image-family=cos-stable \
  --image-project=cos-cloud
```
## Orchestration — GCP Cloud Workflows

Why Cloud Workflows (not Airflow):

| | Cloud Workflows | Cloud Composer (Airflow) |
|---|---|---|
| Cost | ~$0.001/execution | ~$250/mo minimum (GKE + Cloud SQL) |
| Setup | YAML file, deploy in 1 min | 30+ min environment creation |
| Long-running | Up to 1 year timeout | Unlimited |
| Maintenance | Zero (serverless) | Cluster upgrades, DAG syncing |
| Fit | Small team, simple flows | Large team, complex DAGs |

Cloud Composer is overkill for our use case. Cloud Workflows is serverless, costs pennies, and handles the linear flow (trigger → start VM → run container → notify) perfectly.
Factory workflow (single workflow, job dispatched via params):

```yaml
# factory/orchestration/factory.yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          - job_label: ${args.job + " — " + default(args.ticket_identifier, default(args.client_domain, ""))}
    - notify_start:
        call: http.post
        args:
          url: ${SLACK_WEBHOOK_URL}
          body:
            text: ${"🏭 Factory started: " + job_label}
    - run_job:
        try:
          steps:
            - start_vm:
                call: googleapis.compute.v1.instances.start
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - run_agent:
                call: googleapis.compute.v1.instances.runCommand
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
                  command: >
                    cd /app && git fetch origin
                    && docker compose run --rm factory
                  env: ${args}
                result: agent_result
            - notify_done:
                call: http.post
                args:
                  url: ${SLACK_WEBHOOK_URL}
                  body:
                    text: ${"✅ Factory done: " + job_label}
            - stop_vm:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
        except:
          as: e
          steps:
            - notify_failure:
                call: http.post
                args:
                  url: ${SLACK_WEBHOOK_URL}
                  body:
                    text: ${"🚨 Factory FAILED: " + job_label + " — Error: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}
```
## Triggers

Triggers are decoupled from jobs — any trigger can invoke any job via Cloud Workflows.

| Trigger | Source | Job | How |
|---|---|---|---|
| Linear `ai-ready` label | Linear webhook → Cloud Function | Routed by project | Cloud Function maps project → job (see below) |
| GitHub `repository_dispatch` | GitHub webhook | `solve-linear-ticket` | Optional — if we want GitHub issues to trigger runs |
| Manual CLI | Developer's terminal | Any | `gcloud workflows run factory --data='{"job":"...","ticket_id":"IX-123"}'` |
| Daily cron | Cloud Scheduler | `solve-chatbot-issues` | Daily triage of chatbot issues from Sentry/Langfuse |
| Weekly cron | Cloud Scheduler | `onboard-client` | Scheduled eval-only runs for all active clients |
| KB update | Document upload webhook | `onboard-client` | `update-skills` mode for the changed client |

Linear `ai-ready` routing: one label, one webhook — the Cloud Function routes by Linear project:

| Linear Project | Job |
|---|---|
| "Onboarding" | `onboard-client` (extracts client domain from ticket) |
| Any other project | `solve-linear-ticket` (default) |

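The routing rule amounts to a one-case mapping. A sketch in shell (the real Cloud Function is not shell; the `route_job` name is hypothetical):

```shell
# Hypothetical sketch of the project → job routing rule.
route_job() {
  case "$1" in
    "Onboarding") echo "onboard-client" ;;
    *)            echo "solve-linear-ticket" ;;
  esac
}

route_job "Onboarding"       # -> onboard-client
route_job "Website Agent"    # -> solve-linear-ticket
```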
Why triggers are flexible (not locked to Linear or GitHub):

- The same `ai-ready` label works across all projects — routing is automatic
- Sometimes you just want to kick off a run from the CLI without creating a ticket
- Cron-based jobs (chatbot triage, re-evaluation) don't map to any issue tracker event
- GitHub integration is nice-to-have (auto-trigger from issue labels) but not required

## Notifications & Supervision

### Slack Notifications

Every factory run posts to a `#agent-factory` Slack channel:

| Event | Message |
|---|---|
| Run started | Factory type, ticket/client, branch |
| Progress update | Every 15 min during long runs (heartbeat) |
| Self-review scores | Per-criterion scores, pass/fail |
| Run completed | PR link or accuracy report |
| Run failed | Error summary, link to logs |
Implemented via a simple `notify.sh` helper that posts to a Slack webhook.
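A minimal sketch of what that helper might look like — event names and message formats are assumptions, and a real helper should JSON-escape the text (e.g. with `jq`) before posting:

```shell
# Hypothetical sketch of factory/orchestration/notify.sh as a function.
notify() {
  local event="$1" subject="$2"; shift 2
  local detail="$*" text
  case "$event" in
    started)   text="Factory started: ${subject}" ;;
    completed) text="Factory done: ${subject} ${detail}" ;;
    failed)    text="Factory FAILED: ${subject} ${detail}" ;;
    heartbeat) text="Heartbeat: ${subject} ${detail}" ;;
    *)         text="${event}: ${subject} ${detail}" ;;
  esac
  # Build the Slack payload (no escaping here — a real helper should escape)
  printf '{"text": "%s"}' "$text"
  # A real helper would POST instead of printing:
  # curl -s -X POST -H 'Content-Type: application/json' \
  #   -d "$(printf '{"text": "%s"}' "$text")" "$SLACK_WEBHOOK_URL"
}

notify started "IX-123"
```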
### Langfuse Observability

Every `claude -p` call is traced in Langfuse (already configured in the project). This gives us:
- Token usage per run — track cost per ticket / per client
- Latency breakdown — how long each agent turn takes
- Prompt versions — which skill versions were used
- Failure analysis — inspect the full agent conversation when something goes wrong
### Cost Tracking
| Component | Cost |
|---|---|
| Claude Sonnet 4.6 (Max subscription) | $0/run (included in subscription) |
| GCE e2-medium spot | ~$0.01/hr (~$0.003–0.04/run) |
| Cloud Workflows | ~$0.001/execution |
| Slack webhook | Free |
| Langfuse | Free tier (50k traces/mo) |
| Total per run | ~$0.01–0.05 compute only |
The Max subscription makes this extremely cheap — the main cost is the VM compute time. Still set `--max-turns` to prevent runaway loops that waste subscription quota.
### Supervision Dashboard
No custom dashboard needed initially. Use existing tools:
- Langfuse → LLM traces, cost, latency
- Cloud Logging → VM and container logs
- Slack → real-time notifications
- GitHub → PR status, CI checks
- Linear → ticket status updates
If we need a consolidated view later, build a simple Cloud Monitoring dashboard.
## Supabase Database Branching

Goal: each `solve-linear-ticket` run gets its own isolated database.

Use Supabase Branching to create ephemeral DB branches per agent run.

Seed strategy:
| Seed File | Contents | When to Use |
|---|---|---|
| `base.sql` | Schema + minimal reference data | Default for all runs |
| `auth-users.sql` | Admin + regular users with mock tokens | Features touching auth |
| `content-populated.sql` | Realistic volume for UI testing | Frontend-heavy features |
| `multi-tenant.sql` | Multiple orgs with separate data | Multi-tenant features |

The `onboard-client` and `solve-chatbot-issues` jobs don't need DB branching — they use the staging environment directly.
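Seed selection could be as simple as matching ticket labels to a file from `factory/seeds/`. The label names below are hypothetical — whatever convention the Linear workspace uses:

```shell
# Hypothetical label → seed-file mapping for solve-linear-ticket runs.
select_seed() {
  local labels="$1"
  case "$labels" in
    *multi-tenant*)  echo "multi-tenant.sql" ;;
    *auth*)          echo "auth-users.sql" ;;
    *frontend*|*ui*) echo "content-populated.sql" ;;
    *)               echo "base.sql" ;;
  esac
}

select_seed "backend,auth"   # -> auth-users.sql
select_seed "bugfix"         # -> base.sql
```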
## Harness Engineering Best Practices
Based on Anthropic's harness engineering guidance and the broader cloud agent ecosystem, these practices apply to all jobs.
### CLI vs SDK — When to Use Which

| | CLI (`claude -p`) | Agent SDK (`@anthropic-ai/claude-agent-sdk`) |
|---|---|---|
| When | Scripts, CI/CD, existing skills, agent teams | Custom agent loops, programmatic control, non-coding agents |
| Pros | Zero code, reads `.mcp.json` + skills automatically, `--resume` for multi-turn | Full event streaming, tool approval callbacks, session management |
| Cons | Less control over the agent loop | Each user = separate subprocess, locked to Anthropic |
| Our choice | CLI for all jobs — skills and MCP servers work automatically | Consider the SDK later for non-coding agents (legal review, data analysis) |

The SDK spawns the same CLI process under the hood. For our use case, the CLI is simpler and gets all skills/MCP servers for free.
Harness quality matters: Anthropic's benchmarks show the same Opus model scoring 78% with Claude Code's harness vs 42% with a naive harness on the CORE benchmark — a 36-point gap from harness engineering alone.
### Safety Layers (Defense in Depth)

`--dangerously-skip-permissions` is required in headless mode (no human to click "Allow"), but we mitigate risk through layered defenses:

- Layer 1: Container isolation (microVM or Docker with no host socket)
- Layer 2: Network allow/deny lists (only approved endpoints)
- Layer 3: `--allowedTools` whitelist (restrict available tools)
- Layer 4: PreToolUse hooks (block specific dangerous patterns)
- Layer 5: Filesystem boundaries (agent can only write to the workspace)
- Layer 6: Cost circuit breakers (kill the run after $X or N errors)

Concrete implementation:

```bash
# Safer than a naked --dangerously-skip-permissions:
# Note: --allowedTools uses permission rule syntax (space before * for prefix match)
claude -p "..." \
  --model sonnet \
  --max-turns 100 \
  --output-format stream-json \
  --allowedTools "Read,Write,Edit,Glob,Grep,Bash(git *),Bash(npm run *),Bash(just *),mcp__playwright *" \
  --dangerously-skip-permissions
```
Alternatives to `--dangerously-skip-permissions` (from safest to most permissive):

| Method | Best for |
|---|---|
| Native sandbox (Seatbelt/bubblewrap) | Day-to-day dev |
| `--allowedTools` per-session whitelist | CI/CD with specific needs |
| `settings.json` allowlists | Team-wide policies |
| PreToolUse hooks | Custom business logic |
| `--permission-prompt-tool` (MCP) | Enterprise audit trails |
| `--permission-mode acceptEdits` | Moderate autonomy |
| Docker sandbox + `--dangerously-skip-permissions` | Fully automated (our choice) |

PreToolUse hooks as guardrails (hooks fire even in bypass mode). In `.claude/settings.json`:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "python3 -c \"import sys,json; cmd=json.load(sys.stdin).get('tool_input',{}).get('command',''); sys.exit(2 if any(w in cmd for w in ['rm -rf /','docker.sock','curl | sh','wget | bash']) else 0)\""
        }]
      }
    ]
  }
}
```

Hook commands receive the tool call as JSON on stdin; exiting with code 2 blocks the tool call.
### Docker Container Hardening

```dockerfile
FROM node:22-bookworm

# Claude Code CLI
RUN npm install -g @anthropic-ai/claude-code

# Playwright with Chromium
RUN npx playwright install --with-deps chromium

# Supabase CLI for DB branching
RUN npm install -g supabase

# Python tooling
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Non-root user for safety
RUN useradd -m agent
USER agent
WORKDIR /app
COPY --chown=agent:agent . .
```
Critical rules:

| Rule | Why |
|---|---|
| Never mount `/var/run/docker.sock` | Agent could escape the sandbox via the Docker API |
| Use `--network=none` or an allow-list | Prevent data exfiltration via prompt injection |
| Non-root user inside the container | Limit blast radius of filesystem operations |
| No host volume mounts for secrets | Use env vars from Secret Manager instead |
| Set resource limits | `--memory=4g --cpus=2` to prevent resource exhaustion |

```yaml
# factory/docker-compose.yml (hardened)
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
    networks:
      - agent-net
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp

networks:
  agent-net:
    driver: bridge
    # Allow-list outbound traffic via iptables rules on the host
```
### Output Format & Error Handling

Use `stream-json` for real-time monitoring, `json` for structured results:

```bash
# Stream events in real-time (for heartbeat/progress monitoring)
claude -p "..." --output-format stream-json 2>&1 | while IFS= read -r line; do
  TYPE=$(echo "$line" | jq -r '.type // empty')
  case "$TYPE" in
    "assistant")
      # Agent is producing output — heartbeat is alive
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID"
      ;;
    "result")
      # Final result — extract cost and session info
      COST=$(echo "$line" | jq -r '.cost_usd')
      TURNS=$(echo "$line" | jq -r '.num_turns')
      ./factory/orchestration/notify.sh completed "$TICKET_ID" "$COST" "$TURNS"
      ;;
    "error")
      ./factory/orchestration/notify.sh failed "$TICKET_ID" "$(echo "$line" | jq -r '.error')"
      ;;
  esac
done

# Check claude's exit code, not the while loop's — hence PIPESTATUS
EXIT_CODE="${PIPESTATUS[0]}"
if [ "$EXIT_CODE" -ne 0 ]; then
  ./factory/orchestration/notify.sh failed "$TICKET_ID" "Exit code: $EXIT_CODE"
fi
```
### Cost Circuit Breakers

Prevent runaway agent spending:

```bash
# factory/run.sh — with circuit breakers
MAX_COST_USD=10.00
MAX_DURATION_SECONDS=7200   # 2 hours
# MAX_CONSECUTIVE_ERRORS=5  # error-count breaker omitted from this sketch
STREAM_LOG=$(mktemp)
STARTED_AT=$(date +%s)

# Run claude in the background so we can kill it by PID — `kill %1`
# inside a pipeline subshell does not work (no job control in scripts).
claude -p "..." --output-format stream-json > "$STREAM_LOG" 2>&1 &
CLAUDE_PID=$!

while kill -0 "$CLAUDE_PID" 2>/dev/null; do
  sleep 10

  # Check duration
  ELAPSED=$(( $(date +%s) - STARTED_AT ))
  if [ "$ELAPSED" -gt "$MAX_DURATION_SECONDS" ]; then
    echo "CIRCUIT BREAKER: Duration exceeded ${MAX_DURATION_SECONDS}s"
    kill "$CLAUDE_PID"
    exit 1
  fi

  # Check cost (highest cost_usd seen in the stream-json events so far)
  COST=$(jq -rs '[.[].cost_usd // 0] | max' "$STREAM_LOG" 2>/dev/null || echo 0)
  if awk -v c="$COST" -v m="$MAX_COST_USD" 'BEGIN { exit !(c > m) }'; then
    echo "CIRCUIT BREAKER: Cost exceeded \$${MAX_COST_USD}"
    kill "$CLAUDE_PID"
    exit 1
  fi
done
```
### Heartbeat & Liveness Monitoring

```bash
# factory/orchestration/heartbeat.sh — run as a background process.
# The main loop must update LAST_ACTIVITY (e.g. LAST_ACTIVITY=$(date +%s))
# whenever the agent emits output.
HEARTBEAT_INTERVAL=900   # 15 minutes
HEARTBEAT_TIMEOUT=1800   # 30 minutes — agent is stuck if no output for this long
LAST_ACTIVITY=$(date +%s)

monitor_heartbeat() {
  while true; do
    sleep "$HEARTBEAT_INTERVAL"
    IDLE_TIME=$(( $(date +%s) - LAST_ACTIVITY ))
    if [ "$IDLE_TIME" -gt "$HEARTBEAT_TIMEOUT" ]; then
      ./factory/orchestration/notify.sh stuck "$TICKET_ID" "No activity for ${IDLE_TIME}s"
      # Don't kill — a human decides
    else
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID" "Running (${IDLE_TIME}s since last activity)"
    fi
  done
}
```
### Long-Running Agent Sessions

Per Anthropic's guidance, context windows are limited and complex projects can't complete in a single session. Claude Code handles this via compaction — automatically summarizing context when approaching limits. Key practices:

- Let compaction work: don't set `--max-turns` too low. Sonnet can do 100+ turns with compaction
- Checkpoint commits: configure the agent to commit after each sub-task, creating rollback points
- Session continuity: if a session fails mid-way, re-run with `--resume` to continue from the last checkpoint
- CLAUDE.md as persistent memory: the agent re-reads CLAUDE.md after compaction, so project instructions survive context resets

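Session continuity can be sketched as: persist the `session_id` from each run's JSON result, and pass it back via `--resume` on retry. The file path and the `sed`-based extraction below are illustrative; a sample result event stands in for real claude output.

```shell
# Persist the session id from a run's JSON result so a failed run
# can be resumed later.
RESULT_FILE=$(mktemp)
printf '%s' '{"type":"result","session_id":"abc-123","num_turns":42}' > "$RESULT_FILE"

# Extract session_id (jq would normally be used; sed keeps this dependency-free)
SESSION_ID=$(sed -n 's/.*"session_id":"\([^"]*\)".*/\1/p' "$RESULT_FILE")
echo "$SESSION_ID"    # -> abc-123

# On retry (illustrative, not executed here):
# claude -p "Continue from the last checkpoint" --resume "$SESSION_ID"
```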
### Cloud Workflows Error Handling

```yaml
# factory/orchestration/factory.yaml — with proper error handling
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          - job_label: ${args.job + " — " + default(args.ticket_identifier, default(args.client_domain, ""))}
          - project_id: ${sys.get_env("GCP_PROJECT_ID")}
    - notify_start:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory started - " + job_label}
    - start_vm:
        try:
          call: googleapis.compute.v1.instances.start
          args:
            project: ${project_id}
            zone: europe-west9-a
            instance: rose-agent-runner
        retry:
          predicate: ${default_retry_predicate}
          max_retries: 3
          backoff:
            initial_delay: 2
            max_delay: 30
            multiplier: 2
    - wait_for_vm:
        call: sys.sleep
        args:
          seconds: 30
    - run_agent:
        try:
          call: http.post
          args:
            url: ${"https://compute.googleapis.com/compute/v1/projects/" + project_id + "/zones/europe-west9-a/instances/rose-agent-runner/runCommand"}
            auth:
              type: OAuth2
            body:
              command: >
                cd /app && git fetch origin
                && docker compose run --rm factory
              env: ${args}
            timeout: 7200   # 2-hour timeout
          result: agent_result
        except:
          as: e
          steps:
            - notify_error:
                call: http.post
                args:
                  url: ${sys.get_env("SLACK_WEBHOOK_URL")}
                  body:
                    text: ${"Factory FAILED - " + job_label + " - Error: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}
    - notify_done:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory done - " + job_label}
    - stop_vm:
        call: googleapis.compute.v1.instances.stop
        args:
          project: ${project_id}
          zone: europe-west9-a
          instance: rose-agent-runner
    - secrets_access:
        # Secrets are injected into the container via .env.agent,
        # generated at VM boot from Secret Manager
        assign:
          - note: "Secrets loaded from GCP Secret Manager at container start"
```
### Multi-Agent Coordination

When running multiple factory instances in parallel:

| Concern | Solution |
|---|---|
| Git conflicts | Each agent gets its own feature branch — never share branches |
| Shared config files | Agents work on isolated worktrees (git worktree add) |
| DB conflicts | Each solve-linear-ticket run gets its own Supabase branch |
| Resource contention | One agent per VM, or multiple VMs with distinct names |
| Merge conflicts on shared files | Detect hotspot files (routes, configs) — flag for human resolution |
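The worktree isolation above can be sketched with plain git — one worktree and feature branch per agent run, so parallel agents never share a working copy. The throwaway repo and branch names below are for illustration only:

```shell
# Sketch: one git worktree + branch per agent run.
set -e
REPO=$(mktemp -d)
git -C "$REPO" init -q
git -C "$REPO" -c user.email=agent@example.com -c user.name=agent \
  commit -q --allow-empty -m "init"

# Each agent run gets its own worktree and feature branch
WT="${REPO}-IX-123"
git -C "$REPO" worktree add "$WT" -b feature/IX-123 >/dev/null 2>&1
git -C "$WT" rev-parse --abbrev-ref HEAD    # -> feature/IX-123
```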
### Rollback Strategy

```text
Agent creates checkpoint commits after each sub-task
        ↓
Self-review fails → git reset to last good checkpoint
        ↓
Human review fails → git revert the entire branch
        ↓
Production issue → branch was never merged to main (always PR-based)
```

The key insight: agents should never push directly to `develop` or `main` — always via a feature branch + PR. The PR is the human approval gate.
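The checkpoint-and-reset step can be demonstrated end-to-end with git alone. The `checkpoint-1` tag naming is hypothetical; a throwaway repo stands in for the real workspace:

```shell
# Sketch: checkpoint commit + reset, as the self-review loop would do.
set -e
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email agent@example.com
git config user.name agent

# Sub-task 1 passes tests → checkpoint
echo "good implementation" > feature.py
git add feature.py
git commit -qm "sub-task 1"
git tag checkpoint-1

# Sub-task 2 fails self-review → roll back to the last good checkpoint
echo "bad implementation" > feature.py
git commit -qam "sub-task 2"
git reset --hard -q checkpoint-1

cat feature.py    # -> good implementation
```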
## Leveraging Existing Skills

| Skill | Job | Role |
|---|---|---|
| `solve-linear-ticket` | `solve-linear-ticket` | Parse the ticket into sub-issues, implement each |
| `test-backend` / `test-frontend` | `solve-linear-ticket` | Run tests after implementation |
| `mypy` | `solve-linear-ticket` | Type checking (mandatory per CLAUDE.md) |
| `create-pr` | `solve-linear-ticket` | Create the PR when done |
| `pr-fix` | `solve-linear-ticket` | Address review comments from the human reviewer |
| `smoke` | `solve-linear-ticket` | Run smoke tests before commit |
| `architecture-review` | `solve-linear-ticket` | Part of the self-review rubric |
| `simplify` | `solve-linear-ticket` | Review changed code for quality |
| `solve-chatbot-issues` | `solve-chatbot-issues` | Daily triage: detect chatbot issues, create fixes, open PRs |
| `onboard-client` | `onboard-client` | Full client onboarding workflow (config, skills, evals) |
| `create-client-skills` | `onboard-client` | Create/update skills and eval dataset for a client |
| `prompt-engineering` | `onboard-client` | Fix individual skill failures |
| `browse-playground` | Any | Visual verification via Playwright |

## Consequences

### Positive

- Cheap: Max subscription + spot VMs + Cloud Workflows = near-zero marginal cost per run
- Three use cases covered: coding tasks, client onboarding, and daily chatbot issue triage
- Flexible triggers: not locked to one system — Linear, GitHub, cron, and CLI all work
- Observable: Slack notifications + Langfuse traces = know what's happening without watching
- Incremental: each job and each trigger can be built independently

### Negative

- Complexity: Cloud Workflows, triggers, notifications — more moving parts
- Sonnet limitations: some complex tasks may need Opus (self-review already uses it)
- Token refresh: Max subscription OAuth tokens expire — need monitoring and a process to refresh (re-running the token script)
- Trust calibration: the team needs to learn what each job handles well

### Neutral

- Cloud Workflows is a new GCP service for the team, but it's simple YAML
- Adding a new job is just a new `case` in `run.sh` — minimal new logic
- Existing CI/CD workflows are unaffected; the factory is additive

## Alternatives Considered

### 1. Airflow / Cloud Composer

Rejected: $250+/mo minimum, complex setup, overkill for linear workflows. Cloud Workflows does the same job for pennies.
### 2. GitHub Actions as sole orchestrator
Possible but not required. GitHub Actions can trigger Cloud Workflows via gcloud CLI. Useful if we want GitHub issue labels to trigger runs. But the factory doesn't depend on GitHub — it's triggered by Cloud Workflows which can be invoked from anywhere.
### 3. Prefect / Temporal
Deferred: Good alternatives if Cloud Workflows proves too limited. Prefect has a nice UI and Python-native DAGs. Temporal is better for long-running stateful workflows. Both require self-hosted infrastructure or paid cloud tiers.
### 4. Opus for everything
Rejected: Opus is 5-10x slower and would burn through Max subscription limits faster. Sonnet 4.6 is good enough for implementation. Opus reserved for judgment calls (self-review scoring).
### 5. Separate repo for factory
Deferred: Co-location keeps factory evolving with codebase. Extract later if it grows.
### 6. Composio / CCPM / TSK / Ralphy
Same analysis as before — none provide the full loop we need. Ralphy remains an optional inner-loop wrapper.
## Implementation Phases

### Phase 1: Factory Infra (Weeks 1-2)

- [ ] Create the `factory/` directory structure
- [ ] Build the Dockerfile (Claude Code + Playwright + Supabase CLI)
- [ ] Create the `factory/run.sh` job dispatcher
- [ ] Set up the GCE spot VM with Container-Optimized OS
- [ ] Store secrets in GCP Secret Manager
- [ ] Create the `notify.sh` Slack helper
- [ ] Test: run `claude -p` in the container with Max subscription auth

### Phase 2: First Job — `solve-linear-ticket` (Weeks 3-4)

- [ ] Add the `solve-linear-ticket` case to `run.sh`
- [ ] Implement the self-review loop (Sonnet implements, Opus reviews)
- [ ] Create the Cloud Workflow definition (`factory.yaml`)
- [ ] Test: manually trigger with a simple Linear ticket

### Phase 3: More Jobs (Weeks 5-6)

- [ ] Add the `onboard-client` case with 3 modes (full, eval-only, update-skills)
- [ ] Add the `solve-chatbot-issues` case
- [ ] Test: full onboarding run for an existing client
- [ ] Test: chatbot triage run

### Phase 4: Triggers & Scheduling (Weeks 7-8)

- [ ] Extend the Linear webhook Cloud Function to trigger the factory workflow
- [ ] Set up Cloud Scheduler for daily `solve-chatbot-issues` and weekly `onboard-client` eval-only runs
- [ ] Add manual CLI trigger scripts
- [ ] Optional: GitHub `repository_dispatch` integration
- [ ] Test: end-to-end from Linear label to PR

### Phase 5: Supabase Branching (Weeks 9-10)

- [ ] Enable Supabase Branching on the project
- [ ] Create seed files (`base.sql`, `auth-users.sql`, etc.)
- [ ] Add seed selection logic to the `solve-linear-ticket` job
- [ ] Test: agent creates and uses a Supabase branch

### Phase 6: Polish & Observe (Weeks 11-12)

- [ ] Add heartbeat notifications (every 15 min during long runs)
- [ ] Build cost tracking (tokens per run via the Langfuse API)
- [ ] Run `solve-linear-ticket` on 5-10 real tickets, collect metrics
- [ ] Run `onboard-client` on 3-5 clients, compare with manual results
- [ ] Run `solve-chatbot-issues` for a week, assess triage quality
- [ ] Tune self-review thresholds and max iterations

## References

### Core

- Claude Code Headless Mode
- Claude Code Best Practices
- Claude Agent SDK Overview
- Effective Harnesses for Long-Running Agents (Anthropic)
- Claude Code Sandboxing (Anthropic Engineering)

### Infrastructure

### Integrations

### Community & Tools

- Sam Hessanower — Agentic Dev Pipeline (AI Tinkerers, Aug 2024)
- Ralphy — Autonomous Agent Loop
- learn-claude-code — Harness Engineering
- everything-claude-code — Agent Harness Optimization
- claude-code-harness — Plan→Work→Review Cycle
- Open Harness — Model-Agnostic Agent Harness
- DiffBack — Granular AI Agent Rollback
- E2B — Cloud Sandboxes for Agents
- Faker.js — Fake Data Generator