ADR: Agentic Development Pipeline

Status

Draft

Date

2026-03-04

Context

As Rose grows in scope (Website Agent, Nurturing Agent, and future agents), the team needs to scale development velocity without scaling headcount linearly. The current workflow is manual: a developer picks a Linear ticket, writes code, runs tests, creates a PR, and waits for review. This is effective but creates bottlenecks when multiple features are in flight.

Inspired by Sam Hessanower's agentic dev pipeline (AI Tinkerers / OneShot, Aug 2024), we want to explore autonomous AI agents that can take a Linear ticket from spec to PR — writing code, running tests, self-reviewing, and surfacing the result for human judgment.

Current State

What we already have that supports this:

Capability Current State
Linear integration MCP server configured, linearis CLI, Linear webhook Cloud Function
Claude Code skills 30+ skills (solve-linear-ticket, test-backend, test-frontend, mypy, etc.)
Claude Code settings Permissions, hooks, MCP servers (Linear, Playwright, Figma)
CI/CD GitHub Actions for backend/frontend/docs deploy, PR checks, security scans
Testing pytest (unit/integration/smoke/evaluation markers), Vitest, Playwright
Infrastructure GCP Cloud Run, Terraform, Docker multi-stage builds
Supabase DB, auth, storage, migrations, edge functions
CLAUDE.md Comprehensive codebase instructions for AI agents
Pre-commit hooks Ruff, gitleaks, smoke tests on commit

What's missing:

Gap Impact
No autonomous agent loop Agent can't self-correct without human intervention
No isolated DB per agent run Agents can't safely test DB-dependent features
No containerized agent runtime Can't run agents in parallel, safely, reproducibly
No self-review rubric No structured quality gate before human review
No Linear → agent trigger Human must manually start the agent

Decision

Build one autonomous agent factory that runs different jobs (coding tasks, client onboarding, chatbot issue triage) on GCE spot VMs with Claude Sonnet 4.6 (cheap, fast, good enough), orchestrated by GCP Cloud Workflows (serverless, native, pennies per run). Triggers are flexible: Linear webhooks, GitHub events, manual CLI, or scheduled cron. Humans supervise via Slack notifications plus Langfuse observability.

Architecture Overview

The factory is a single container that dispatches to different jobs based on the trigger payload:

flowchart TB
    classDef trigger fill:#5E6AD2,stroke:#4854c7,color:#fff
    classDef orchestrator fill:#24292f,stroke:#1b1f23,color:#fff
    classDef agent fill:#D97706,stroke:#b45309,color:#fff
    classDef review fill:#059669,stroke:#047857,color:#fff
    classDef notif fill:#DC2626,stroke:#b91c1c,color:#fff

    subgraph TRIGGERS["Triggers (flexible)"]
        LIN["Linear Webhook<br/>ai-ready label"]:::trigger
        GH["GitHub Event<br/>issue/dispatch"]:::trigger
        CRON["Cloud Scheduler<br/>daily / weekly"]:::trigger
        CLI["Manual CLI<br/>gcloud / script"]:::trigger
    end

    CW["Cloud Workflows<br/>(orchestrator)"]:::orchestrator

    subgraph GCE["GCE VM (e2-medium, spot)"]
        direction TB
        DISPATCH["factory/run.sh<br/>job dispatcher"]:::agent
        subgraph JOBS["Jobs (same container, different prompts)"]
            direction LR
            J1["solve-linear-ticket"]:::agent
            J2["onboard-client"]:::agent
            J3["solve-chatbot-issues"]:::agent
        end
        REVIEW["Self-review loop<br/>(Opus scores)"]:::review
    end

    SLACK["Slack Notifications"]:::notif
    LF["Langfuse Observability"]:::notif

    TRIGGERS --> CW --> GCE
    DISPATCH --> JOBS
    J1 --> REVIEW
    REVIEW -->|"score < 4"| J1
    REVIEW -->|"scores ≥ 4"| PR["Draft PR"]:::agent
    GCE --> SLACK
    GCE --> LF

Design Principles

Principle Decision
Model Claude Sonnet 4.6 (--model sonnet) — fast, cheap, good enough for most tasks. Opus only for self-review scoring
Auth Max subscription via ANTHROPIC_AUTH_TOKEN — $0/run, token refreshed from GCP Secret Manager
Orchestration GCP Cloud Workflows — serverless, ~$0.001/execution, native GCP
Triggers Flexible: Linear, GitHub, cron, manual CLI. Not locked to one source
Notifications Slack webhooks for progress/completion/failure
Supervision Langfuse traces for every agent run, cost tracking per run
Compute GCE e2-medium spot — ~$0.01/hr, no time limits, full browser support

The Factory

One container, one entrypoint, different jobs. The job parameter determines which skill to run and with what params.

Entrypoint:

#!/usr/bin/env bash
# factory/run.sh — single entrypoint, dispatches by job type
set -euo pipefail

JOB="${JOB:?required}"

case "$JOB" in
  solve-linear-ticket)
    TICKET_ID="${TICKET_ID:?required}"
    BRANCH="${BRANCH:?required}"
    claude -p "Use the solve-linear-ticket skill to implement ${TICKET_ID}.
      Follow CLAUDE.md strictly. Run tests after each change.
      Create commits for each sub-issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions

    # Self-review (Opus for better judgment)
    claude -p "Use the requesting-code-review skill to review all changes on this branch.
      Score each criterion 0-5. Output JSON." \
      --model opus --output-format json --max-turns 20 \
      --dangerously-skip-permissions
    ;;

  onboard-client)
    CLIENT_DOMAIN="${CLIENT_DOMAIN:?required}"
    MODE="${MODE:-full}"  # full | eval-only | update-skills
    claude -p "Use the onboard-client skill for ${CLIENT_DOMAIN}.
      Environment: staging. Use --local-prompt for all rose-chat and rose-eval calls.
      Target accuracy: 0.70. Max 3 eval iterations.
      Mode: ${MODE}." \
      --model sonnet --output-format json --max-turns 150 \
      --dangerously-skip-permissions
    ;;

  solve-chatbot-issues)
    claude -p "Use the solve-chatbot-issues skill.
      Check Sentry and Langfuse for recent chatbot failures.
      Triage, fix, and open PRs for each issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions
    ;;

  *)
    echo "Unknown job: $JOB"
    exit 1
    ;;
esac

Jobs:

Job Skill Trigger What It Does
solve-linear-ticket solve-linear-ticket Linear ai-ready (non-Onboarding projects), manual CLI Ticket → code → tests → self-review → PR
onboard-client onboard-client Linear ai-ready (Onboarding project), manual CLI, KB update Config + skills + eval dataset + accuracy tuning
solve-chatbot-issues solve-chatbot-issues Daily cron Triage chatbot issues from Sentry/Langfuse, open fix PRs

Self-review loop (for solve-linear-ticket):

repeat (max 3 iterations):
  Opus scores each criterion (0-5)
  if any score < 4:
    Sonnet fixes the issues
    re-run tests
  else:
    create/update PR, tag 'agent-ready'
    notify Slack
    exit
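The gate condition is easy to sketch in shell. Below is a hypothetical helper, not the actual skill output contract — the review JSON shape (`{"criteria": {...}}`) and the file path are assumptions for illustration:

```shell
#!/usr/bin/env bash
# factory/self-review/check-scores.sh — hypothetical gate helper; the review
# JSON shape ({"criteria": {"name": score, ...}}) is an assumed contract,
# not the skill's documented output.

check_scores() {
  # Prints "pass" when every criterion scores >= 4, otherwise "fail".
  python3 -c '
import json, sys
review = json.load(sys.stdin)
print("pass" if all(s >= 4 for s in review["criteria"].values()) else "fail")
'
}

echo '{"criteria": {"correctness": 5, "tests": 3, "style": 4}}' | check_scores   # → fail
echo '{"criteria": {"correctness": 5, "tests": 4, "style": 4}}' | check_scores   # → pass
```

The orchestrating review-loop.sh would feed Opus's JSON into this gate and either re-invoke Sonnet or proceed to PR creation.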

Invocation examples:

# Coding task from Linear ticket
gcloud workflows run factory --data='{"job":"solve-linear-ticket","ticket_id":"abc123","ticket_identifier":"IX-123"}'

# Client onboarding
gcloud workflows run factory --data='{"job":"onboard-client","client_domain":"acme.com","mode":"full"}'

# Daily chatbot triage (triggered by Cloud Scheduler)
gcloud workflows run factory --data='{"job":"solve-chatbot-issues"}'

Local Development

A CLI wrapper (rose-factory) makes it easy to run jobs locally in Docker:

#!/usr/bin/env bash
# rose-factory — run factory jobs locally
set -euo pipefail

usage() {
  echo "Usage: rose-factory <job> [options]"
  echo ""
  echo "Jobs:"
  echo "  solve-linear-ticket  --ticket IX-123"
  echo "  onboard-client       --domain acme.com [--mode full|eval-only|update-skills]"
  echo "  solve-chatbot-issues"
  echo ""
  echo "Options:"
  echo "  --no-docker    Run directly without Docker (requires Claude CLI installed)"
  exit 1
}

[[ $# -lt 1 ]] && usage

JOB="$1"; shift
NO_DOCKER=false
EXTRA_ENV=()

while [[ $# -gt 0 ]]; do
  case "$1" in
    --ticket)     EXTRA_ENV+=(-e "TICKET_ID=$2" -e "BRANCH=feature/$2"); shift 2 ;;
    --domain)     EXTRA_ENV+=(-e "CLIENT_DOMAIN=$2"); shift 2 ;;
    --mode)       EXTRA_ENV+=(-e "MODE=$2"); shift 2 ;;
    --no-docker)  NO_DOCKER=true; shift ;;
    *)            echo "Unknown option: $1"; usage ;;
  esac
done

if [ "$NO_DOCKER" = true ]; then
  # Run directly — extract env vars from EXTRA_ENV flags
  for e in "${EXTRA_ENV[@]}"; do
    [[ "$e" == "-e" ]] && continue
    export "$e"
  done
  export JOB
  exec ./factory/run.sh
fi

docker compose -f factory/docker-compose.yml run --rm \
  -e "JOB=$JOB" "${EXTRA_ENV[@]}" \
  factory

Examples:

# Run a Linear ticket locally in Docker
rose-factory solve-linear-ticket --ticket IX-123

# Onboard a client without Docker
rose-factory onboard-client --domain acme.com --mode full --no-docker

# Chatbot triage
rose-factory solve-chatbot-issues

Infrastructure

Container

FROM node:22-bookworm

RUN npm install -g @anthropic-ai/claude-code
RUN npx playwright install --with-deps chromium
RUN npm install -g supabase
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .

# factory/docker-compose.yml
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent

Directory Structure

factory/
├── Dockerfile                # Container image (Claude Code + Playwright + Supabase CLI)
├── docker-compose.yml        # Single factory service
├── run.sh                    # Job dispatcher entrypoint
├── self-review/
│   ├── rubric.md             # Scoring criteria
│   └── review-loop.sh        # Self-review orchestrator
├── orchestration/
│   ├── factory.yaml          # Cloud Workflow definition
│   └── notify.sh             # Slack notification helper
└── seeds/
    ├── base.sql              # Empty schema, minimal seed
    ├── auth-users.sql        # Admin + regular user with tokens
    ├── content-populated.sql # Realistic content volume
    └── multi-tenant.sql      # 2+ orgs with separate data

Authentication

Max subscription auth via ANTHROPIC_AUTH_TOKEN — $0/run (included in subscription).

# Generate a long-lived token and store it in GCP Secret Manager
./scripts/secrets/refresh-claude-token.sh

The script runs claude setup-token (opens browser), then prompts you to paste the token and stores it in Secret Manager. The token is injected into Docker containers at runtime via env var:

docker run \
  -e ANTHROPIC_AUTH_TOKEN="$(gcloud secrets versions access latest --secret=CLAUDE_AUTH_TOKEN)" \
  rose-agent claude -p "..."

Note: claude -p inside Docker is the official CLI binary, so subscription OAuth works. Token expires periodically — refresh by re-running the script.

All secrets injected via GCP Secret Manager:

Secret Used By
CLAUDE_AUTH_TOKEN Max subscription auth ($0/run)
LINEAR_API_KEY Linear MCP server
GITHUB_TOKEN Git push, PR creation
SLACK_WEBHOOK_URL Notifications
LANGFUSE_SECRET_KEY LLM observability
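The boot-time step that materializes these secrets into .env.agent can be sketched as follows — a sketch, with the gcloud call isolated in a helper function; the secret names come from the table above, the output path matches docker-compose's env_file:

```shell
#!/usr/bin/env bash
# Sketch of the VM boot step that writes .env.agent from Secret Manager.

SECRETS="CLAUDE_AUTH_TOKEN LINEAR_API_KEY GITHUB_TOKEN SLACK_WEBHOOK_URL LANGFUSE_SECRET_KEY"

fetch_secret() {
  # Real implementation; swap out for testing.
  gcloud secrets versions access latest --secret="$1"
}

write_env_file() {
  out="$1"
  : > "$out"
  chmod 600 "$out"   # secrets file is owner-readable only
  for name in $SECRETS; do
    printf '%s=%s\n' "$name" "$(fetch_secret "$name")" >> "$out"
  done
}
```

Usage on the VM would be `write_env_file /app/.env.agent` before `docker compose run`.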

GCE VM (one-time setup)

gcloud compute instances create rose-agent-runner \
  --zone=europe-west9-a \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --boot-disk-size=50GB \
  --image-family=cos-stable \
  --image-project=cos-cloud

Orchestration — GCP Cloud Workflows

Why Cloud Workflows (not Airflow):

Cloud Workflows Cloud Composer (Airflow)
Cost ~$0.001/execution ~$250/mo minimum (GKE + Cloud SQL)
Setup YAML file, deploy in 1 min 30+ min environment creation
Long-running Up to 1 year timeout Unlimited
Maintenance Zero (serverless) Cluster upgrades, DAG syncing
Fit Small team, simple flows Large team, complex DAGs

Cloud Composer is overkill for our use case. Cloud Workflows is serverless, costs pennies, and handles the linear flow (trigger → start VM → run container → notify) perfectly.

Factory workflow (single workflow, job dispatched via params):

# factory/orchestration/factory.yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          # map.get returns null for missing keys, so optional fields don't error
          - job_label: ${args.job + " — " + default(map.get(args, "ticket_identifier"), default(map.get(args, "client_domain"), ""))}
          - project_id: ${sys.get_env("GCP_PROJECT_ID")}
          - slack_url: ${sys.get_env("SLACK_WEBHOOK_URL")}

    - run_factory:
        try:
          steps:
            - notify_start:
                call: http.post
                args:
                  url: ${slack_url}
                  body:
                    text: ${"🏭 Factory started: " + job_label}

            - start_vm:
                call: googleapis.compute.v1.instances.start
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner

            - run_agent:
                # Illustrative — the Compute API has no run-command method; in
                # practice this step sets startup-script metadata or uses SSH.
                call: googleapis.compute.v1.instances.runCommand
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner
                  command: >
                    cd /app && git fetch origin
                    && docker compose run --rm factory
                  env: ${args}
                result: agent_result

            - notify_done:
                call: http.post
                args:
                  url: ${slack_url}
                  body:
                    text: ${"✅ Factory done: " + job_label}

            - stop_vm:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner

        except:
          as: e
          steps:
            - notify_failure:
                call: http.post
                args:
                  url: ${slack_url}
                  body:
                    text: ${"🚨 Factory FAILED: " + job_label + "\nError: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}

Triggers

Triggers are decoupled from jobs — any trigger can invoke any job via Cloud Workflows.

Trigger Source Job How
Linear ai-ready label Linear webhook → Cloud Function Routed by project Cloud Function maps project → job (see below)
GitHub repository_dispatch GitHub webhook solve-linear-ticket Optional — if we want GitHub issues to trigger runs
Manual CLI Developer's terminal Any gcloud workflows run factory --data='{"job":"...","ticket_id":"IX-123"}'
Daily cron Cloud Scheduler solve-chatbot-issues Daily triage of chatbot issues from Sentry/Langfuse
Weekly cron Cloud Scheduler onboard-client Scheduled eval-only runs for all active clients
KB update Document upload webhook onboard-client update-skills mode for the changed client

Linear ai-ready routing: One label, one webhook — the Cloud Function routes by Linear project:

Linear Project Job
"Onboarding" onboard-client (extracts client domain from ticket)
Any other project solve-linear-ticket (default)
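The routing rule is a two-line dispatch. Shown here as shell for consistency with the rest of this ADR — the actual Cloud Function may be Python or Node, so treat this as a sketch of the logic only:

```shell
# Routing rule from the table above: the "Onboarding" project maps to
# onboard-client, every other project to solve-linear-ticket.
route_job() {
  case "$1" in
    Onboarding) echo "onboard-client" ;;
    *)          echo "solve-linear-ticket" ;;
  esac
}

route_job "Onboarding"   # → onboard-client
route_job "Website"      # → solve-linear-ticket
```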

Why triggers are flexible (not locked to Linear or GitHub):

  • Same ai-ready label works across all projects — routing is automatic
  • Sometimes you just want to kick off a run from the CLI without creating a ticket
  • Cron-based jobs (chatbot triage, re-evaluation) don't map to any issue tracker event
  • GitHub integration is nice-to-have (auto-trigger from issue labels) but not required

Notifications & Supervision

Slack Notifications

Every factory run posts to a #agent-factory Slack channel:

Event Message
Run started Factory type, ticket/client, branch
Progress update Every 15 min during long runs (heartbeat)
Self-review scores Per-criterion scores, pass/fail
Run completed PR link or accuracy report
Run failed Error summary, link to logs

Implemented via a simple notify.sh helper that posts to a Slack webhook.
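notify.sh can stay tiny — a minimal sketch, where the event names follow the table above and SLACK_WEBHOOK_URL is assumed to be set in the environment:

```shell
#!/usr/bin/env bash
# factory/orchestration/notify.sh — minimal sketch. Event names follow the
# table above; SLACK_WEBHOOK_URL is assumed to be in the environment.

slack_payload() {
  # Build the JSON body for an event; python3 handles the escaping.
  python3 -c '
import json, sys
event, detail = sys.argv[1], " ".join(sys.argv[2:])
icons = {"started": "🏭", "heartbeat": "⏳", "completed": "✅", "failed": "🚨", "stuck": "⚠️"}
print(json.dumps({"text": icons.get(event, "") + " " + event + ": " + detail}))
' "$@"
}

# Posting is then a single curl call:
# curl -fsS -X POST -H 'Content-Type: application/json' \
#   -d "$(slack_payload completed IX-123 'PR ready')" "$SLACK_WEBHOOK_URL"
```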

Langfuse Observability

Every claude -p call is traced in Langfuse (already configured in the project). This gives:

  • Token usage per run — track cost per ticket / per client
  • Latency breakdown — how long each agent turn takes
  • Prompt versions — which skill versions were used
  • Failure analysis — inspect the full agent conversation when something goes wrong

Cost Tracking

Component Cost
Claude Sonnet 4.6 (Max subscription) $0/run (included in subscription)
GCE e2-medium spot ~$0.01/hr (~$0.003–0.04/run)
Cloud Workflows ~$0.001/execution
Slack webhook Free
Langfuse Free tier (50k traces/mo)
Total per run ~$0.01–0.05 compute only

The Max subscription makes this extremely cheap — the main cost is just the VM compute time. Still use --max-turns to prevent runaway loops that waste subscription quota.

Supervision Dashboard

No custom dashboard needed initially. Use existing tools:

  • Langfuse → LLM traces, cost, latency
  • Cloud Logging → VM and container logs
  • Slack → real-time notifications
  • GitHub → PR status, CI checks
  • Linear → ticket status updates

If we need a consolidated view later, build a simple Cloud Monitoring dashboard.


Supabase Database Branching

Goal: Each solve-linear-ticket run gets its own isolated database.

Use Supabase Branching to create ephemeral DB branches per agent run:

# supabase/config.toml
[db.seed]
enabled = true
sql_paths = ["./seed.sql", "./seeds/*.sql"]

Seed strategy:

Seed File Contents When to Use
base.sql Schema + minimal reference data Default for all runs
auth-users.sql Admin + regular users with mock tokens Features touching auth
content-populated.sql Realistic volume for UI testing Frontend-heavy features
multi-tenant.sql Multiple orgs with separate data Multi-tenant features

The onboard-client and solve-chatbot-issues jobs don't need DB branching — they use the staging environment directly.
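Seed selection could be driven by ticket labels — a hypothetical mapping (the label names are illustrative, and base.sql is the fallback per the table above):

```shell
# Hypothetical label → seed-file mapping; label names are illustrative.
select_seed() {
  # Input: the ticket's labels as one comma-separated string.
  case "$1" in
    *auth*)         echo "seeds/auth-users.sql" ;;
    *multi-tenant*) echo "seeds/multi-tenant.sql" ;;
    *frontend*)     echo "seeds/content-populated.sql" ;;
    *)              echo "seeds/base.sql" ;;
  esac
}

select_seed "bug,frontend"   # → seeds/content-populated.sql
select_seed "backend"        # → seeds/base.sql
```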


Harness Engineering Best Practices

Based on Anthropic's harness engineering guidance and the broader cloud agent ecosystem, these practices apply to all jobs.

CLI vs SDK — When to Use Which

CLI (claude -p) Agent SDK (@anthropic-ai/claude-agent-sdk)
When Scripts, CI/CD, existing skills, agent teams Custom agent loops, programmatic control, non-coding agents
Pros Zero code, reads .mcp.json + skills automatically, --resume for multi-turn Full event streaming, tool approval callbacks, session management
Cons Less control over agent loop Each user = separate subprocess, locked to Anthropic
Our choice CLI for all jobs — skills and MCP servers work automatically Consider SDK later for non-coding agents (legal review, data analysis)

The SDK spawns the same CLI process under the hood. For our use case, the CLI is simpler and gets all skills/MCP servers for free.

Harness quality matters: Anthropic's benchmarks show the same Opus model scoring 78% with Claude Code's harness vs 42% with a naive harness on CORE benchmark — a 36-point gap from harness engineering alone.

Safety Layers (Defense in Depth)

--dangerously-skip-permissions is required in headless mode (no human to click "Allow"), but we mitigate risk through layered defenses:

Layer 1: Container isolation (microVM or Docker with no host socket)
Layer 2: Network allow/deny lists (only approved endpoints)
Layer 3: --allowedTools whitelist (restrict available tools)
Layer 4: PreToolUse hooks (block specific dangerous patterns)
Layer 5: Filesystem boundaries (agent can only write to workspace)
Layer 6: Cost circuit breakers (kill after $X or N errors)

Concrete implementation:

# Safer than naked --dangerously-skip-permissions:
# Note: --allowedTools uses permission rule syntax (":*" suffix = prefix match)
claude -p "..." \
  --model sonnet \
  --max-turns 100 \
  --output-format stream-json \
  --allowedTools "Read,Write,Edit,Glob,Grep,Bash(git:*),Bash(npm run:*),Bash(just:*),mcp__playwright" \
  --dangerously-skip-permissions

Alternatives to --dangerously-skip-permissions (from safest to most permissive):

Method Best for
Native sandbox (Seatbelt/bubblewrap) Day-to-day dev
--allowedTools per-session whitelist CI/CD with specific needs
settings.json allowlists Team-wide policies
PreToolUse hooks Custom business logic
--permission-prompt-tool (MCP) Enterprise audit trails
--permission-mode acceptEdits Moderate autonomy
Docker Sandbox + --dangerously-skip-permissions Fully automated (our choice)

PreToolUse hooks as guardrails (hooks fire even in bypass mode):

// .claude/settings.json — block dangerous patterns
// (hooks receive the tool call as JSON on stdin; exit code 2 blocks the call)
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "python3 -c \"import sys, json, re; cmd = json.load(sys.stdin)['tool_input']['command']; sys.exit(2 if any(re.search(p, cmd) for p in [r'rm -rf /', r'/var/run/docker\\.sock', r'curl.*\\|.*sh', r'wget.*\\|.*bash']) else 0)\""
        }]
      }
    ]
  }
}

Docker Container Hardening

FROM node:22-bookworm

# Claude Code CLI
RUN npm install -g @anthropic-ai/claude-code

# Playwright with Chromium
RUN npx playwright install --with-deps chromium

# Supabase CLI for DB branching
RUN npm install -g supabase

# Python tooling
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Non-root user for safety
RUN useradd -m agent
USER agent

WORKDIR /app
COPY --chown=agent:agent . .

Critical rules:

Rule Why
Never mount /var/run/docker.sock Agent could escape sandbox via Docker API
Use --network=none or allow-list Prevent data exfiltration via prompt injection
Non-root user inside container Limit blast radius of filesystem operations
No host volume mounts for secrets Use env vars from Secret Manager instead
Set resource limits --memory=4g --cpus=2 to prevent resource exhaustion

# factory/docker-compose.yml (hardened)
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
    networks:
      - agent-net
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp

networks:
  agent-net:
    driver: bridge
    # Allow-list outbound traffic via iptables rules on the host

Output Format & Error Handling

Use stream-json for real-time monitoring, json for structured results:

# Stream events in real-time (for heartbeat/progress monitoring)
claude -p "..." --output-format stream-json 2>&1 | while IFS= read -r line; do
  TYPE=$(echo "$line" | jq -r '.type // empty')
  case "$TYPE" in
    "assistant")
      # Agent is producing output — heartbeat is alive
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID"
      ;;
    "result")
      # Final result — extract cost and session info
      COST=$(echo "$line" | jq -r '.total_cost_usd // .cost_usd')  # field name varies by CLI version
      TURNS=$(echo "$line" | jq -r '.num_turns')
      ./factory/orchestration/notify.sh completed "$TICKET_ID" "$COST" "$TURNS"
      ;;
    "error")
      ./factory/orchestration/notify.sh failed "$TICKET_ID" "$(echo "$line" | jq -r '.error')"
      ;;
  esac
done

# Check claude's exit status — use PIPESTATUS[0]; plain $? reports the while loop
EXIT_CODE=${PIPESTATUS[0]}
if [ "$EXIT_CODE" -ne 0 ]; then
  ./factory/orchestration/notify.sh failed "$TICKET_ID" "Exit code: $EXIT_CODE"
fi

Cost Circuit Breakers

Prevent runaway agent spending:

# factory/run.sh — with circuit breakers
MAX_COST_USD=10.00
MAX_CONSECUTIVE_ERRORS=5
MAX_DURATION_SECONDS=7200  # 2 hours

STARTED_AT=$(date +%s)
ERRORS=0

# Run claude in the background so the monitor loop can kill it by PID
claude -p "..." --output-format stream-json > /tmp/agent-stream.jsonl 2>&1 &
CLAUDE_PID=$!

tail --pid="$CLAUDE_PID" -f /tmp/agent-stream.jsonl | while IFS= read -r line; do
  # Duration breaker
  ELAPSED=$(( $(date +%s) - STARTED_AT ))
  if [ "$ELAPSED" -gt "$MAX_DURATION_SECONDS" ]; then
    echo "CIRCUIT BREAKER: duration exceeded ${MAX_DURATION_SECONDS}s"
    kill "$CLAUDE_PID"
    exit 1
  fi

  # Cost breaker (cost reported in stream-json result events)
  COST=$(echo "$line" | jq -r '.cost_usd // 0')
  if awk -v c="$COST" -v max="$MAX_COST_USD" 'BEGIN { exit !(c > max) }'; then
    echo "CIRCUIT BREAKER: cost exceeded \$${MAX_COST_USD}"
    kill "$CLAUDE_PID"
    exit 1
  fi

  # Consecutive-error breaker
  if [ "$(echo "$line" | jq -r '.type // empty')" = "error" ]; then
    ERRORS=$((ERRORS + 1))
    if [ "$ERRORS" -ge "$MAX_CONSECUTIVE_ERRORS" ]; then
      echo "CIRCUIT BREAKER: ${ERRORS} consecutive errors"
      kill "$CLAUDE_PID"
      exit 1
    fi
  else
    ERRORS=0
  fi
done

Heartbeat & Liveness Monitoring

# factory/orchestration/heartbeat.sh — runs in the background alongside the agent
HEARTBEAT_INTERVAL=900   # 15 minutes
HEARTBEAT_TIMEOUT=1800   # 30 minutes — agent is stuck if no output for this long

# A shell variable can't be shared across processes, so activity is tracked via
# a file's mtime — the stream reader touches this file on every agent event.
ACTIVITY_FILE=/tmp/agent-activity

monitor_heartbeat() {
  while true; do
    sleep "$HEARTBEAT_INTERVAL"
    LAST_ACTIVITY=$(stat -c %Y "$ACTIVITY_FILE" 2>/dev/null || echo 0)
    IDLE_TIME=$(( $(date +%s) - LAST_ACTIVITY ))

    if [ "$IDLE_TIME" -gt "$HEARTBEAT_TIMEOUT" ]; then
      ./factory/orchestration/notify.sh stuck "$TICKET_ID" "No activity for ${IDLE_TIME}s"
      # Don't kill — a human decides
    else
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID" "Running (${IDLE_TIME}s since last activity)"
    fi
  done
}

Long-Running Agent Sessions

Per Anthropic's guidance, context windows are limited and complex projects can't complete in a single session. Claude Code handles this via compaction — automatically summarizing context when approaching limits. Key practices:

  • Let compaction work: Don't set --max-turns too low. Sonnet can do 100+ turns with compaction
  • Checkpoint commits: Configure the agent to commit after each sub-task, creating rollback points
  • Session continuity: If a session fails mid-way, re-run with --resume to continue from the last checkpoint
  • CLAUDE.md as persistent memory: The agent re-reads CLAUDE.md after compaction, so project instructions survive context resets
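Resuming a failed run can be scripted around the JSON result, which carries the session id. A sketch (file paths and the retry prompt are illustrative):

```shell
# Capture the session id from a run's JSON result so a retry can resume it.
# (Paths and the retry prompt are illustrative.)
extract_session_id() {
  python3 -c 'import json, sys; print(json.load(sys.stdin).get("session_id", ""))'
}

# First attempt:
#   claude -p "..." --output-format json > result.json || true
#   extract_session_id < result.json > /tmp/agent-session-id
# Retry after a failure:
#   claude --resume "$(cat /tmp/agent-session-id)" -p "Continue from the last checkpoint"
```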

Cloud Workflows Error Handling

# factory/orchestration/factory.yaml — with proper error handling
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          - job_label: ${args.job + " — " + default(map.get(args, "ticket_identifier"), default(map.get(args, "client_domain"), ""))}
          - project_id: ${sys.get_env("GCP_PROJECT_ID")}

    - notify_start:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory started - " + job_label}

    - start_vm:
        try:
          call: googleapis.compute.v1.instances.start
          args:
            project: ${project_id}
            zone: europe-west9-a
            instance: rose-agent-runner
        retry:
          predicate: ${default_retry_predicate}
          max_retries: 3
          backoff:
            initial_delay: 2
            max_delay: 30
            multiplier: 2

    - wait_for_vm:
        call: sys.sleep
        args:
          seconds: 30

    - run_agent:
        try:
          call: http.post
          args:
            # NOTE: illustrative — the Compute API has no per-instance runCommand
            # method; a real implementation would use startup-script metadata or SSH.
            url: ${"https://compute.googleapis.com/compute/v1/projects/" + project_id + "/zones/europe-west9-a/instances/rose-agent-runner/runCommand"}
            auth:
              type: OAuth2
            body:
              command: >
                cd /app && git fetch origin
                && docker compose run --rm factory
              env: ${args}
            timeout: 7200  # 2 hour timeout
          result: agent_result
        except:
          as: e
          steps:
            - notify_error:
                call: http.post
                args:
                  url: ${sys.get_env("SLACK_WEBHOOK_URL")}
                  body:
                    text: ${"Factory FAILED - " + job_label + " - Error: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}

    - notify_done:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory done - " + job_label}

    - stop_vm:
        call: googleapis.compute.v1.instances.stop
        args:
          project: ${project_id}
          zone: europe-west9-a
          instance: rose-agent-runner

    # Secrets are injected into the container via .env.agent,
    # generated at VM boot from Secret Manager.

Multi-Agent Coordination

When running multiple factory instances in parallel:

Concern Solution
Git conflicts Each agent gets its own feature branch — never share branches
Shared config files Agents work on isolated worktrees (git worktree add)
DB conflicts Each solve-linear-ticket run gets its own Supabase branch
Resource contention One agent per VM, or multiple VMs with distinct names
Merge conflicts on shared files Detect hotspot files (routes, configs) — flag for human resolution
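The worktree-isolation row can be sketched with plain git (branch and path naming here is illustrative):

```shell
# One isolated checkout per agent run — sibling worktrees sharing one .git store.
# Branch and path naming is illustrative.
new_agent_worktree() {
  ticket="$1"
  dir="../worktrees/${ticket}"
  mkdir -p "$(dirname "$dir")"
  git worktree add --quiet -b "feature/${ticket}" "$dir" origin/develop
  echo "$dir"
}
```

Each factory instance then runs inside its own directory, so shared config files never collide on disk; conflicts surface only at merge time.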

Rollback Strategy

  • Agent creates checkpoint commits after each sub-task
  • Self-review fails → git reset to last good checkpoint
  • Human review fails → git revert the entire branch
  • Production issue → branch was never merged to main (always PR-based)

The key insight: agents should never push directly to develop or main. Always via feature branch + PR. The PR is the human approval gate.
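The checkpoint-reset step is sketched below, assuming a "checkpoint:" commit-message convention (the convention itself is an assumption, not something the skills enforce today):

```shell
# Find the newest commit whose message starts with "checkpoint:" and hard-reset
# to it — discarding everything the agent did after its last good state.
# The "checkpoint:" message convention is an assumption.
reset_to_last_checkpoint() {
  sha=$(git log --grep='^checkpoint:' -n 1 --format=%H)
  if [ -z "$sha" ]; then
    echo "no checkpoint commit found" >&2
    return 1
  fi
  git reset --hard "$sha"
}
```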


Leveraging Existing Skills

Skill Job Role
solve-linear-ticket solve-linear-ticket Parse ticket into sub-issues, implement each
test-backend / test-frontend solve-linear-ticket Run tests after implementation
mypy solve-linear-ticket Type checking (mandatory per CLAUDE.md)
create-pr solve-linear-ticket Create the PR when done
pr-fix solve-linear-ticket Address review comments from human reviewer
smoke solve-linear-ticket Run smoke tests before commit
architecture-review solve-linear-ticket Part of self-review rubric
simplify solve-linear-ticket Review changed code for quality
solve-chatbot-issues solve-chatbot-issues Daily triage: detect chatbot issues, create fixes, open PRs
onboard-client onboard-client Full client onboarding workflow (config, skills, evals)
create-client-skills onboard-client Create/update skills and eval dataset for a client
prompt-engineering onboard-client Fix individual skill failures
browse-playground Any Visual verification via Playwright

Consequences

Positive

  • Cheap: Max subscription + spot VMs + Cloud Workflows = near-zero marginal cost per run
  • Three use cases covered: Coding tasks, client onboarding, and daily chatbot issue triage
  • Flexible triggers: Not locked to one system — Linear, GitHub, cron, CLI all work
  • Observable: Slack notifications + Langfuse traces = know what's happening without watching
  • Incremental: Each factory and each trigger can be built independently

Negative

  • Complexity: Cloud Workflows, triggers, notifications — more moving parts
  • Sonnet limitations: Some complex tasks may need Opus (self-review already uses it)
  • Token refresh: Max subscription OAuth tokens expire — need monitoring and a process to refresh via claude setup-token (see Authentication)
  • Trust calibration: Team needs to learn what each job handles well

Neutral

  • Cloud Workflows is a new GCP service for the team, but it's simple YAML
  • Adding a new job is just a new case in run.sh — minimal new logic
  • Existing CI/CD workflows are unaffected; the factory is additive

Alternatives Considered

1. Airflow / Cloud Composer

Rejected: $250+/mo minimum, complex setup, overkill for linear workflows. Cloud Workflows does the same job for pennies.

2. GitHub Actions as sole orchestrator

Possible but not required. GitHub Actions can trigger Cloud Workflows via gcloud CLI. Useful if we want GitHub issue labels to trigger runs. But the factory doesn't depend on GitHub — it's triggered by Cloud Workflows which can be invoked from anywhere.

3. Prefect / Temporal

Deferred: Good alternatives if Cloud Workflows proves too limited. Prefect has a nice UI and Python-native DAGs. Temporal is better for long-running stateful workflows. Both require self-hosted infrastructure or paid cloud tiers.

4. Opus for everything

Rejected: Opus is 5-10x slower and would burn through Max subscription limits faster. Sonnet 4.6 is good enough for implementation. Opus reserved for judgment calls (self-review scoring).

5. Separate repo for factory

Deferred: Co-location keeps factory evolving with codebase. Extract later if it grows.

6. Composio / CCPM / TSK / Ralphy

Same analysis as before — none provide the full loop we need. Ralphy remains an optional inner-loop wrapper.

Implementation Phases

Phase 1: Factory Infra (Week 1-2)

  • [ ] Create factory/ directory structure
  • [ ] Build Dockerfile (Claude Code + Playwright + Supabase CLI)
  • [ ] Create factory/run.sh job dispatcher
  • [ ] Set up GCE spot VM with Container-Optimized OS
  • [ ] Store secrets in GCP Secret Manager
  • [ ] Create notify.sh Slack helper
  • [ ] Test: run claude -p in container with Max subscription auth

Phase 2: First Job — solve-linear-ticket (Week 3-4)

  • [ ] Add solve-linear-ticket case to run.sh
  • [ ] Implement self-review loop (Sonnet implements, Opus reviews)
  • [ ] Create Cloud Workflow definition (factory.yaml)
  • [ ] Test: manually trigger with a simple Linear ticket

Phase 3: More Jobs (Week 5-6)

  • [ ] Add onboard-client case with 3 modes (full, eval-only, update-skills)
  • [ ] Add solve-chatbot-issues case
  • [ ] Test: full onboarding run for an existing client
  • [ ] Test: chatbot triage run

Phase 4: Triggers & Scheduling (Week 7-8)

  • [ ] Extend Linear webhook Cloud Function to trigger factory workflow
  • [ ] Set up Cloud Scheduler for daily solve-chatbot-issues and weekly onboard-client eval-only
  • [ ] Add manual CLI trigger scripts
  • [ ] Optional: GitHub repository_dispatch integration
  • [ ] Test: end-to-end from Linear label to PR

Phase 5: Supabase Branching (Week 9-10)

  • [ ] Enable Supabase Branching on the project
  • [ ] Create seed files (base.sql, auth-users.sql, etc.)
  • [ ] Add seed selection logic to solve-linear-ticket job
  • [ ] Test: agent creates and uses a Supabase branch

Phase 6: Polish & Observe (Week 11-12)

  • [ ] Add heartbeat notifications (every 15 min during long runs)
  • [ ] Build cost tracking (tokens per run via Langfuse API)
  • [ ] Run solve-linear-ticket on 5-10 real tickets, collect metrics
  • [ ] Run onboard-client on 3-5 clients, compare with manual results
  • [ ] Run solve-chatbot-issues for a week, assess triage quality
  • [ ] Tune self-review thresholds and max iterations

References

Core

Infrastructure

Integrations

Community & Tools