# ADR: Agentic Development Pipeline

## Status

Draft

## Date

2026-03-04

## Context

As Rose grows in scope (Website Agent, Nurturing Agent, and future agents), the team needs to scale development velocity without scaling headcount linearly. The current workflow is manual: a developer picks a Linear ticket, writes code, runs tests, creates a PR, and waits for review. This is effective but creates bottlenecks when multiple features are in flight.
Inspired by Sam Hessanower's agentic dev pipeline (AI Tinkerers / OneShot, Aug 2024), we want to explore autonomous AI agents that can take a Linear ticket from spec to PR — writing code, running tests, self-reviewing, and surfacing the result for human judgment.
## Current State

What we already have that supports this:

| Capability | Current State |
|---|---|
| Linear integration | MCP server configured, linearis CLI, Linear webhook Cloud Function |
| Claude Code skills | 30+ skills (solve-linear-ticket, test-backend, test-frontend, mypy, etc.) |
| Claude Code settings | Permissions, hooks, MCP servers (Linear, Playwright, Figma) |
| CI/CD | GitHub Actions for backend/frontend/docs deploy, PR checks, security scans |
| Testing | pytest (unit/integration/smoke/evaluation markers), Vitest, Playwright |
| Infrastructure | GCP Cloud Run, Terraform, Docker multi-stage builds |
| Supabase | DB, auth, storage, migrations, edge functions |
| CLAUDE.md | Comprehensive codebase instructions for AI agents |
| Pre-commit hooks | Ruff, gitleaks, smoke tests on commit |
What's missing:
| Gap | Impact |
|---|---|
| No autonomous agent loop | Agent can't self-correct without human intervention |
| No isolated DB per agent run | Agents can't safely test DB-dependent features |
| No containerized agent runtime | Can't run agents in parallel, safely, reproducibly |
| No self-review rubric | No structured quality gate before human review |
| No Linear → agent trigger | Human must manually start the agent |
## Decision

Build one autonomous agent factory that runs different jobs (coding tasks, client onboarding, chatbot issue triage) on GCE spot VMs with Claude Sonnet 4.6 (cheap, fast, good enough). Orchestration is handled by GCP Cloud Workflows (serverless, native, pennies per run). Triggers are flexible: Linear webhooks, GitHub events, manual CLI, or scheduled cron. Human supervision happens via Slack notifications plus Langfuse observability.

## Architecture Overview

The factory is a single container that dispatches to different jobs based on the trigger payload.

### Design Principles

| Principle | Decision |
|---|---|
| Model | Claude Sonnet 4.6 (--model sonnet) — fast, cheap, good enough for most tasks. Opus only for self-review scoring |
| Auth | Max subscription via ANTHROPIC_AUTH_TOKEN — $0/run, token refreshed from GCP Secret Manager |
| Orchestration | GCP Cloud Workflows — serverless, ~$0.001/execution, native GCP |
| Triggers | Flexible: Linear, GitHub, cron, manual CLI. Not locked to one source |
| Notifications | Slack webhooks for progress/completion/failure |
| Supervision | Langfuse traces for every agent run, cost tracking per run |
| Compute | GCE e2-medium spot — ~$0.01/hr, no time limits, full browser support |
## The Factory

One container, one entrypoint, different jobs. The `JOB` parameter determines which skill to run and with what params.

Entrypoint:

```bash
#!/usr/bin/env bash
# factory/run.sh — single entrypoint, dispatches by job type
set -euo pipefail

JOB="${JOB:?required}"

case "$JOB" in
  solve-linear-ticket)
    TICKET_ID="${TICKET_ID:?required}"
    BRANCH="${BRANCH:?required}"
    claude -p "Use the solve-linear-ticket skill to implement ${TICKET_ID}.
      Follow CLAUDE.md strictly. Run tests after each change.
      Create commits for each sub-issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions

    # Self-review (Opus for better judgment)
    claude -p "Use the requesting-code-review skill to review all changes on this branch.
      Score each criterion 0-5. Output JSON." \
      --model opus --output-format json --max-turns 20 \
      --dangerously-skip-permissions
    ;;

  onboard-client)
    CLIENT_DOMAIN="${CLIENT_DOMAIN:?required}"
    MODE="${MODE:-full}"  # full | eval-only | update-skills
    claude -p "Use the onboard-client skill for ${CLIENT_DOMAIN}.
      Environment: staging. Use --local-prompt for all rose-chat and rose-eval calls.
      Target accuracy: 0.70. Max 3 eval iterations.
      Mode: ${MODE}." \
      --model sonnet --output-format json --max-turns 150 \
      --dangerously-skip-permissions
    ;;

  solve-chatbot-issues)
    claude -p "Use the solve-chatbot-issues skill.
      Check Sentry and Langfuse for recent chatbot failures.
      Triage, fix, and open PRs for each issue." \
      --model sonnet --output-format json --max-turns 100 \
      --dangerously-skip-permissions
    ;;

  *)
    echo "Unknown job: $JOB"
    exit 1
    ;;
esac
```
Jobs:

| Job | Skill | Trigger | What It Does |
|---|---|---|---|
| `solve-linear-ticket` | `solve-linear-ticket` | Linear `ai-ready` (non-Onboarding projects), manual CLI | Ticket → code → tests → self-review → PR |
| `onboard-client` | `onboard-client` | Linear `ai-ready` (Onboarding project), manual CLI, KB update | Config + skills + eval dataset + accuracy tuning |
| `solve-chatbot-issues` | `solve-chatbot-issues` | Daily cron | Triage chatbot issues from Sentry/Langfuse, open fix PRs |

Self-review loop (for `solve-linear-ticket`):

```text
repeat (max 3 iterations):
    Opus scores each criterion (0-5)
    if any score < 4:
        Sonnet fixes the issues
        re-run tests
    else:
        create/update PR, tag 'agent-ready'
        notify Slack
        exit
```
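The gating step of that loop can be sketched as a small shell helper for `factory/self-review/review-loop.sh`. This is a hypothetical sketch: the flat score-object format and the `review_passes` name are assumptions, not the implemented rubric.

```shell
# Hypothetical gating helper for factory/self-review/review-loop.sh.
# Assumes Opus outputs a flat JSON object of criterion scores, e.g.
# {"correctness": 5, "tests": 4, "style": 3}.
review_passes() {
  local json="$1" min
  # Pull out every numeric score and take the minimum.
  min=$(printf '%s' "$json" | grep -o ':[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*' | sort -n | head -n 1)
  [ -n "$min" ] && [ "$min" -ge 4 ]
}

if review_passes '{"correctness": 5, "tests": 4, "style": 4}'; then
  echo "PASS: create/update PR, tag agent-ready"
else
  echo "FAIL: send back to Sonnet for fixes"
fi
```

In the real loop `jq` would be a more robust parser; the plain-shell version just keeps the sketch dependency-free.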
Invocation examples:

```bash
# Coding task from Linear ticket
gcloud workflows run factory --data='{"job":"solve-linear-ticket","ticket_id":"abc123","ticket_identifier":"IX-123"}'

# Client onboarding
gcloud workflows run factory --data='{"job":"onboard-client","client_domain":"acme.com","mode":"full"}'

# Daily chatbot triage (triggered by Cloud Scheduler)
gcloud workflows run factory --data='{"job":"solve-chatbot-issues"}'
```
## Local Development

A CLI wrapper (`rose-factory`) makes it easy to run jobs locally in Docker:

```bash
#!/usr/bin/env bash
# rose-factory — run factory jobs locally
set -euo pipefail

usage() {
  echo "Usage: rose-factory <job> [options]"
  echo ""
  echo "Jobs:"
  echo "  solve-linear-ticket --ticket IX-123"
  echo "  onboard-client --domain acme.com [--mode full|eval-only|update-skills]"
  echo "  solve-chatbot-issues"
  echo ""
  echo "Options:"
  echo "  --no-docker    Run directly without Docker (requires Claude CLI installed)"
  exit 1
}

[[ $# -lt 1 ]] && usage
JOB="$1"; shift

NO_DOCKER=false
EXTRA_ENV=()
while [[ $# -gt 0 ]]; do
  case "$1" in
    --ticket)    EXTRA_ENV+=(-e "TICKET_ID=$2" -e "BRANCH=feature/$2"); shift 2 ;;
    --domain)    EXTRA_ENV+=(-e "CLIENT_DOMAIN=$2"); shift 2 ;;
    --mode)      EXTRA_ENV+=(-e "MODE=$2"); shift 2 ;;
    --no-docker) NO_DOCKER=true; shift ;;
    *)           echo "Unknown option: $1"; usage ;;
  esac
done

if [ "$NO_DOCKER" = true ]; then
  # Run directly — extract env vars from the EXTRA_ENV flags
  for e in "${EXTRA_ENV[@]}"; do
    [[ "$e" == "-e" ]] && continue
    export "$e"
  done
  export JOB
  exec ./factory/run.sh
fi

docker compose -f factory/docker-compose.yml run --rm \
  -e "JOB=$JOB" "${EXTRA_ENV[@]}" \
  factory
```
Examples:

```bash
# Run a Linear ticket locally in Docker
rose-factory solve-linear-ticket --ticket IX-123

# Onboard a client without Docker
rose-factory onboard-client --domain acme.com --mode full --no-docker

# Chatbot triage
rose-factory solve-chatbot-issues
```
## Infrastructure

### Container

```dockerfile
# factory/Dockerfile
FROM node:22-bookworm
RUN npm install -g @anthropic-ai/claude-code
RUN npx playwright install --with-deps chromium
RUN npm install -g supabase
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
```
```yaml
# factory/docker-compose.yml
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent
```
### Directory Structure

```text
factory/
├── Dockerfile                # Container image (Claude Code + Playwright + Supabase CLI)
├── docker-compose.yml        # Single factory service
├── run.sh                    # Job dispatcher entrypoint
├── self-review/
│   ├── rubric.md             # Scoring criteria
│   └── review-loop.sh        # Self-review orchestrator
├── orchestration/
│   ├── factory.yaml          # Cloud Workflow definition
│   └── notify.sh             # Slack notification helper
└── seeds/
    ├── base.sql              # Empty schema, minimal seed
    ├── auth-users.sql        # Admin + regular user with tokens
    ├── content-populated.sql # Realistic content volume
    └── multi-tenant.sql      # 2+ orgs with separate data
```
### Authentication

Max subscription auth via `ANTHROPIC_AUTH_TOKEN` — $0/run (included in the subscription).

```bash
# Generate a long-lived token and store it in GCP Secret Manager
./scripts/secrets/refresh-claude-token.sh
```

The script runs `claude setup-token` (opens a browser), then prompts you to paste the token and stores it in Secret Manager. The token is injected into Docker containers at runtime via an env var:

```bash
docker run \
  -e ANTHROPIC_AUTH_TOKEN="$(gcloud secrets versions access latest --secret=CLAUDE_AUTH_TOKEN)" \
  rose-agent claude -p "..."
```

Note: `claude -p` inside Docker is the official CLI binary, so subscription OAuth works. The token expires periodically — refresh it by re-running the script.

All secrets are injected via GCP Secret Manager:

| Secret | Used By |
|---|---|
| `CLAUDE_AUTH_TOKEN` | Max subscription auth ($0/run) |
| `LINEAR_API_KEY` | Linear MCP server |
| `GITHUB_TOKEN` | Git push, PR creation |
| `SLACK_WEBHOOK_URL` | Notifications |
| `LANGFUSE_SECRET_KEY` | LLM observability |

### GCE VM (one-time setup)

```bash
gcloud compute instances create rose-agent-runner \
  --zone=europe-west9-a \
  --machine-type=e2-medium \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --boot-disk-size=50GB \
  --image-family=cos-stable \
  --image-project=cos-cloud
```
## Orchestration — GCP Cloud Workflows

Why Cloud Workflows (not Airflow):

| | Cloud Workflows | Cloud Composer (Airflow) |
|---|---|---|
| Cost | ~$0.001/execution | ~$250/mo minimum (GKE + Cloud SQL) |
| Setup | YAML file, deploy in 1 min | 30+ min environment creation |
| Long-running | Up to 1 year timeout | Unlimited |
| Maintenance | Zero (serverless) | Cluster upgrades, DAG syncing |
| Fit | Small team, simple flows | Large team, complex DAGs |

Cloud Composer is overkill for our use case. Cloud Workflows is serverless, costs pennies, and handles the linear flow (trigger → start VM → run container → notify) perfectly.
Factory workflow (single workflow, job dispatched via params):

```yaml
# factory/orchestration/factory.yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          - job_label: ${args.job + " — " + default(args.ticket_identifier, default(args.client_domain, ""))}
    - notify_start:
        call: http.post
        args:
          url: ${SLACK_WEBHOOK_URL}
          body:
            text: ${"🏭 Factory started: " + job_label}
    - run_job:
        try:
          steps:
            - start_vm:
                call: googleapis.compute.v1.instances.start
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - run_agent:
                call: googleapis.compute.v1.instances.runCommand
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
                  command: >
                    cd /app && git fetch origin
                    && docker compose run --rm factory
                  env: ${args}
                result: agent_result
            - notify_done:
                call: http.post
                args:
                  url: ${SLACK_WEBHOOK_URL}
                  body:
                    text: ${"✅ Factory done: " + job_label}
            - stop_vm:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
        except:
          as: e
          steps:
            - notify_failure:
                call: http.post
                args:
                  url: ${SLACK_WEBHOOK_URL}
                  body:
                    text: ${"🚨 Factory FAILED: " + job_label + " — Error: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${GCP_PROJECT_ID}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}
```
## Triggers

Triggers are decoupled from jobs — any trigger can invoke any job via Cloud Workflows.

| Trigger | Source | Job | How |
|---|---|---|---|
| Linear `ai-ready` label | Linear webhook → Cloud Function | Routed by project | Cloud Function maps project → job (see below) |
| GitHub `repository_dispatch` | GitHub webhook | `solve-linear-ticket` | Optional — if we want GitHub issues to trigger runs |
| Manual CLI | Developer's terminal | Any | `gcloud workflows run factory --data='{"job":"...","ticket_id":"IX-123"}'` |
| Daily cron | Cloud Scheduler | `solve-chatbot-issues` | Daily triage of chatbot issues from Sentry/Langfuse |
| Weekly cron | Cloud Scheduler | `onboard-client` | Scheduled eval-only runs for all active clients |
| KB update | Document upload webhook | `onboard-client` | `update-skills` mode for the changed client |

Linear `ai-ready` routing: one label, one webhook — the Cloud Function routes by Linear project:

| Linear Project | Job |
|---|---|
| "Onboarding" | `onboard-client` (extracts client domain from ticket) |
| Any other project | `solve-linear-ticket` (default) |

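The routing rule amounts to a one-case mapping. A sketch in shell (the real Cloud Function is not shell; the `route_job` name is hypothetical):

```shell
# Hypothetical sketch of the project → job routing rule.
route_job() {
  case "$1" in
    "Onboarding") echo "onboard-client" ;;
    *)            echo "solve-linear-ticket" ;;
  esac
}

route_job "Onboarding"       # -> onboard-client
route_job "Website Agent"    # -> solve-linear-ticket
```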
Why triggers are flexible (not locked to Linear or GitHub):

- The same `ai-ready` label works across all projects — routing is automatic
- Sometimes you just want to kick off a run from the CLI without creating a ticket
- Cron-based jobs (chatbot triage, re-evaluation) don't map to any issue tracker event
- GitHub integration is nice-to-have (auto-trigger from issue labels) but not required

## Notifications & Supervision

### Slack Notifications

Every factory run posts to a `#agent-factory` Slack channel:

| Event | Message |
|---|---|
| Run started | Factory type, ticket/client, branch |
| Progress update | Every 15 min during long runs (heartbeat) |
| Self-review scores | Per-criterion scores, pass/fail |
| Run completed | PR link or accuracy report |
| Run failed | Error summary, link to logs |
Implemented via a simple `notify.sh` helper that posts to a Slack webhook.
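A minimal sketch of what that helper might look like — event names and message formats are assumptions, and a real helper should JSON-escape the text (e.g. with `jq`) before posting:

```shell
# Hypothetical sketch of factory/orchestration/notify.sh as a function.
notify() {
  local event="$1" subject="$2"; shift 2
  local detail="$*" text
  case "$event" in
    started)   text="Factory started: ${subject}" ;;
    completed) text="Factory done: ${subject} ${detail}" ;;
    failed)    text="Factory FAILED: ${subject} ${detail}" ;;
    heartbeat) text="Heartbeat: ${subject} ${detail}" ;;
    *)         text="${event}: ${subject} ${detail}" ;;
  esac
  # Build the Slack payload (no escaping here — a real helper should escape)
  printf '{"text": "%s"}' "$text"
  # A real helper would POST instead of printing:
  # curl -s -X POST -H 'Content-Type: application/json' \
  #   -d "$(printf '{"text": "%s"}' "$text")" "$SLACK_WEBHOOK_URL"
}

notify started "IX-123"
```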
### Langfuse Observability

Every `claude -p` call is traced in Langfuse (already configured in the project). This gives us:
- Token usage per run — track cost per ticket / per client
- Latency breakdown — how long each agent turn takes
- Prompt versions — which skill versions were used
- Failure analysis — inspect the full agent conversation when something goes wrong
### Cost Tracking
| Component | Cost |
|---|---|
| Claude Sonnet 4.6 (Max subscription) | $0/run (included in subscription) |
| GCE e2-medium spot | ~$0.01/hr (~$0.003–0.04/run) |
| Cloud Workflows | ~$0.001/execution |
| Slack webhook | Free |
| Langfuse | Free tier (50k traces/mo) |
| Total per run | ~$0.01–0.05 compute only |
The Max subscription makes this extremely cheap — the main cost is the VM compute time. Still set `--max-turns` to prevent runaway loops that waste subscription quota.
### Supervision Dashboard
No custom dashboard needed initially. Use existing tools:
- Langfuse → LLM traces, cost, latency
- Cloud Logging → VM and container logs
- Slack → real-time notifications
- GitHub → PR status, CI checks
- Linear → ticket status updates
If we need a consolidated view later, build a simple Cloud Monitoring dashboard.
## Supabase Database Branching

Goal: each `solve-linear-ticket` run gets its own isolated database.

Use Supabase Branching to create ephemeral DB branches per agent run.

Seed strategy:
| Seed File | Contents | When to Use |
|---|---|---|
| `base.sql` | Schema + minimal reference data | Default for all runs |
| `auth-users.sql` | Admin + regular users with mock tokens | Features touching auth |
| `content-populated.sql` | Realistic volume for UI testing | Frontend-heavy features |
| `multi-tenant.sql` | Multiple orgs with separate data | Multi-tenant features |

The `onboard-client` and `solve-chatbot-issues` jobs don't need DB branching — they use the staging environment directly.
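Seed selection could be as simple as matching ticket labels to a file from `factory/seeds/`. The label names below are hypothetical — whatever convention the Linear workspace uses:

```shell
# Hypothetical label → seed-file mapping for solve-linear-ticket runs.
select_seed() {
  local labels="$1"
  case "$labels" in
    *multi-tenant*)  echo "multi-tenant.sql" ;;
    *auth*)          echo "auth-users.sql" ;;
    *frontend*|*ui*) echo "content-populated.sql" ;;
    *)               echo "base.sql" ;;
  esac
}

select_seed "backend,auth"   # -> auth-users.sql
select_seed "bugfix"         # -> base.sql
```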
## Harness Engineering Best Practices
Based on Anthropic's harness engineering guidance and the broader cloud agent ecosystem, these practices apply to all jobs.
### CLI vs SDK — When to Use Which

| | CLI (`claude -p`) | Agent SDK (`@anthropic-ai/claude-agent-sdk`) |
|---|---|---|
| When | Scripts, CI/CD, existing skills, agent teams | Custom agent loops, programmatic control, non-coding agents |
| Pros | Zero code, reads `.mcp.json` + skills automatically, `--resume` for multi-turn | Full event streaming, tool approval callbacks, session management |
| Cons | Less control over the agent loop | Each user = separate subprocess, locked to Anthropic |
| Our choice | CLI for all jobs — skills and MCP servers work automatically | Consider the SDK later for non-coding agents (legal review, data analysis) |

The SDK spawns the same CLI process under the hood. For our use case, the CLI is simpler and gets all skills/MCP servers for free.
Harness quality matters: Anthropic's benchmarks show the same Opus model scoring 78% with Claude Code's harness vs 42% with a naive harness on the CORE benchmark — a 36-point gap from harness engineering alone.
### Safety Layers (Defense in Depth)

`--dangerously-skip-permissions` is required in headless mode (no human to click "Allow"), but we mitigate risk through layered defenses:

- Layer 1: Container isolation (microVM or Docker with no host socket)
- Layer 2: Network allow/deny lists (only approved endpoints)
- Layer 3: `--allowedTools` whitelist (restrict available tools)
- Layer 4: PreToolUse hooks (block specific dangerous patterns)
- Layer 5: Filesystem boundaries (agent can only write to the workspace)
- Layer 6: Cost circuit breakers (kill the run after $X or N errors)

Concrete implementation:

```bash
# Safer than a naked --dangerously-skip-permissions:
# Note: --allowedTools uses permission rule syntax (space before * for prefix match)
claude -p "..." \
  --model sonnet \
  --max-turns 100 \
  --output-format stream-json \
  --allowedTools "Read,Write,Edit,Glob,Grep,Bash(git *),Bash(npm run *),Bash(just *),mcp__playwright *" \
  --dangerously-skip-permissions
```
Alternatives to `--dangerously-skip-permissions` (from safest to most permissive):

| Method | Best for |
|---|---|
| Native sandbox (Seatbelt/bubblewrap) | Day-to-day dev |
| `--allowedTools` per-session whitelist | CI/CD with specific needs |
| `settings.json` allowlists | Team-wide policies |
| PreToolUse hooks | Custom business logic |
| `--permission-prompt-tool` (MCP) | Enterprise audit trails |
| `--permission-mode acceptEdits` | Moderate autonomy |
| Docker sandbox + `--dangerously-skip-permissions` | Fully automated (our choice) |

PreToolUse hooks as guardrails (hooks fire even in bypass mode). In `.claude/settings.json`:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "python3 -c \"import sys,json; cmd=json.load(sys.stdin).get('tool_input',{}).get('command',''); sys.exit(2 if any(w in cmd for w in ['rm -rf /','docker.sock','curl | sh','wget | bash']) else 0)\""
        }]
      }
    ]
  }
}
```

Hook commands receive the tool call as JSON on stdin; exiting with code 2 blocks the tool call.
### Docker Container Hardening

```dockerfile
FROM node:22-bookworm

# Claude Code CLI
RUN npm install -g @anthropic-ai/claude-code

# Playwright with Chromium
RUN npx playwright install --with-deps chromium

# Supabase CLI for DB branching
RUN npm install -g supabase

# Python tooling
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Non-root user for safety
RUN useradd -m agent
USER agent
WORKDIR /app
COPY --chown=agent:agent . .
```
Critical rules:

| Rule | Why |
|---|---|
| Never mount `/var/run/docker.sock` | Agent could escape the sandbox via the Docker API |
| Use `--network=none` or an allow-list | Prevent data exfiltration via prompt injection |
| Non-root user inside the container | Limit blast radius of filesystem operations |
| No host volume mounts for secrets | Use env vars from Secret Manager instead |
| Set resource limits | `--memory=4g --cpus=2` to prevent resource exhaustion |

```yaml
# factory/docker-compose.yml (hardened)
services:
  factory:
    build: .
    entrypoint: ["./factory/run.sh"]
    env_file: .env.agent
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
    networks:
      - agent-net
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp

networks:
  agent-net:
    driver: bridge
    # Allow-list outbound traffic via iptables rules on the host
```
### Output Format & Error Handling

Use `stream-json` for real-time monitoring, `json` for structured results:

```bash
# Stream events in real-time (for heartbeat/progress monitoring)
claude -p "..." --output-format stream-json 2>&1 | while IFS= read -r line; do
  TYPE=$(echo "$line" | jq -r '.type // empty')
  case "$TYPE" in
    "assistant")
      # Agent is producing output — heartbeat is alive
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID"
      ;;
    "result")
      # Final result — extract cost and session info
      COST=$(echo "$line" | jq -r '.cost_usd')
      TURNS=$(echo "$line" | jq -r '.num_turns')
      ./factory/orchestration/notify.sh completed "$TICKET_ID" "$COST" "$TURNS"
      ;;
    "error")
      ./factory/orchestration/notify.sh failed "$TICKET_ID" "$(echo "$line" | jq -r '.error')"
      ;;
  esac
done

# Check claude's exit code, not the while loop's — hence PIPESTATUS
EXIT_CODE="${PIPESTATUS[0]}"
if [ "$EXIT_CODE" -ne 0 ]; then
  ./factory/orchestration/notify.sh failed "$TICKET_ID" "Exit code: $EXIT_CODE"
fi
```
### Cost Circuit Breakers

Prevent runaway agent spending:

```bash
# factory/run.sh — with circuit breakers
MAX_COST_USD=10.00
MAX_DURATION_SECONDS=7200   # 2 hours
# MAX_CONSECUTIVE_ERRORS=5  # error-count breaker omitted from this sketch
STREAM_LOG=$(mktemp)
STARTED_AT=$(date +%s)

# Run claude in the background so we can kill it by PID — `kill %1`
# inside a pipeline subshell does not work (no job control in scripts).
claude -p "..." --output-format stream-json > "$STREAM_LOG" 2>&1 &
CLAUDE_PID=$!

while kill -0 "$CLAUDE_PID" 2>/dev/null; do
  sleep 10

  # Check duration
  ELAPSED=$(( $(date +%s) - STARTED_AT ))
  if [ "$ELAPSED" -gt "$MAX_DURATION_SECONDS" ]; then
    echo "CIRCUIT BREAKER: Duration exceeded ${MAX_DURATION_SECONDS}s"
    kill "$CLAUDE_PID"
    exit 1
  fi

  # Check cost (highest cost_usd seen in the stream-json events so far)
  COST=$(jq -rs '[.[].cost_usd // 0] | max' "$STREAM_LOG" 2>/dev/null || echo 0)
  if awk -v c="$COST" -v m="$MAX_COST_USD" 'BEGIN { exit !(c > m) }'; then
    echo "CIRCUIT BREAKER: Cost exceeded \$${MAX_COST_USD}"
    kill "$CLAUDE_PID"
    exit 1
  fi
done
```
### Heartbeat & Liveness Monitoring

```bash
# factory/orchestration/heartbeat.sh — run as a background process.
# The main loop must update LAST_ACTIVITY (e.g. LAST_ACTIVITY=$(date +%s))
# whenever the agent emits output.
HEARTBEAT_INTERVAL=900   # 15 minutes
HEARTBEAT_TIMEOUT=1800   # 30 minutes — agent is stuck if no output for this long
LAST_ACTIVITY=$(date +%s)

monitor_heartbeat() {
  while true; do
    sleep "$HEARTBEAT_INTERVAL"
    IDLE_TIME=$(( $(date +%s) - LAST_ACTIVITY ))
    if [ "$IDLE_TIME" -gt "$HEARTBEAT_TIMEOUT" ]; then
      ./factory/orchestration/notify.sh stuck "$TICKET_ID" "No activity for ${IDLE_TIME}s"
      # Don't kill — a human decides
    else
      ./factory/orchestration/notify.sh heartbeat "$TICKET_ID" "Running (${IDLE_TIME}s since last activity)"
    fi
  done
}
```
### Long-Running Agent Sessions

Per Anthropic's guidance, context windows are limited and complex projects can't complete in a single session. Claude Code handles this via compaction — automatically summarizing context when approaching limits. Key practices:

- Let compaction work: don't set `--max-turns` too low. Sonnet can do 100+ turns with compaction
- Checkpoint commits: configure the agent to commit after each sub-task, creating rollback points
- Session continuity: if a session fails mid-way, re-run with `--resume` to continue from the last checkpoint
- CLAUDE.md as persistent memory: the agent re-reads CLAUDE.md after compaction, so project instructions survive context resets

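Session continuity can be sketched as: persist the `session_id` from each run's JSON result, and pass it back via `--resume` on retry. The file path and the `sed`-based extraction below are illustrative; a sample result event stands in for real claude output.

```shell
# Persist the session id from a run's JSON result so a failed run
# can be resumed later.
RESULT_FILE=$(mktemp)
printf '%s' '{"type":"result","session_id":"abc-123","num_turns":42}' > "$RESULT_FILE"

# Extract session_id (jq would normally be used; sed keeps this dependency-free)
SESSION_ID=$(sed -n 's/.*"session_id":"\([^"]*\)".*/\1/p' "$RESULT_FILE")
echo "$SESSION_ID"    # -> abc-123

# On retry (illustrative, not executed here):
# claude -p "Continue from the last checkpoint" --resume "$SESSION_ID"
```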
### Cloud Workflows Error Handling

```yaml
# factory/orchestration/factory.yaml — with proper error handling
main:
  params: [args]
  steps:
    - init:
        assign:
          - job: ${args.job}
          - job_label: ${args.job + " — " + default(args.ticket_identifier, default(args.client_domain, ""))}
          - project_id: ${sys.get_env("GCP_PROJECT_ID")}
    - notify_start:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory started - " + job_label}
    - start_vm:
        try:
          call: googleapis.compute.v1.instances.start
          args:
            project: ${project_id}
            zone: europe-west9-a
            instance: rose-agent-runner
        retry:
          predicate: ${default_retry_predicate}
          max_retries: 3
          backoff:
            initial_delay: 2
            max_delay: 30
            multiplier: 2
    - wait_for_vm:
        call: sys.sleep
        args:
          seconds: 30
    - run_agent:
        try:
          call: http.post
          args:
            url: ${"https://compute.googleapis.com/compute/v1/projects/" + project_id + "/zones/europe-west9-a/instances/rose-agent-runner/runCommand"}
            auth:
              type: OAuth2
            body:
              command: >
                cd /app && git fetch origin
                && docker compose run --rm factory
              env: ${args}
            timeout: 7200   # 2-hour timeout
          result: agent_result
        except:
          as: e
          steps:
            - notify_error:
                call: http.post
                args:
                  url: ${sys.get_env("SLACK_WEBHOOK_URL")}
                  body:
                    text: ${"Factory FAILED - " + job_label + " - Error: " + e.message}
            - stop_vm_on_error:
                call: googleapis.compute.v1.instances.stop
                args:
                  project: ${project_id}
                  zone: europe-west9-a
                  instance: rose-agent-runner
            - raise_error:
                raise: ${e}
    - notify_done:
        call: http.post
        args:
          url: ${sys.get_env("SLACK_WEBHOOK_URL")}
          body:
            text: ${"Factory done - " + job_label}
    - stop_vm:
        call: googleapis.compute.v1.instances.stop
        args:
          project: ${project_id}
          zone: europe-west9-a
          instance: rose-agent-runner
    - secrets_access:
        # Secrets are injected into the container via .env.agent,
        # generated at VM boot from Secret Manager
        assign:
          - note: "Secrets loaded from GCP Secret Manager at container start"
```
### Multi-Agent Coordination

When running multiple factory instances in parallel:

| Concern | Solution |
|---|---|
| Git conflicts | Each agent gets its own feature branch — never share branches |
| Shared config files | Agents work on isolated worktrees (git worktree add) |
| DB conflicts | Each solve-linear-ticket run gets its own Supabase branch |
| Resource contention | One agent per VM, or multiple VMs with distinct names |
| Merge conflicts on shared files | Detect hotspot files (routes, configs) — flag for human resolution |
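The worktree isolation above can be sketched with plain git — one worktree and feature branch per agent run, so parallel agents never share a working copy. The throwaway repo and branch names below are for illustration only:

```shell
# Sketch: one git worktree + branch per agent run.
set -e
REPO=$(mktemp -d)
git -C "$REPO" init -q
git -C "$REPO" -c user.email=agent@example.com -c user.name=agent \
  commit -q --allow-empty -m "init"

# Each agent run gets its own worktree and feature branch
WT="${REPO}-IX-123"
git -C "$REPO" worktree add "$WT" -b feature/IX-123 >/dev/null 2>&1
git -C "$WT" rev-parse --abbrev-ref HEAD    # -> feature/IX-123
```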
### Rollback Strategy

```text
Agent creates checkpoint commits after each sub-task
        ↓
Self-review fails → git reset to last good checkpoint
        ↓
Human review fails → git revert the entire branch
        ↓
Production issue → branch was never merged to main (always PR-based)
```

The key insight: agents should never push directly to `develop` or `main` — always via a feature branch + PR. The PR is the human approval gate.
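The checkpoint-and-reset step can be demonstrated end-to-end with git alone. The `checkpoint-1` tag naming is hypothetical; a throwaway repo stands in for the real workspace:

```shell
# Sketch: checkpoint commit + reset, as the self-review loop would do.
set -e
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email agent@example.com
git config user.name agent

# Sub-task 1 passes tests → checkpoint
echo "good implementation" > feature.py
git add feature.py
git commit -qm "sub-task 1"
git tag checkpoint-1

# Sub-task 2 fails self-review → roll back to the last good checkpoint
echo "bad implementation" > feature.py
git commit -qam "sub-task 2"
git reset --hard -q checkpoint-1

cat feature.py    # -> good implementation
```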
## Leveraging Existing Skills

| Skill | Job | Role |
|---|---|---|
| `solve-linear-ticket` | `solve-linear-ticket` | Parse the ticket into sub-issues, implement each |
| `test-backend` / `test-frontend` | `solve-linear-ticket` | Run tests after implementation |
| `mypy` | `solve-linear-ticket` | Type checking (mandatory per CLAUDE.md) |
| `create-pr` | `solve-linear-ticket` | Create the PR when done |
| `pr-fix` | `solve-linear-ticket` | Address review comments from the human reviewer |
| `smoke` | `solve-linear-ticket` | Run smoke tests before commit |
| `architecture-review` | `solve-linear-ticket` | Part of the self-review rubric |
| `simplify` | `solve-linear-ticket` | Review changed code for quality |
| `solve-chatbot-issues` | `solve-chatbot-issues` | Daily triage: detect chatbot issues, create fixes, open PRs |
| `onboard-client` | `onboard-client` | Full client onboarding workflow (config, skills, evals) |
| `create-client-skills` | `onboard-client` | Create/update skills and eval dataset for a client |
| `prompt-engineering` | `onboard-client` | Fix individual skill failures |
| `browse-playground` | Any | Visual verification via Playwright |

## Consequences

### Positive

- Cheap: Max subscription + spot VMs + Cloud Workflows = near-zero marginal cost per run
- Three use cases covered: coding tasks, client onboarding, and daily chatbot issue triage
- Flexible triggers: not locked to one system — Linear, GitHub, cron, and CLI all work
- Observable: Slack notifications + Langfuse traces = know what's happening without watching
- Incremental: each job and each trigger can be built independently

### Negative

- Complexity: Cloud Workflows, triggers, notifications — more moving parts
- Sonnet limitations: some complex tasks may need Opus (self-review already uses it)
- Token refresh: Max subscription OAuth tokens expire — need monitoring and a process to refresh (re-running the token script)
- Trust calibration: the team needs to learn what each job handles well

### Neutral

- Cloud Workflows is a new GCP service for the team, but it's simple YAML
- Adding a new job is just a new `case` in `run.sh` — minimal new logic
- Existing CI/CD workflows are unaffected; the factory is additive

## Alternatives Considered

### 1. Airflow / Cloud Composer

Rejected: $250+/mo minimum, complex setup, overkill for linear workflows. Cloud Workflows does the same job for pennies.
### 2. GitHub Actions as sole orchestrator
Possible but not required. GitHub Actions can trigger Cloud Workflows via gcloud CLI. Useful if we want GitHub issue labels to trigger runs. But the factory doesn't depend on GitHub — it's triggered by Cloud Workflows which can be invoked from anywhere.
### 3. Prefect / Temporal
Deferred: Good alternatives if Cloud Workflows proves too limited. Prefect has a nice UI and Python-native DAGs. Temporal is better for long-running stateful workflows. Both require self-hosted infrastructure or paid cloud tiers.
### 4. Opus for everything
Rejected: Opus is 5-10x slower and would burn through Max subscription limits faster. Sonnet 4.6 is good enough for implementation. Opus reserved for judgment calls (self-review scoring).
### 5. Separate repo for factory
Deferred: Co-location keeps factory evolving with codebase. Extract later if it grows.
### 6. Composio / CCPM / TSK / Ralphy
Same analysis as before — none provide the full loop we need. Ralphy remains an optional inner-loop wrapper.
## Implementation Phases

### Phase 1: Factory Infra (Weeks 1-2)

- [ ] Create the `factory/` directory structure
- [ ] Build the Dockerfile (Claude Code + Playwright + Supabase CLI)
- [ ] Create the `factory/run.sh` job dispatcher
- [ ] Set up the GCE spot VM with Container-Optimized OS
- [ ] Store secrets in GCP Secret Manager
- [ ] Create the `notify.sh` Slack helper
- [ ] Test: run `claude -p` in the container with Max subscription auth

### Phase 2: First Job — `solve-linear-ticket` (Weeks 3-4)

- [ ] Add the `solve-linear-ticket` case to `run.sh`
- [ ] Implement the self-review loop (Sonnet implements, Opus reviews)
- [ ] Create the Cloud Workflow definition (`factory.yaml`)
- [ ] Test: manually trigger with a simple Linear ticket

### Phase 3: More Jobs (Weeks 5-6)

- [ ] Add the `onboard-client` case with 3 modes (full, eval-only, update-skills)
- [ ] Add the `solve-chatbot-issues` case
- [ ] Test: full onboarding run for an existing client
- [ ] Test: chatbot triage run

### Phase 4: Triggers & Scheduling (Weeks 7-8)

- [ ] Extend the Linear webhook Cloud Function to trigger the factory workflow
- [ ] Set up Cloud Scheduler for daily `solve-chatbot-issues` and weekly `onboard-client` eval-only runs
- [ ] Add manual CLI trigger scripts
- [ ] Optional: GitHub `repository_dispatch` integration
- [ ] Test: end-to-end from Linear label to PR

### Phase 5: Supabase Branching (Weeks 9-10)

- [ ] Enable Supabase Branching on the project
- [ ] Create seed files (`base.sql`, `auth-users.sql`, etc.)
- [ ] Add seed selection logic to the `solve-linear-ticket` job
- [ ] Test: agent creates and uses a Supabase branch

### Phase 6: Polish & Observe (Weeks 11-12)

- [ ] Add heartbeat notifications (every 15 min during long runs)
- [ ] Build cost tracking (tokens per run via the Langfuse API)
- [ ] Run `solve-linear-ticket` on 5-10 real tickets, collect metrics
- [ ] Run `onboard-client` on 3-5 clients, compare with manual results
- [ ] Run `solve-chatbot-issues` for a week, assess triage quality
- [ ] Tune self-review thresholds and max iterations

## References

### Core

- Claude Code Headless Mode
- Claude Code Best Practices
- Claude Agent SDK Overview
- Effective Harnesses for Long-Running Agents (Anthropic)
- Claude Code Sandboxing (Anthropic Engineering)

### Infrastructure

### Integrations

### Community & Tools

- Sam Hessanower — Agentic Dev Pipeline (AI Tinkerers, Aug 2024)
- Ralphy — Autonomous Agent Loop
- learn-claude-code — Harness Engineering
- everything-claude-code — Agent Harness Optimization
- claude-code-harness — Plan→Work→Review Cycle
- Open Harness — Model-Agnostic Agent Harness
- DiffBack — Granular AI Agent Rollback
- E2B — Cloud Sandboxes for Agents
- Faker.js — Fake Data Generator