
ADR: Productized Website Scraping Pipeline for Client Onboarding

Status

Draft

Date

2026-06-11

Context

Client onboarding needs a repeatable way to turn a public website into clean markdown for the Website Agent knowledge base. Today this works, but it is a local, operator-driven workflow:

  1. An agent follows the rose-scrape-website skill.
  2. It maps the site from a sitemap or same-domain crawl, samples sections, asks the operator which sections to keep, scrapes pages into backend/apps/scraping/data/<domain>/, writes a site-specific clean.py, runs generic cleanup and RAG-friendly checks, then hands off markdown files for ingestion.
  3. The final knowledge is usually shared through Google Drive and then consumed by the existing document-loader job.

That workflow produced useful first-version results because it encoded the right instincts:

  • Curate URLs before crawling. Blind recursive crawling pulls tag pages, author pages, stale blogs, event pages, form pages, and other low-signal material.
  • Sample pages by section before scraping everything.
  • Preserve source URLs and frontmatter.
  • Remove boilerplate, CTA blocks, duplicate carousels, nav/footer chrome, and CMS artifacts.
  • Convert RAG-hostile structures like feature tables into self-contained prose.
  • Validate against the LightRAG ingestion path instead of treating markdown as cosmetic output.

The current process has clear limits:

  • It runs locally in Claude Code or Codex and depends on ad hoc scripts.
  • There is no durable run state, review queue, or artifact lifecycle.
  • There is no backoffice UI to inspect what was mapped, included, excluded, scraped, cleaned, or failed.
  • Adding or removing a page after the scrape requires manual file edits.
  • Google Drive is acting as a handoff location, not as a structured artifact store.
  • The metadata contract has drifted: existing scraper output and active validators use source_url, while some RAG-facing guidance mentions url and source_subtype.
  • The current document loader uses file/document identifiers in ways that can collide if two pages share the same filename stem.

The downstream constraints are concrete:

  • The document-loader Cloud Run job is the existing writer into LightRAG, Neo4j, and MongoDB.
  • The loader strips markdown frontmatter before indexing, but reinjects title and description into indexed content and uses source_url as the document source.
  • Incremental ingestion is content-hash based.
  • Removed documents need tombstones with <!-- nullified --> or old chunks can remain indexed.
  • The active chunker is heading-aware: it splits markdown on H1-H3, prepends a [Section: ...] context line, leaves H4+ inline, merges sections under 16 tokens, and otherwise falls back to 512-token chunks with overlap.
  • Supabase is one shared project across test, staging, and production, so scrape metadata writes are production writes regardless of IX_ENVIRONMENT.
  • Long scraped documents are not FAQ rows. config.knowledge_faqs is for curated short Q&A retrieval entries; website pages should become reviewed markdown documents or structured lookup rows.
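
The heading-aware chunking described above can be illustrated with a minimal sketch. This is not the loader's actual implementation: the function name split_by_headings, the whitespace-based token estimate, and the choice to merge a short section into the previous chunk are all assumptions, and the 512-token fallback with overlap is left out.

```python
import re

HEADING_RE = re.compile(r"^(#{1,3})\s+(.*)$")  # H1-H3 only; H4+ stays inline
MIN_TOKENS = 16  # sections below this size are merged (threshold from the text)


def split_by_headings(markdown: str) -> list[str]:
    """Split markdown on H1-H3 and prepend a [Section: ...] context line."""
    sections = []  # list of (heading title, body lines)
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            if title or any(ln.strip() for ln in body):
                sections.append((title, body))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or any(ln.strip() for ln in body):
        sections.append((title, body))

    chunks: list[str] = []
    for title, body in sections:
        text = "\n".join(body).strip()
        chunk = f"[Section: {title}]\n{text}" if title else text
        # Assumption: whitespace tokens approximate the real tokenizer, and a
        # short section is merged into the previous chunk.
        if chunks and len(text.split()) < MIN_TOKENS:
            chunks[-1] = f"{chunks[-1]}\n\n{chunk}"
        else:
            chunks.append(chunk)
    return chunks
```

The sketch shows why H4+ headings never start a chunk and why very short H3 sections do not survive as standalone retrieval units.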

Firecrawl is a good fit for the crawling layer. Current Firecrawl v2 docs expose:

  • Map API for URL inventory with sitemap options, subdomain controls, query-parameter filtering, limits, and titles/descriptions.
  • Crawl/Scrape options such as markdown output and onlyMainContent.
  • Monitor API with scheduled scrape/crawl targets, webhooks, goal-based judging, and scrape options including onlyCleanContent.
  • Change tracking as a crawl format.

Firecrawl should be treated as the crawler and pre-cleaner, not as the whole knowledge-ingestion system. We still need Rose-specific page selection, source provenance, RAG-aware cleanup, review, versioning, and LightRAG handoff.

Decision

Build a productized Website Knowledge Scraping Pipeline with four separate responsibilities:

Backoffice UI
  -> scrape-run API / RPC
  -> Cloud Run scraping job
  -> Firecrawl map/crawl/scrape
  -> GCS versioned artifacts + Supabase review state
  -> approved corpus version
  -> existing document-loader job
  -> LightRAG / Neo4j / MongoDB

The scrape job produces a reviewed corpus artifact. The existing document-loader remains the only component that mutates LightRAG stores.

1. Use an explicit workflow, not a general agent framework

The first production version should be a Cloud Run Job with an explicit state machine:

  1. mapping
  2. section_sampling
  3. page_triage
  4. scraping
  5. deterministic_preflight
  6. llm_cleanup
  7. validation_audit
  8. ready_for_review
  9. approved
  10. published
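
The ten states above can be sketched as a small typed state machine. The names RunState, next_state, and advance are hypothetical, and failure handling, retries, and review rejections are omitted; the sketch only shows the strictly forward ordering.

```python
from enum import Enum
from typing import Optional


class RunState(str, Enum):
    MAPPING = "mapping"
    SECTION_SAMPLING = "section_sampling"
    PAGE_TRIAGE = "page_triage"
    SCRAPING = "scraping"
    DETERMINISTIC_PREFLIGHT = "deterministic_preflight"
    LLM_CLEANUP = "llm_cleanup"
    VALIDATION_AUDIT = "validation_audit"
    READY_FOR_REVIEW = "ready_for_review"
    APPROVED = "approved"
    PUBLISHED = "published"


# Declaration order doubles as the allowed transition order.
_ORDER = list(RunState)


def next_state(current: RunState) -> Optional[RunState]:
    """Return the state that follows `current`, or None once published."""
    idx = _ORDER.index(current)
    return _ORDER[idx + 1] if idx + 1 < len(_ORDER) else None


def advance(current: RunState, target: RunState) -> RunState:
    """Allow only the single forward transition; reject skips and rewinds."""
    if next_state(current) is not target:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Because every transition goes through one function, each state change can be persisted and inspected before the next step starts.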

We should not introduce Mastra, LangChain, or a custom autonomous agent loop for this. The workflow is sequential and inspectable; the uncertainty lives in bounded LLM decisions, not in tool planning.

Headless Codex, Claude Code, OpenCode, or similar developer agents may be used later as an implementation detail for bounded content-review workers, but they must not be the primary production contract. In production, no worker should generate or execute arbitrary per-client Python cleanup code by default. Site-specific rules should be stored as data and applied by known cleanup functions where possible; when they cannot be expressed safely, the page should be flagged for review instead of running new code.

Example:

  • Good production shape: "classify these 400 mapped URLs into include/exclude with reasons and confidence."
  • Risky production shape: "ask a coding agent to write and run a custom clean.py against customer artifacts."
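
A bounded classification call like the good shape above is easy to guard with plain validation code. The sketch below assumes the model returns a JSON array of objects with url, decision, reason, and confidence keys; that response shape and the names TriageDecision and parse_triage_output are illustrative assumptions, not an existing contract.

```python
import json
from dataclasses import dataclass

ALLOWED_DECISIONS = {"include", "exclude", "needs_review"}


@dataclass(frozen=True)
class TriageDecision:
    url: str
    decision: str
    reason: str
    confidence: float


def parse_triage_output(raw_json: str, mapped_urls: set) -> list:
    """Validate model output; invalid or missing items become needs_review."""
    by_url = {}
    for item in json.loads(raw_json):
        url = item.get("url")
        if url not in mapped_urls:
            continue  # the model may not introduce URLs that were never mapped
        decision = item.get("decision")
        confidence = item.get("confidence")
        valid = (
            decision in ALLOWED_DECISIONS
            and isinstance(confidence, (int, float))
            and 0.0 <= confidence <= 1.0
            and bool(item.get("reason"))
        )
        by_url[url] = (
            TriageDecision(url, decision, item["reason"], float(confidence))
            if valid
            else TriageDecision(url, "needs_review", "invalid model output", 0.0)
        )
    # URLs the model skipped are surfaced for review rather than dropped.
    for url in mapped_urls - set(by_url):
        by_url[url] = TriageDecision(url, "needs_review", "missing from model output", 0.0)
    return [by_url[u] for u in sorted(by_url)]
```

The model only fills in fields; it never chooses tools or runs code, which keeps the uncertainty inside a record that can be reviewed.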

2. Use Firecrawl v2 for mapping, scraping, and monitoring

Firecrawl becomes the default crawler.

Mapping:

  • Call Firecrawl Map for the target domain.
  • Default to sitemap-inclusive mapping.
  • Ignore query parameters by default.
  • Do not include subdomains by default unless they are registered Rose domains or manually approved.
  • Store every mapped URL, title, description, source, and map metadata.
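
Independent of what Firecrawl filters on its side, the same mapping defaults can be enforced locally before a URL is stored. This is a minimal sketch using only the standard library; normalize_mapped_url is a hypothetical helper, and stripping trailing slashes and ports is an assumption about what "canonical" should mean here.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_mapped_url(
    url: str, target_host: str, allowed_hosts: frozenset = frozenset()
) -> Optional[str]:
    """Return a canonical URL, or None if the URL is out of scope."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    host = (parts.hostname or "").lower()
    # Subdomains and foreign hosts are rejected unless explicitly allowed.
    if host != target_host.lower() and host not in allowed_hosts:
        return None
    # Assumption: trailing slashes are not significant for this site.
    path = parts.path.rstrip("/") or "/"
    # Query string and fragment are dropped so variants collapse to one page.
    return urlunsplit((parts.scheme, host, path, "", ""))
```

For example, normalize_mapped_url("https://www.acme.com/pricing/?plan=smb", "www.acme.com") collapses to the plain pricing URL, while a blog subdomain returns None until it is added to allowed_hosts.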

Triage:

  • Group mapped URLs by path section and template shape.
  • Sample representative pages from each section.
  • Use a stronger model for section/page inclusion decisions.
  • Store decisions as structured records: include, exclude, needs_review, with reason, confidence, and evidence.
  • Apply a default recency policy to blog/news sections because LightRAG retrieval does not filter by publish date.
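
Grouping and sampling can stay fully deterministic. The sketch below groups by the first path segment only, which is a simplification of "path section and template shape"; the function names and the default sample size of three are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by first path segment; the root page goes under '/'."""
    sections = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        sections[segments[0] if segments else "/"].append(url)
    return dict(sections)


def sample_sections(
    sections: dict[str, list[str]], per_section: int = 3
) -> dict[str, list[str]]:
    """Pick a reproducible sample: the first N URLs of each section, sorted."""
    return {name: sorted(urls)[:per_section] for name, urls in sections.items()}
```

Sorting before slicing keeps the sample stable across runs, so a re-run shows the operator the same representative pages.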

Scraping:

  • Scrape only included URLs, plus manually added URLs.
  • Request markdown output.
  • Enable Firecrawl pre-cleaning options such as onlyMainContent and evaluate onlyCleanContent as an option, but do not rely on them as the final cleaner.
  • Persist Firecrawl metadata and raw markdown before any Rose cleanup.
  • Store failed pages with error details; do not silently drop them.

Monitoring:

  • After a corpus is published, create Firecrawl monitor records for approved page sets or crawl targets.
  • Firecrawl monitor/change events should start a delta scrape run.
  • Phase 1 monitor events create a draft corpus update; they do not auto-publish into LightRAG.
  • Auto-publish can be considered later for low-risk changes that pass all quality gates.

3. Store artifacts in GCS and state in Supabase

Use a private GCS bucket for large artifacts and immutable corpus versions. Supabase stores indexed metadata, review state, and pointers.

Proposed GCS layout:

gs://rose-scraping/
  domains/{domain}/
    runs/{run_id}/
      map/links.json
      map/sections.json
      samples/{page_id}.md
      raw/{page_id}.md
      clean/{page_id}.md
      reports/audit.json
      reports/llm-cleaning.jsonl
      manifest.json
    corpora/v{version}/
      manifest.json
      pages/{doc_id}.md
      tombstones/{doc_id}.md
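
Keeping the layout in one small module avoids hand-built paths scattered across the job. The following sketch mirrors the layout above; the RunPaths class and corpus_page_uri function are hypothetical names, and only a few of the artifact paths are shown.

```python
from dataclasses import dataclass

BUCKET = "gs://rose-scraping"


@dataclass(frozen=True)
class RunPaths:
    """URIs for the artifacts of a single scrape run."""

    domain: str
    run_id: str

    @property
    def prefix(self) -> str:
        return f"{BUCKET}/domains/{self.domain}/runs/{self.run_id}"

    def raw(self, page_id: str) -> str:
        return f"{self.prefix}/raw/{page_id}.md"

    def clean(self, page_id: str) -> str:
        return f"{self.prefix}/clean/{page_id}.md"

    @property
    def manifest(self) -> str:
        return f"{self.prefix}/manifest.json"


def corpus_page_uri(domain: str, version: int, doc_id: str) -> str:
    """URI of an approved page inside an immutable corpus version."""
    return f"{BUCKET}/domains/{domain}/corpora/v{version}/pages/{doc_id}.md"
```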

Supabase should get a dedicated operational schema, for example scraping, not new JSON blobs in config.client_configs.

Initial tables:

  • scraping.runs
      • domain, environment, target_url, status, requested_by, started_at, completed_at, failed_at, error_message
      • firecrawl options, model choices, GCS run prefix, selected corpus version
  • scraping.pages
      • run_id, domain, environment, page_id, canonical_url, normalized_path, section, title, description
      • mapped/sampled/scraped/cleaned status
      • decision, decision_source, decision_reason, confidence
      • raw_uri, clean_uri, content_hash, doc_id, source_subtype, language
      • quality flags: thin, duplicate, missing_h2, pipe_table, sensitive_marker, form_or_error
  • scraping.corpus_versions
      • domain, environment, version, source_run_id, status, gcs_prefix, manifest_uri
      • approved_by, approved_at, published_at, document_loader_run_id
  • scraping.page_overrides
      • domain, canonical_url, override_decision, reason, created_by, created_at
      • reused across future runs so the system learns from operator decisions
  • scraping.firecrawl_monitors
      • domain, monitor_id, target_type, target_config, schedule, enabled, last_check_summary

Backoffice reads these tables directly for listing and review, subject to RLS. Starting jobs, creating GCS objects, and calling Firecrawl must happen server-side, not from browser credentials. Prefer the admin/backoffice API for privileged actions because it already has Supabase bearer auth and per-domain access checks; direct Supabase calls are fine for low-risk listing and review updates.

4. Use LLM cleanup with deterministic guards

Firecrawl pre-cleaning reduces noise, but Rose still needs a RAG-aware cleanup layer. The LLM should be the primary cleaner for semantic decisions. Deterministic code should surround that LLM step as preflight, normalization, and validation, not attempt to encode every site's content quirks.

The production shape is:

Firecrawl pre-clean
  -> deterministic preflight and normalization
  -> LLM cleanup
  -> deterministic validation and audit

Stage A: deterministic preflight and normalization.

This does not mean every current clean.py can become a generic library function. The useful extraction boundary is lower-level: turn recurring operations into safe, composable primitives, then represent site-specific behavior as declarative rules and review flags.

Refactor the reusable parts of the existing skill scripts into importable library functions:

  • structural markdown reformatting
  • duplicate block detection and obvious repeated boilerplate removal
  • CMS-specific repairs, starting with Webflow FAQ and label/value fixes
  • frontmatter validation
  • thin-page, duplicate, non-English, form/error audits
  • PDF stub recovery where linked PDFs are the real content

Store client/page-specific cleanup as data, not as newly generated Python. For example:

remove_blocks_matching:
  - "Book a demo"
  - "Subscribe to our newsletter"
drop_sections:
  - heading: "Related articles"
    when_path_matches: "/blog/"
promote_faq_questions: true
flag_for_review_when:
  - contains_pricing_toggle
  - has_table_with_missing_headers

Some cleanup will remain too ad hoc to automate deterministically. For example, a pricing page where URL parameters switch between SMB and enterprise plans should not be "fixed" by a generic script; it should become needs_variant_review and either get a manually approved synthesis or a later variant-aware browser workflow.

Stage B: LLM cleanup.

Run a per-page LLM cleaner after deterministic preflight. Use small/cheap models for mechanical cleaning, and reserve stronger models for inclusion decisions, ambiguous pages, comparison pages, and variant-heavy pricing pages.

The LLM cleaner contract:

  • Input: raw markdown, cleaned draft markdown, page metadata, client/product name, source URL, and current RAG-friendly rules.
  • Output: cleaned markdown plus structured metadata.
  • Allowed operations: remove boilerplate, remove CTAs, remove navigation, restructure headings, convert tables to prose, make sections self-contained, add source subtype, flag ambiguity.
  • Forbidden operations: invent facts, summarize away concrete product details, change prices or claims, add unsupported claims, remove source URL.
  • Required explanation: changed_sections, removed_boilerplate_examples, quality_flags, and confidence.

Stage C: deterministic validation and audit.

After the LLM returns markdown, run deterministic checks again. These checks decide whether the page is publishable, needs review, or should be excluded. They should verify:

  • required frontmatter is still present
  • source_url still matches the original canonical URL
  • doc_id is unique within the corpus
  • the document has useful H1-H3 structure for the current chunker
  • pipe tables, sensitive markers, duplicate boilerplate, form/error pages, and very thin pages were not reintroduced
  • tombstones are created for removed pages before publish
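
A few of these checks can be sketched in plain Python. The required-field set, the flag names other than missing_h2 and pipe_table, and the simple key: value frontmatter parser are assumptions; a real implementation would likely use a YAML parser and cover the remaining audits.

```python
import re

# Assumption: the minimum frontmatter a publishable page must carry.
REQUIRED_FIELDS = ("title", "description", "language", "source_url", "doc_id")


def parse_frontmatter(markdown: str) -> tuple[dict, str]:
    """Parse a simple `key: value` frontmatter block; return (fields, body)."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", markdown, re.DOTALL)
    if not match:
        return {}, markdown
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields, match.group(2)


def validate_page(markdown: str, canonical_url: str, seen_doc_ids: set) -> list[str]:
    """Return quality flags; an empty list means the page passed these checks."""
    fields, body = parse_frontmatter(markdown)
    flags = [f"missing_{name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    if fields.get("source_url") and fields["source_url"] != canonical_url:
        flags.append("source_url_mismatch")
    doc_id = fields.get("doc_id")
    if doc_id:
        if doc_id in seen_doc_ids:
            flags.append("duplicate_doc_id")
        seen_doc_ids.add(doc_id)
    if len(re.findall(r"^# \S", body, re.MULTILINE)) != 1:
        flags.append("h1_count")
    if not re.search(r"^## \S", body, re.MULTILINE):
        flags.append("missing_h2")
    if re.search(r"^\|.*\|\s*$", body, re.MULTILINE):
        flags.append("pipe_table")
    return flags
```

Passing the same seen_doc_ids set across all pages of a corpus is what makes the uniqueness check corpus-wide.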

The cleaned markdown should optimize for the current chunker:

  • one specific H1
  • H2 sections for the main retrievable topics
  • H3 for meaningful subtopics and FAQ questions
  • H4+ only for inline structure, because they do not create chunk boundaries
  • headings that repeat the key entity, plan, product, region, or competitor
  • paragraphs that make sense when retrieved alone
  • no feature/pricing pipe tables when prose can preserve the meaning better

Example target shape:

---
title: "Acme pricing plans"
description: "Acme plan tiers, limits, and enterprise pricing policy."
language: en
source_url: https://www.acme.com/pricing
source_subtype: web_page
doc_id: web_8f4d9b2a_pricing
---

# Acme pricing plans

## Pricing - Starter plan

The Acme Starter plan includes ...

## Pricing - Enterprise plan

The Acme Enterprise plan includes SSO and custom security review. The Starter plan does not include SSO.

5. Standardize the markdown metadata contract

For the productized pipeline, keep source_url as the canonical URL field because the active loader and validator already recognize it.

Add fields without breaking the loader:

  • doc_id: stable unique document identifier
  • source_subtype: web_page, blog_post, case_study, pricing_page, competitive_battlecard, pdf, etc.
  • language
  • published_at when known
  • run_id
  • page_id
  • content_hash

The RAG-friendly docs should be updated to stop using url as the required field unless the loader is migrated at the same time.

Document IDs must be unique across a tenant's entire corpus. The current filesystem loader derives IDs from filename stems, so the corpus should either:

  • use a new GCS loader that reads doc_id from manifest.json, or
  • emit filenames prefixed by a stable page hash, for example web_8f4d9b2a_pricing.md.

The GCS loader is preferred because it gives us an explicit manifest and avoids baking loader semantics into filenames.
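
Either option needs a stable way to derive the identifier. One possible scheme, sketched below, hashes the canonical URL and appends a readable slug. The make_doc_id helper, the SHA-256 choice, and the 8-character truncation are assumptions; a truncated hash has a small collision risk, which is one reason to keep a corpus-level uniqueness check.

```python
import hashlib
import re
from urllib.parse import urlsplit


def make_doc_id(canonical_url: str, prefix: str = "web") -> str:
    """Build `<prefix>_<8 hex chars>_<slug>` from the canonical URL."""
    digest = hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()[:8]
    # The slug comes from the last path segment and is only for readability.
    last_segment = urlsplit(canonical_url).path.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"[^a-z0-9]+", "-", last_segment.lower()).strip("-") or "index"
    return f"{prefix}_{digest}_{slug}"
```

Two pages such as /eu/pricing and /us/pricing share the slug but get different hash segments, so their IDs no longer collide the way filename stems do.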

6. Publish through the existing document-loader job

The scraping job stops at "approved corpus version." It does not call LightRAG directly.

Publishing does this:

  1. Materialize corpora/v{version}/pages/*.md.
  2. Add tombstone files for pages that were previously published but are now removed or excluded.
  3. Trigger the existing document-loader Cloud Run Job with TENANT_ID={domain} and SOURCE_DIR=gs://rose-scraping/domains/{domain}/corpora/v{version}.
  4. Record the document-loader execution ID and final result on scraping.corpus_versions.
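
Steps 1 and 2 amount to a set difference between the previously published doc IDs and the newly approved ones. The sketch below plans the file set in memory; plan_publish is a hypothetical name, and a tombstone body consisting only of the <!-- nullified --> marker is an assumption about what the loader accepts.

```python
NULLIFIED_MARKER = "<!-- nullified -->"


def plan_publish(previous_doc_ids: set, approved_pages: dict) -> dict:
    """Return {relative_path: content} for a new corpus version.

    approved_pages maps doc_id -> cleaned markdown for the approved pages.
    """
    files = {
        f"pages/{doc_id}.md": body for doc_id, body in approved_pages.items()
    }
    # Every doc that was published before but is no longer approved gets a
    # tombstone so its old chunks are removed instead of lingering.
    for doc_id in sorted(previous_doc_ids - set(approved_pages)):
        files[f"tombstones/{doc_id}.md"] = NULLIFIED_MARKER + "\n"
    return files
```

Computing the plan before writing anything also gives the backoffice a diff to show at approval time.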

This requires adding a GcsDocumentLoader to DocumentLoaderFactory. It should read manifest.json, load only approved page files and tombstones, and set each Document.doc_id from the manifest.

This preserves the current ingestion safety model:

  • LightRAG writes remain centralized.
  • Content hashes still drive incremental updates.
  • Existing tenant isolation and document-loader watchdogs still apply.
  • Tombstones delete removed content instead of leaving stale chunks behind.

7. Add a backoffice review surface

Add a staff-only page under Knowledge, for example Knowledge -> Website Corpus.

The page should support:

  • start a new scrape run from a target domain
  • see pipeline status and errors
  • inspect mapped URLs in a tree/table grouped by section
  • see included, excluded, needs-review, failed, and manually added pages
  • open raw vs cleaned markdown preview for a page
  • edit include/exclude decisions and persist overrides
  • add a URL that mapping missed
  • remove a mapped page before publish
  • re-run scrape/clean for selected pages
  • approve a corpus version
  • publish a corpus version into the document-loader
  • see Firecrawl monitor status and changed-page events

The UI should surface "what was not scraped" as first-class information, not only what made it into the corpus. Excluded pages need reasons because future operators need to understand whether a page was intentionally excluded or accidentally missed.

The first implementation should be staff-only. Client-facing approval can come later once the output vocabulary, failure states, and diff previews are proven internally.

8. Keep special pages explicit

Some pages should not go through the normal crawl path:

  • pricing pages with URL parameters
  • comparison tables behind toggles
  • calculators
  • pages where audience filters change the plan/offer matrix
  • JS-heavy pages where the markdown loses the actual content

These pages should be flagged as needs_variant_review. The productized pipeline can later add a variant mode using Firecrawl actions/interactions or a browser worker, but Phase 1 should not pretend a single static scrape captures all variants.

9. Safety and operations

  • Store Firecrawl keys in Secret Manager or the job runtime secret system, never in the browser.
  • Keep the GCS bucket private. Backoffice previews should use short-lived signed URLs or backend-rendered content.
  • Add per-domain run locks so two scrape jobs do not publish competing corpus versions.
  • Add a global concurrency cap for Firecrawl and LLM calls.
  • Store LLM prompts, model, temperature, and output JSON with each run for traceability.
  • Reject URLs outside the selected domain or explicit allowed-host list.
  • Respect site policies and rate limits. The old local scraper used stealth behavior; the productized system should use a crawler vendor and explicit rate controls, not hidden browser mimicry as the default posture.
  • Treat monitor events as change signals, not as automatic knowledge writes.
  • Use Cloud Tasks only for small idempotent chunks such as monitor webhooks or per-page re-clean jobs; the full-site run belongs in a Cloud Run Job.
  • Keep website corpus ingestion separate from config.knowledge_faqs and geo.faqs.
  • Add Slack/Sentry notifications for failed runs and failed publishes.

Consequences

Positive

  • The onboarding scrape becomes repeatable, reviewable, and runnable from backoffice instead of a local agent session.
  • Operators can see what was mapped, skipped, scraped, cleaned, failed, approved, and published.
  • GCS gives us immutable corpus versions, raw/clean diffability, and a clean replacement for Google Drive as source of truth.
  • The existing document-loader remains the only LightRAG writer, reducing blast radius.
  • LLM cleanup can optimize content for actual retrieval behavior instead of only prettifying markdown.
  • Firecrawl monitors give us a path to freshness without building our own crawler scheduler.

Negative

  • Firecrawl becomes a vendor dependency for onboarding and freshness.
  • We add new backend/job/UI surface area instead of a single local skill.
  • LLM cleaning can remove useful detail or rewrite meaning unless prompts, diff previews, and quality gates are strict.
  • A GCS document loader and corpus manifest contract must be built before publish is fully automated.
  • Firecrawl monitor events may be noisy; humans still need review in Phase 1.

Neutral

  • Google Drive becomes optional export/share, not the system of record.
  • The old Playwright scraper can remain a fallback for sites Firecrawl cannot handle, but it should not be the default product path.
  • Headless coding agents remain useful for internal experimentation, but the production workflow should expose explicit states and artifacts.

Alternatives Considered

1. Keep the local skill workflow

Rejected. The local workflow works, but it cannot provide durable state, UI review, artifact versioning, or runs started from backoffice without a local agent session.

2. Port the existing Playwright scraper to Cloud Run

Rejected as the default. It preserves too much crawler maintenance burden and still leaves us building mapping, pre-cleaning, retries, and monitoring ourselves. Keep it as a fallback for edge cases.

3. Use Firecrawl output directly with no Rose cleanup

Rejected. onlyMainContent and onlyCleanContent are useful pre-cleaning tools, but they do not know Rose's chunker, LightRAG retrieval failure modes, competitive attribution rules, tombstone behavior, or frontmatter contract.

4. Store markdown bodies in Supabase

Rejected. Supabase should store state and metadata. Large raw/clean artifacts, versioned corpora, and manifests belong in object storage.

5. Let Firecrawl monitor changes auto-ingest into LightRAG

Rejected for Phase 1. Auto-ingestion risks publishing noisy or incorrect changes without review. Monitor events should create draft delta runs first.

6. Build a full agent framework

Rejected for now. The workflow is narrow enough for a typed state machine plus bounded LLM calls. A full agent framework would add complexity before we know we need it.

Open Questions

  • Section 7 keeps the first UI staff-only. When, if ever, should it be exposed to clients with a restricted approval flow?
  • What is the default blog/news recency window for productized onboarding: 12, 18, or 24 months?
  • Which subdomains should be included automatically, if any?
  • What Firecrawl monitor schedule balances freshness and cost?
  • Do we need corpus-level evaluation before publish, for example a fixed set of Website Agent product-discovery questions?
  • Should source_subtype influence retrieval weighting later, or only metadata and audit reporting?

References