June 2026 - R&D Journal: Experiment Marts, Productized Scraping, and Relay-Managed Agents
Context
This entry covers June 1-18, 2026. June continued May's platform research with three stronger R&D lines: experiment marts for causal measurement, productized website scraping for repeatable knowledge onboarding, and relay-managed agents for factory coordination.
Closed-Loop Agent Evaluation and Optimization - Experiment Marts and Observability
Project and lock
The goal was to make Rose experiments and language-model behavior measurable with repeatable metric definitions and product-specific telemetry. Rose experiments involve delayed, multi-event outcomes — prior exposure, eligible URLs, bounce handling, soft conversions, pricing-page reach, and client-specific conversion filters — that generic dashboards and ad hoc SQL cannot turn into reliable causal conclusions. May had introduced the layered analytics architecture and early A/B tooling; June tested the next layer: experiment marts, metric descriptors, Bayesian analysis, scheduled refreshes, and chat and language-model observability.
This month's work
The hypothesis was that precomputed experiment marts plus explicit metric descriptors and product-specific telemetry produce more reliable experimentation than ad hoc dashboard queries. The work materialized experiment results into precomputed marts on a scheduled refresh, with explicit metric descriptors covering experiment metrics, cohort gates, prior-exposure handling, eligible-URL filtering, Bayesian win probability, and per-client conversion-filter accuracy. Alongside the marts, chat and language-model telemetry was added and corrected — chat latency, per-call model metrics, reply counts, metric temporality, in-flight gauges, and worker telemetry initialization — and the product-specific observability approach was documented.
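As a minimal sketch of how an explicit metric descriptor and a Bayesian win probability could fit together, the example below assumes a binary conversion metric, a uniform Beta(1, 1) prior on each arm, and Monte Carlo sampling. The names `MetricDescriptor`, `ArmCounts`, and `win_probability`, and every field on them, are hypothetical and are not taken from the Rose codebase.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDescriptor:
    """Hypothetical descriptor: one explicit metric definition shared by all tools."""
    name: str
    conversion_event: str
    exclude_prior_exposure: bool = True
    eligible_url_prefixes: tuple[str, ...] = ()


@dataclass(frozen=True)
class ArmCounts:
    """Hypothetical precomputed mart row for one experiment arm."""
    visitors: int
    conversions: int


def win_probability(control: ArmCounts, variant: ArmCounts,
                    draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(variant rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate is Beta(1 + successes, 1 + failures).
        c = rng.betavariate(1 + control.conversions,
                            1 + control.visitors - control.conversions)
        v = rng.betavariate(1 + variant.conversions,
                            1 + variant.visitors - variant.conversions)
        wins += v > c
    return wins / draws


descriptor = MetricDescriptor(
    name="pricing_page_reach",
    conversion_event="pricing_page_view",
    eligible_url_prefixes=("/product", "/solutions"),
)
probability = win_probability(ArmCounts(1000, 50), ArmCounts(1000, 80))
print(f"{descriptor.name}: P(variant beats control) = {probability:.3f}")
```

Keeping the descriptor separate from the arm counts means the same definition can be applied by every refresh and every staff tool, which is the property the marts are meant to protect.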
Results, proof, and next step
The first learning was that experiment metrics must be materialized and described, or metric drift reappears across staff tools. The second was that chat observability must be scoped by active site, domain, and client — global reply counts are not enough for product decisions. The negative learning was that telemetry temporality and worker-initialization details can invalidate dashboards even when application behavior is correct. The infrastructure is ready, but business conclusions still require experiment runtime and traffic. Next step: use accumulated experiment data to validate specific product hypotheses and compare metric stability against the prior dashboard-query approach.
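The temporality point can be illustrated with a small sketch. It assumes a counter exported as cumulative readings that start from zero, and it treats any reading lower than the previous one as a reset (for example after a worker restart). The function name `cumulative_to_deltas` is hypothetical, and the reset rule is an assumption rather than the documented behavior of the telemetry stack used here.

```python
def cumulative_to_deltas(samples: list[int]) -> list[int]:
    """Convert cumulative counter readings into per-interval deltas.

    A reading lower than the previous one is treated as a counter reset,
    so the new reading itself is taken as the delta for that interval.
    """
    deltas: list[int] = []
    previous = 0  # assumption: the counter started at zero
    for value in samples:
        deltas.append(value if value < previous else value - previous)
        previous = value
    return deltas


readings = [5, 8, 12, 3, 7]  # the drop from 12 to 3 is a restart
deltas = cumulative_to_deltas(readings)
print(sum(deltas), "replies counted from deltas")
print(sum(readings), "replies if cumulative readings are summed as deltas")
```

A dashboard that sums cumulative readings as if they were deltas overcounts even though the application emitted every reply correctly, which is why temporality has to be declared explicitly on each metric.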
Compounding Context Engine for Company, Industry, and Buyer Intelligence - Productized Knowledge Intake
Project and lock
The goal was to reduce onboarding variance by converting arbitrary client websites into cleaner retrieval-ready knowledge and skill/evaluation inputs. Scrape quality varies by content-management system, duplicated layout blocks, hidden content, long-form pages, localized pages, and inconsistent metadata, and poor scrape quality propagates into retrieval and skill behavior. May had shown that factory workflows and outbound publishing need better content inputs; June addressed the input side through a productized scraping pipeline.
This month's work
The hypothesis was that a standardized scraping pipeline with content-management-system handling, duplicate-block detection, post-clean sweeps, and evaluation-seed generation can reduce manual knowledge-onboarding variance. The work built a standardized scraping pipeline that handles differing content-management systems, detects and drops duplicated layout blocks, applies post-clean sweeps, and filters long-form pages, turning arbitrary client sites into cleaner retrieval-ready units. Evaluation seeds were generated alongside onboarding so scrape quality could be checked, and answer-quality work connected skill composition to deterministic suggested-answer evaluations. The pipeline design was recorded in a decision record.
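One way to picture duplicate-block detection is the sketch below. It assumes each page has already been split into text blocks, fingerprints every block after whitespace and case normalization, and drops blocks that recur on more than a chosen share of pages. The function name `drop_layout_blocks` and the 0.5 threshold are illustrative assumptions, not the pipeline's actual rules.

```python
import hashlib
import re
from collections import Counter


def _fingerprint(block: str) -> str:
    """Hash a block after collapsing whitespace and case differences."""
    normalized = re.sub(r"\s+", " ", block).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_layout_blocks(pages: dict[str, list[str]],
                       max_page_share: float = 0.5) -> dict[str, list[str]]:
    """Remove blocks that appear on more than `max_page_share` of pages."""
    page_hits: Counter[str] = Counter()
    for blocks in pages.values():
        # Count each fingerprint once per page, not once per occurrence.
        page_hits.update({_fingerprint(b) for b in blocks})

    limit = max_page_share * len(pages)

    def is_layout(block: str) -> bool:
        hits = page_hits[_fingerprint(block)]
        # Require at least two pages so a single-page site keeps its content.
        return hits > 1 and hits > limit

    return {
        url: [b for b in blocks if not is_layout(b)]
        for url, blocks in pages.items()
    }


pages = {
    "/a": ["Home | Pricing | Contact", "Alpha explains onboarding."],
    "/b": ["Home |  Pricing | Contact", "Beta covers integrations."],
    "/c": ["home | pricing | contact", "Gamma lists pricing tiers."],
}
for url, blocks in drop_layout_blocks(pages).items():
    print(url, blocks)
```

Counting a fingerprint once per page keeps a block that repeats within one long page from being mistaken for site-wide layout.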
Results, proof, and next step
The first learning was that repeatable scraping is less about crawling more pages and more about cleaning page structure into retrieval-friendly atomic units. The second was that evaluation seeds need to be created alongside onboarding, or scrape quality is hard to validate. The remaining uncertainty is that the pipeline reduces variance but does not guarantee quality without review and evaluation feedback. Next step: feed scraped content into the evaluation loops so retrieval quality is measured rather than assumed.
Closed-Loop Agent Evaluation and Optimization - Relay-Managed Factory Work
Project and lock
The goal was to coordinate multi-step factory work through relay-managed agents so that agent tasks stay traceable across delegation, execution, review, and follow-up. Manual task tracking loses state when work is split across agents and steps; the unresolved problem was agent-work orchestration — who owns the task, what evidence proves completion, what remains blocked, and how follow-up is tracked.
This month's work
The hypothesis was that a relay-managed agent control plane can make factory work more observable and reduce lost follow-up compared with manual handoffs. The work built a relay-managed control plane for factory task coordination: a relay that records each task's owner, completion evidence, blocked state, and open follow-up items as work moves from delegation through execution and review to follow-up. The relay was packaged as a standalone installable tool. A session-reflection workflow and a configuration-CLI guard fix supported the same factory loop. The relay design was recorded in a decision record.
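The kind of state a relay has to hold can be sketched as a small task record with guarded stage transitions. Everything in this example is hypothetical: the `Stage` values, the `RelayTask` fields, and the rule that completion needs evidence and no open follow-ups are assumptions for illustration, not the relay's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Stage(Enum):
    DELEGATED = "delegated"
    EXECUTING = "executing"
    IN_REVIEW = "in_review"
    FOLLOW_UP = "follow_up"
    DONE = "done"


# Allowed transitions; review may send work back to execution.
_TRANSITIONS = {
    Stage.DELEGATED: {Stage.EXECUTING},
    Stage.EXECUTING: {Stage.IN_REVIEW},
    Stage.IN_REVIEW: {Stage.EXECUTING, Stage.FOLLOW_UP, Stage.DONE},
    Stage.FOLLOW_UP: {Stage.DONE},
    Stage.DONE: set(),
}


@dataclass
class RelayTask:
    task_id: str
    owner: str
    stage: Stage = Stage.DELEGATED
    evidence: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None
    follow_ups: list[str] = field(default_factory=list)

    def advance(self, to: Stage, owner: Optional[str] = None) -> None:
        """Move to the next stage, optionally handing ownership over."""
        if self.blocked_reason is not None:
            raise ValueError(f"{self.task_id} is blocked: {self.blocked_reason}")
        if to not in _TRANSITIONS[self.stage]:
            raise ValueError(f"cannot move {self.stage.value} -> {to.value}")
        if to is Stage.DONE and (not self.evidence or self.follow_ups):
            raise ValueError("completion needs evidence and no open follow-ups")
        self.stage = to
        if owner is not None:
            self.owner = owner


task = RelayTask("T-1", owner="planner")
task.advance(Stage.EXECUTING, owner="builder")
task.advance(Stage.IN_REVIEW, owner="reviewer")
task.evidence.append("test run log attached")
task.advance(Stage.DONE)
print(task.task_id, task.stage.value, task.owner)
```

Making the completion rule part of the transition, rather than of a prompt, is what lets the control plane refuse a handoff that has no proof attached.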
Results, proof, and next step
The learning was that factory agent work needs a control plane, not only better prompts. June established the relay capability but did not yet measure throughput, task-completion rate, or recurrence reduction. Next step: run the relay at volume and measure whether it reduces lost handoffs and repeated defects.
Compounding Context Engine for Company, Industry, and Buyer Intelligence - Enrichment-Backed Qualification
Project and lock
The goal was to use enrichment as evidence for qualification so Rose asks fewer redundant questions while preserving data quality. A qualification field can be answered by the visitor, inferred from conversation, or supplied by enrichment, and combining those sources is risky: enrichment may be incomplete, stale, or not applicable to a specific coverage question.
This month's work
The work let qualification fields be satisfied from enrichment data rather than always asking the visitor, with a runtime coverage lookup deciding whether a given field is actually covered and an explicit decline path when it is not. Visitor identity and profile surfaces were refreshed to expose the resolved data, and a public interface was added for post-conversion qualification.
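A minimal sketch of the coverage lookup and decline path follows. It assumes a precedence of visitor answer first, then covered enrichment, then an explicit decline, and it leaves out values inferred from conversation for brevity. The names `Resolution` and `resolve_field`, the `covered_fields` set, and the precedence order are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolution:
    name: str
    value: Optional[str]
    source: str  # "visitor", "enrichment", or "declined"


def resolve_field(name: str,
                  visitor_answers: dict[str, str],
                  enrichment: dict[str, str],
                  covered_fields: set[str]) -> Resolution:
    """Prefer the visitor's own answer, then covered enrichment, else decline."""
    if name in visitor_answers:
        return Resolution(name, visitor_answers[name], "visitor")
    value = enrichment.get(name)
    if name in covered_fields and value:
        return Resolution(name, value, "enrichment")
    # Explicit decline: the caller should ask the visitor instead of guessing.
    return Resolution(name, None, "declined")


covered = {"company_size"}
enrichment = {"company_size": "51-200", "budget": "unverified estimate"}
for field_name in ("company_size", "budget"):
    print(resolve_field(field_name, {}, enrichment, covered))
```

Here `budget` is declined even though enrichment holds a value for it, because the field is not marked as covered; that is the behavior that prevents false confidence.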
Results, proof, and next step
The positive learning was that enrichment can be treated as qualification evidence only when field coverage is explicit. The negative learning was that the system needs decline/unsupported behavior, or enrichment creates false confidence. The remaining uncertainty is impact: neither the reduction in redundant questions nor the effect on qualification completion has been measured yet. Next step: measure whether enrichment-backed qualification reduces redundant questions without lowering data quality.
Non-R&D / Productization Context
The June changelog includes visitor profiles, a Qualification tab, CSV exports, Portuguese support, GEO subdirectory hosting, manual conversion tracking, HubSpot sharing modes, and mobile/backoffice fixes. These count as product outputs or operational hardening unless they are tied to the R&D projects above.
Not retained as R&D:
- Client-specific chatbot triage fixes.
- Backoffice navigation and design-system polish.
- CSV export features.
- Routine HubSpot timestamp/retry fixes.
- Dependency/security bumps.
- Documentation-only updates.
Research Outcome
As of June 18, the strongest retained R&D work is Closed-Loop Agent Evaluation and Optimization: experiment marts and product-specific observability. Productized scraping belongs under the Compounding Context Engine for Company, Industry, and Buyer Intelligence, and relay-managed agents belong under Closed-Loop Agent Evaluation and Optimization; both remain partial until throughput and quality measurements are available. Enrichment-backed qualification is also partial pending evidence that it reduces friction without lowering data quality.