May 2026 - R&D Journal: Measurement Architecture, Knowledge Publishing, and Factory Validation

Context

This entry covers May 1-31, 2026. May was a platform-integration month: Rose moved from isolated feature experiments toward reusable systems for measuring agent impact, publishing generated knowledge, validating agent-produced work, and enforcing live conversation contracts.

The month advances the four validated 2026 R&D projects: Closed-Loop Agent Evaluation and Optimization; the Compounding Context Engine for Company, Industry, and Buyer Intelligence; Synthesized Knowledge Publishing for Generative-Engine Discoverability; and Adaptive Multi-Tenant Conversation Orchestration. The publishing project is split out from the context work this month: generated pages published outward to search engines are a different concern from the knowledge fed inward to the agent, even though both look like "generated FAQs". Customer-facing releases in May provide context, but they are not treated as R&D evidence by themselves. Styling, dashboard UI, individual client onboarding, CRM polish, and dependency or security maintenance are separated below as productization or non-R&D.

Closed-Loop Agent Evaluation and Optimization

This project has two faces that share one goal: closing the loop between what the agent does and what the team can prove about it. One face is measurement — making behavior and business impact observable with stable definitions and clean experiments. The other is the factory — making the agent-assisted work that feeds the loop (onboarding, repair, evaluation) repeatable and verifiable rather than manual and one-off. They are reported together because neither closes the loop alone: measurement without repeatable production has nothing trustworthy to measure, and a production line without measurement cannot tell whether its output improved.

Project and lock

A conversion in Rose is not a single observable event. The same outcome can be attributed to a chat exchange, a form the client hosts, a third-party booking provider, the destination a call-to-action points at, or a customer-relationship sync that lands hours later. When the definition of "conversion" is re-expressed independently in each reporting surface, the surfaces disagree, and no single number can be defended. The lock is sharpened by the experiment cohort: to claim an agent change caused a business effect, the exposed population must be separated cleanly from the unexposed one, and in practice that separation is polluted by internal team traffic, by visitors already exposed before the experiment began, by pages that were never eligible, and by tracking fixes shipped mid-flight.

The factory side carries a second lock. Setting up or repairing an agent for a client touches many steps — scraping and cleaning source content, editing prompts and skills, generating evaluation seeds, writing to a database, standing up a preview environment, and passing human review. The unresolved question was how to let an automated agent perform that chain without three hazards: writing unsafely to a database that is shared across environments, producing changes that exist only on a local machine and cannot be reviewed, and asserting quality that nothing actually verified. The shared-database hazard is structural — because the production database backs every environment, an environment label such as "test" or "staging" protects nothing; a write reaches real rows regardless. Entering May, the prior month's tracking and knowledge work had supplied the raw signals and increased the volume of customer-specific repair, which exposed the real uncertainty: whether the measurement layer and the production line could both be made stable enough to host experiments, rather than each absorbing another round of one-off fixes.

This month's work

On the measurement side, the hypothesis was that a layered database analytics architecture, paired with explicit cohort hygiene, can hold metric definitions stable and make repeatable causal measurement possible. The architectural decision recorded this month was to stop computing metrics ad hoc at the point of display and instead build a deliberate pipeline: raw events normalized in a staging layer, aggregated once into analytics marts at a fixed daily grain, refreshed in a defined order so no mart reads a half-built dependency, and checked against invariants that fail loudly when a definition drifts. The intent is a single semantic definition of each metric that every surface inherits. On top of that substrate the team built the machinery an experiment needs to be trustworthy rather than merely visible: assigning visitors to cohorts, recording exposure, computing the sample size needed to detect an effect, and detecting sample-ratio mismatch — the early warning that a split is broken and its result cannot be trusted. Cohort hygiene was made an explicit input rather than an assumption, with a counter for each exclusion reason so a clean cohort can be audited rather than asserted. The measurement methodology itself was written down, distinguishing claims a controlled A/B test can support from directional evidence that can only suggest.
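The experiment machinery described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the team's implementation: the function names, the SHA-256 bucketing, the two-proportion sample-size formula, and the 0.001 alarm level in the usage note are all hypothetical choices made for the example.

```python
import hashlib
import math
from statistics import NormalDist


def assign_cohort(visitor_id: str, experiment: str, control_share: float = 0.5) -> str:
    """Deterministically map a visitor to a cohort (hypothetical hashing scheme)."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "control" if bucket < control_share else "treatment"


def sample_size_per_arm(baseline: float, lift_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed per arm for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + lift_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def srm_p_value(control: int, treatment: int, expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = control + treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

In this sketch a split of 5,000 against 5,400 visitors on an intended 50/50 allocation yields a p-value far below 0.001, which would flag the split as broken before any conversion result is read. The baseline rate and detectable lift passed to `sample_size_per_arm` have to come from the marts; the journal does not state them.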

On the factory side, the hypothesis was that isolating each unit of work, giving it its own throwaway database branch, exposing an explicit preview environment, and gating it behind evaluation thresholds and watchdogs can make agent-produced work safer and more repeatable than manual handoffs. The decision recorded this month defines the factory's runtime and its isolation guarantees: each task runs in its own checkout so concurrent work cannot collide, against its own database branch so writes cannot touch shared production data, with a preview environment per change so a human reviews something real rather than a description. Around that, the month tightened the end-to-end evaluation thresholds a change must clear, made the synthetic evaluation seeds idempotent so re-running does not corrupt them, added watchdogs that fail a build when the agent loader does not come up, and separated credentials per tool so one leaked key has a bounded blast radius.
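One way to read "idempotent evaluation seeds" is that each seed carries a stable key and is upserted rather than appended, so a second run leaves the table unchanged. The sketch below assumes that reading. The `eval_seeds` table, its columns, and the in-memory SQLite database (standing in for a per-task database branch) are hypothetical.

```python
import sqlite3


def seed_evaluations(conn: sqlite3.Connection, seeds: list) -> int:
    """Upsert (seed_key, question, expected) rows; return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_seeds ("
        "seed_key TEXT PRIMARY KEY, question TEXT NOT NULL, expected TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO eval_seeds (seed_key, question, expected) "
            "VALUES (?, ?, ?)",
            seeds,
        )
    return conn.execute("SELECT COUNT(*) FROM eval_seeds").fetchone()[0]


# Hypothetical seeds; the in-memory database is a stand-in for a throwaway branch.
branch = sqlite3.connect(":memory:")
seeds = [
    ("pricing-01", "Do you offer annual billing?", "yes"),
    ("regions-01", "Is Brazil supported?", "no"),
]
first_run = seed_evaluations(branch, seeds)
second_run = seed_evaluations(branch, seeds)  # re-running must not duplicate rows
```

Keying on a deterministic `seed_key` rather than an auto-increment id is what makes the re-run safe in this sketch.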

Results, proof, and next step

The measurement learnings were that metric definitions need a semantic layer — when each dashboard owns its own query, conversion and engagement quietly diverge — and that experiments need cohort hygiene before they need better charts, because prior exposure, internal traffic, page eligibility, and mid-flight fixes change the result and must be modeled as inputs. The factory learnings were that the per-task database branch is the load-bearing safety boundary, precisely because the production database is shared across environments and a label alone protects nothing, and that factory work needs evidence gates rather than just generated code, because evaluation thresholds and loader watchdogs catch failures a human skimming a diff will miss.
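Treating cohort hygiene as an input can be sketched as a filter that keeps a counter per exclusion reason. The example is illustrative only: the `Visit` fields, the reason labels, and the order in which the rules are checked are assumptions, and mid-flight tracking fixes are left out because the journal does not say how they are modeled.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Visit:
    visitor_id: str
    is_internal: bool      # traffic from the team itself
    first_seen: datetime   # first exposure to the agent
    page_eligible: bool    # page was in scope for the experiment


def clean_cohort(visits, experiment_start: datetime):
    """Return (kept visits, counter of exclusions keyed by reason)."""
    kept, excluded = [], Counter()
    for visit in visits:
        if visit.is_internal:
            excluded["internal_traffic"] += 1
        elif visit.first_seen < experiment_start:
            excluded["prior_exposure"] += 1
        elif not visit.page_eligible:
            excluded["ineligible_page"] += 1
        else:
            kept.append(visit)
    return kept, excluded
```

Because every dropped visit increments exactly one named counter, the kept count plus the counter totals always equals the input size, which is the property that lets a cohort be audited rather than asserted.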

The shared negative learning constrains both faces: a working surface is not evidence. Comparing engaged against non-engaged visitors, or reading a lift off a dashboard, does not establish causation; and a one-off onboarding is weak evidence of R&D unless it exercises and improves the reusable factory, scrape, or evaluation mechanisms. The most honest measure of where the project stood at the end of May is the experiment log itself: no controlled experiment was running. The substrate was used to design the first one — a held-back control in which the widget is deliberately not shown against a treatment in which it is, with form conversion as the primary metric — but that experiment remained in draft and was not launched during May. There is therefore no cohort, no exposure split, and no significance to report. May built the means to run a causal experiment and specified the first experiment to run; it did not yet run it. That result is honest, not disappointing: the project's claim this month is the substrate and the experiment design, not a measured lift.

The evidence is an architecture decision record naming the metric-drift, refresh-ordering, and raw-scan locks and proposing the layered marts and invariants; a second architecture decision record defining the factory's isolation, preview, review, and safety tradeoffs; a measurement-methodology document separating rigorous A/B claims from directional evidence; and the implementation history of the marts, the experiment machinery, the preview branches, the evaluation thresholds, and the loader watchdogs. The month's customer-facing changelog entries are product context only.

Next step: launch the designed widget-display experiment and run it to a cohort, comparing metric stability and causal-read quality against the prior dashboard-query approach; and add a control plane for delegated factory work, measuring whether tasks complete with fewer lost handoffs and fewer repeated defects.

Compounding Context Engine for Company, Industry, and Buyer Intelligence

This project covers the knowledge and state that flow into the agent at answer time. It is reported separately from the generated pages the platform publishes outward to search engines — those look like the same "generated knowledge" but move in the opposite direction and are tracked under Synthesized Knowledge Publishing below.

Project and lock

May brought several inbound knowledge forms into contact, each failing differently. Chat retrieval documents are pulled into a model's context at answer time. Tenant-maintained lookup tables encode authoritative "supported or not supported" verdicts. Long website pages must be split before they can be retrieved. Enrichment-backed visitor profiles carry inferred company and identity data. The lock was not storing more facts; it was keeping each fact source canonical, scoped to the runtime that consumes it, and safe to use downstream. The sharpest expression of the lock is the retrieval-side boundary: chat knowledge must stay scoped to retrieval and must never be contaminated by content that exists only to be published. The prior month had established the chat-retrieval store and named this boundary; May extended the same context project into lookup tables, document splitting, and enrichment-backed state, all on the inbound side.

This month's work

The hypothesis tested in May was that inbound knowledge compounds safely only if each surface — retrieval knowledge, lookup verdicts, and visitor intelligence — has explicit ownership and a validation boundary that keeps it out of the surfaces it does not belong to.

The team replaced free-text instructions to the model with structured lookup tables for long enumerations such as supported regions, plans, or integrations. The reasoning is that a model reading a long list as prose produces inconsistent verdicts, whereas a matcher over a structured table produces a canonical verdict plus a qualification signal downstream logic can rely on. The matching combines an exact deterministic tier with a semantic tier for paraphrase, and latches a signal once observed. The month also added heading-aware splitting of long knowledge-base pages, so a retrieved fragment still carries the context of the section it came from, and connected visitor and account enrichment to qualification and customer-relationship context so an identified visitor's inferred company can inform the conversation.
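Heading-aware splitting can be illustrated with a short sketch. It assumes pages arrive as Markdown-style text with `#` headings, which the journal does not state. The `context` field, the `max_chars` limit, and the paragraph-boundary splitting are hypothetical choices for the example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(page: str, max_chars: int = 500) -> list:
    """Split a page into chunks that each carry their heading path as context."""
    chunks, path, buffer = [], [], []  # path holds (level, title) pairs

    def flush():
        text = "\n".join(buffer).strip()
        buffer.clear()
        if not text:
            return
        context = " > ".join(title for _, title in path)
        piece = ""
        # Oversized sections are split on paragraph boundaries; a single
        # paragraph longer than max_chars is kept whole in this sketch.
        for paragraph in text.split("\n\n"):
            if piece and len(piece) + len(paragraph) + 2 > max_chars:
                chunks.append({"context": context, "text": piece})
                piece = paragraph
            else:
                piece = f"{piece}\n\n{paragraph}" if piece else paragraph
        chunks.append({"context": context, "text": piece})

    for line in page.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks
```

A fragment under "Enterprise" inside "Pricing" is emitted with the context `Pricing > Enterprise`, so the retriever can still tell which section the text came from.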

Results, proof, and next step

The first positive learning is that a structured lookup table beats prompt text for long supported-or-not enumerations, because the matcher yields a canonical verdict and a reusable signal rather than a fresh guess each time. The second is that the retrieval-side boundary must be enforced structurally: chat knowledge stays scoped to retrieval, and publishable content is kept out of it. The negative learning is that semantic matching needs guards — without checks for who the statement is about and for ambiguity, an incidental mention of an unsupported item can be misread as evidence that the visitor qualifies. The remaining uncertainty is that May did not establish an answer-quality gain from the new inbound structures; that can only come from evaluation feedback over time.
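The two matching tiers, the guards, and the latch can be sketched as follows. This is a toy model, not the documented matcher: `difflib.SequenceMatcher` is swapped in as a stand-in for the semantic tier (the real tier presumably uses something closer to embeddings), the `about_visitor` flag is assumed to be supplied by upstream logic, and the threshold, margin, and class names are hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


@dataclass
class LookupTable:
    rows: Dict[str, bool]    # canonical item (lowercase) -> supported?
    threshold: float = 0.8   # minimum similarity for the fuzzy tier
    margin: float = 0.05     # required lead over the runner-up

    def match(self, mention: str) -> Tuple[Optional[str], str]:
        key = mention.strip().lower()
        if key in self.rows:  # tier 1: exact and deterministic
            return key, "exact"
        scored = sorted(
            ((SequenceMatcher(None, key, item).ratio(), item) for item in self.rows),
            reverse=True,
        )
        if not scored or scored[0][0] < self.threshold:
            return None, "no_match"
        if len(scored) > 1 and scored[0][0] - scored[1][0] < self.margin:
            return None, "ambiguous"  # ambiguity guard: refuse to guess
        return scored[0][1], "fuzzy"  # tier 2: stand-in for the semantic tier


@dataclass
class QualificationState:
    signals: Dict[str, bool] = field(default_factory=dict)

    def observe(self, table: LookupTable, mention: str, about_visitor: bool) -> Optional[bool]:
        if not about_visitor:  # subject guard: ignore incidental mentions
            return None
        item, _tier = table.match(mention)
        if item is None:
            return None
        verdict = table.rows[item]
        self.signals.setdefault(item, verdict)  # latch: the first observation sticks
        return verdict
```

The point of the sketch is the shape of the return values: the matcher either produces a canonical item or declines, and the state object records the verdict once so downstream logic does not re-derive it on every turn.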

The evidence is a technical document defining the lookup matching, qualification, signal latching, and ambiguity controls. The month's changelog entries are product context only.

Next step: connect lookup data, chunked knowledge, and scraped content to evaluation loops so retrieval quality is measured rather than assumed.

Synthesized Knowledge Publishing for Generative-Engine Discoverability

This project, split out from the context work this month, covers knowledge flowing the other way: generated and published outward so search and generative engines discover and cite it. It shares a boundary with the Compounding Context Engine — published content must never leak into chat retrieval — but its consumer is an external engine, not the agent.

Project and lock

The platform generates answer-grade FAQ pages for clients. The naive view is that the work is text generation; the lock is that a generated page has no value until a search or generative engine trusts, crawls, and cites it. That reframes the problem from generation to web hosting and canonicalization, a domain the generator does not control. A page only earns ranking if it is served from a domain the client owns, returns canonical URLs so duplicate variants do not compete with each other, exposes a sitemap and crawl directives so engines can discover it, and notifies engines when content changes. A second face of the lock is that this published content is structurally the twin of chat-retrieval knowledge (the distinction between geo.faqs and config.knowledge_faqs), and the two must never be merged, or published marketing content would contaminate the agent's answers.

This month's work

The hypothesis tested in May was that generated knowledge can be made machine-discoverable only if hosting, canonicalization, and crawlability are treated as first-class responsibilities distinct from generation, and only if the published store is schema-separated from chat knowledge.

The hosting and canonicalization contract for these pages was documented as its own responsibility: custom-subdomain hosting on a domain the client owns, canonical URL handling, sitemap and robots serving, TLS, and IndexNow submission to notify engines of changes. The generated content was normalized and clustered so that long-tail question pages are built on a consistent representation, with question embeddings and a two-tier clustering step grouping related queries, and rendered scope-aware and bot-visible so engines can read what visitors cannot. Throughout, the published schema was kept structurally separate from chat-retrieval knowledge.
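Two pieces of that contract, canonical URLs and sitemap serving, can be sketched with the standard library. The normalization rules here (force HTTPS, force the client-owned host, drop query and fragment, strip the trailing slash) are assumptions for illustration and not the documented contract; `faq.example.com` is a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def canonical_url(url: str, host: str) -> str:
    """Collapse duplicate variants of a page onto a single canonical URL."""
    path = urlsplit(url).path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, "", ""))


def build_sitemap(urls) -> str:
    """Render a sitemap document listing each canonical URL exactly once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


variants = [
    "http://faq.example.com/pricing/",
    "https://faq.example.com/pricing?utm_source=newsletter#top",
]
canonical = [canonical_url(u, "faq.example.com") for u in variants]
sitemap_xml = build_sitemap(canonical)
```

Both variants collapse to the same canonical URL, so the sitemap lists the page once and the variants do not compete with each other.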

Results, proof, and next step

The defining learning is that generated publishing is a hosting and canonicalization problem more than a generation one: ranking and citation depend on domain ownership, redirects, sitemap behavior, and crawlable structure, none of which the page generator controls. The second learning is that the publish-versus-retrieve boundary is not a nicety but a safety property — published content is for review, static export, and indexing, and must stay out of the agent's retrieval path. The remaining uncertainty is the honest one: May shipped the publishing machinery but established no search ranking lift, no crawl or citation measurement, and no indexing outcome. The instruments to publish were built; whether the pages get discovered and cited is unmeasured and can only come from later crawl and ranking observation.

The evidence is a technical document describing the hosting and canonicalization contract for generated pages. The month's changelog entries are product context only.

Next step: instrument crawl, indexing, and citation so discoverability is measured rather than inferred from the fact that a page was published.

Adaptive Multi-Tenant Conversation Orchestration

Project and lock

This project tries to keep a conversation genuinely flexible while still enforcing business rules that must hold deterministically — which visitors get qualified, where a call-to-action sends them, how attribution is preserved, and how fast the agent answers. The tension is the lock: the agent has to ask useful questions, personalize the offer, respect exclusions, keep attribution intact, and respond quickly, all without collapsing into a brittle decision tree that no longer feels like a conversation. The question May tested was how much of that rule-enforcement can move into deterministic runtime contracts while the language model keeps handling the open-ended natural dialogue. The prior month had introduced qualification after conversion and the sequencing between booking and the customer-relationship system; May extended the same line toward qualification that does not block, call-to-action construction, and per-offer attribution.

This month's work

The hypothesis tested in May was that a hybrid of deterministic qualification and call-to-action contracts on one side, and language-model conversation on the other, can reduce routing errors without taking away the agent's ability to answer naturally.

The central idea was to stop treating qualification as a single hard gate. Some facts about a visitor can be gathered opportunistically in the flow of conversation without blocking the offer, while a small number of genuinely required fields still need deterministic enforcement before a specific action is allowed. The call-to-action was reframed as a state contract rather than something the answering model improvises: its destination and the attribution carried with it are assembled from frontend events, known profile fields, and routing rules, including partial pre-fill, redirect after booking, and forwarding of the attribution parameters so credit is not lost. The month also resolved cases where the agent had to trust a signal coming from the frontend rather than re-deriving intent, and where it had to avoid hijacking a support request and turning it into a sales motion. When a visitor shows high intent, arbitration now routes them to the demo offer rather than lower-priority content suggestions. A faster answer path with tighter timeout and failover behavior was added alongside.
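The call-to-action contract can be sketched as a pure function over known state, so the answering model never composes the link itself. Everything named here is hypothetical: the destination table, the attribution keys, the pre-fill fields, and the `example.com` URL are placeholders chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical routing table: the only destinations a CTA may point at.
ALLOWED_DESTINATIONS = {"demo": "https://example.com/book-demo"}
# Hypothetical attribution parameters to forward from the landing URL.
ATTRIBUTION_KEYS = ("utm_source", "utm_medium", "utm_campaign")


def build_cta(offer: str, profile: dict, landing_query: str,
              prefill_fields=("email", "company")) -> str:
    """Assemble a CTA link from routing rules, profile fields, and attribution."""
    base = ALLOWED_DESTINATIONS[offer]  # unknown offers raise KeyError, never a guessed link
    params = {k: v for k, v in parse_qsl(landing_query) if k in ATTRIBUTION_KEYS}
    for name in prefill_fields:
        if profile.get(name):  # partial pre-fill: only fields that are known
            params[name] = profile[name]
    scheme, netloc, path, _query, _fragment = urlsplit(base)
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))


link = build_cta("demo", {"email": "a@b.co"}, "utm_source=newsletter&gclid=zzz")
```

In this sketch the attribution parameter survives into the booking link, the unknown `gclid` parameter is dropped, and only the known profile field is pre-filled.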

Results, proof, and next step

The first positive learning is that qualification should not be one hard gate: opportunistic collection and deterministic blocking are different mechanisms and should be built as such. The second is that call-to-action behavior is a state contract among frontend events, profile fields, attribution parameters, and routing logic — handing it to the answering model produces fabricated links and lost attribution. The negative learning is that product fixes kept surfacing edge cases where the model's read of intent, the frontend's state, and the configured business rule disagreed, which is itself the signal that these rules belong in deterministic contracts.
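The split between opportunistic collection and deterministic blocking can be sketched as two separate methods on one state object. The class, the field names, and the `book_demo` action are hypothetical; the sketch shows the shape of the distinction and is not the shipped flow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Qualification:
    # action -> fields that must be present before that action is allowed
    required: Dict[str, Tuple[str, ...]]
    collected: Dict[str, str] = field(default_factory=dict)

    def note(self, name: str, value: str) -> None:
        """Opportunistic collection: record a fact whenever it surfaces, never block."""
        self.collected.setdefault(name, value)

    def missing_for(self, action: str) -> List[str]:
        """Deterministic gate input: required fields not yet collected."""
        return [f for f in self.required.get(action, ()) if f not in self.collected]

    def may_perform(self, action: str) -> bool:
        return not self.missing_for(action)


state = Qualification(required={"book_demo": ("email", "company_size")})
state.note("industry", "logistics")   # gathered in passing, gates nothing
state.note("email", "a@b.co")
blocked_on = state.missing_for("book_demo")
```

Actions with no required fields are always allowed, which is what keeps the soft path from turning into a hard gate.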

The remaining uncertainty is the honest one: the new soft-qualification flow ran at far too low a volume in May to measure anything. Across the whole month the post-conversion qualification flow recorded only on the order of seventy starts and fewer than twenty completions, a sample from which no effect on conversion or on redundant-question reduction can be inferred. The month also ran no controlled measurement of the latency-versus-quality tradeoff. The work is promising and the contracts are in place, but the result is not yet significant, and the journal records it as such rather than reading a trend into single- and double-digit counts.
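A worked interval shows why counts of this size support no conclusion. The figures below, 18 completions out of 70 starts, are illustrative stand-ins consistent with "on the order of seventy" and "fewer than twenty"; they are not the recorded values, and the choice of a Wilson score interval is an assumption made for the example.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Illustrative stand-in counts, not the month's recorded figures.
low, high = wilson_interval(18, 70)
```

With these stand-in counts the 95% interval spans roughly 17% to 37%, about twenty percentage points wide, so any plausible effect on completion would sit well inside the noise.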

The evidence is the implementation history of the qualification, call-to-action routing, intent arbitration, booking, and answer-path changes, with the month's changelog entries serving as product context for when the behavior became visible.

Next step: let the post-qualification and call-to-action contracts accumulate enough volume to measure, then test whether they reduce wrong offers, redundant questions, and attribution loss without increasing latency or harming answer quality.

Non-R&D / Productization Context

May also included substantial delivery work that supports the platform but is not retained as R&D:

Published-page visual themes, style controls, settings reorganization, navigation polish, table filters, and exports.
Individual client onboarding, skill rewrites, pricing and content fixes, and support triage, unless they changed the reusable factory or evaluation process.
Standard customer-relationship integration polish, routine form-provider fixes, and dashboard presentation changes.
Dependency and security bumps, frontend cleanup, continuous-integration maintenance, and documentation-only updates.
Changelog writing and customer-facing guides, except where they document a technical constraint used as evidence above.

Research Outcome

May's strongest retained R&D result was the shift from dashboard-level reporting to a measurement architecture capable of hosting experiments and causal claims. The other retained results were the inbound lookup-and-chunking context work, the outbound hosting-and-canonicalization publishing machinery split out as its own project, the factory isolation-and-validation model, and the deterministic conversation contracts for qualification and call-to-action routing. Each is the enabling work that resolves its project's technical lock; what remains outstanding is the measurement of effect, not the eligibility of the work.

The main negative learning, shared across all four projects, was that a working product surface is not by itself evidence of research. May built the instruments, but the readings were not yet taken. The first causal experiment was designed but left in draft, so it still needs to be launched and run to a cohort. The post-qualification flow ran at too low a volume to measure, so the conversation contracts still need to accumulate volume. The published pages produced no crawl or ranking observations, and the factory produced no throughput or quality measurements; both remain to be collected.