February 2026 - R&D Journal: Runtime Configuration, Interaction Routing, and Latency
Context
January left four open research questions: whether Rose should specialize through routed agents or composable skills, how to reduce first-response latency, how to keep prompts smaller without losing context, and how to validate behavior systematically.
February tested those questions through the first large wave of runtime-configurable behavior: AI Sections, form-assistant/copilot interaction modes, early agentic-system routing, fallback model behavior, and evaluation tooling. The customer-facing changelog for February documents the product outputs, but those outputs are treated here as context, not as proof of R&D eligibility.
The month advances three of the validated 2026 R&D projects: Adaptive Multi-Tenant Conversation Orchestration, Compounding Context Engine for Company, Industry, and Buyer Intelligence, and Closed-Loop Agent Evaluation and Optimization.
Adaptive Multi-Tenant Conversation Orchestration
Project and lock
The objective was to reduce response latency and prompt complexity while letting Rose route visitor actions across multiple interaction surfaces: chat, AI Sections, form assistant, inline calls-to-action, booking, and redirects. The lock was not a matter of building more UI controls. The unresolved problem was how to preserve correct agent behavior when the same visitor intent arrives from different surfaces with different context payloads and timing constraints. January's literature review had established that black-box language-model APIs rule out training-free internal-state routing techniques, because those techniques need hidden-layer activations that Rose cannot reach. February therefore continued down the only open routes, external routing and prompt/context reduction, through explicit system routing, per-surface skill suppression, and runtime context fields.
This month's work
The hypothesis was that context-aware interaction routing can support new surfaces without adding a second language-model routing delay or loading every behavior into every prompt. The graph architecture was simplified so that an explicit system-routing step runs early, before behavior-specific skills load. A form-assistant interaction mode was added as a system-level skill that suppresses follow-ups and the demo-offer and response-ending skills while a visitor is completing a form, keeping the agent on-task for that surface. Interaction surfaces were made explicit in routing state: an AI Section button now carries a context field identifying the action that triggered it. On latency, a streaming fallback and a change of default model protected first-response behavior when the primary path degraded, and time-to-first-token was targeted through multi-tenant instance caching and parallel retrieval (RAG) queries. The expanded/copilot layout toggle is the productization surface of the same multi-surface interaction model.
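As a rough illustration of surface-aware routing, the sketch below keeps the interaction surface and the triggering action in an explicit routing state and picks the active skills deterministically, before any prompt is built. It is a minimal sketch under stated assumptions, not Rose's implementation: the `Surface` values, the skill names, the `RoutingState` fields, and the `route` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Surface(Enum):
    CHAT = "chat"
    AI_SECTION = "ai_section"
    FORM_ASSISTANT = "form_assistant"


# Hypothetical skill names, for illustration only.
ALL_SKILLS = ("answer", "demo_offer", "response_ending", "follow_ups")

# Skills switched off per surface. Only the form-assistant entry mirrors the
# suppression described in the text; other surfaces keep every skill here.
SUPPRESSED = {
    Surface.FORM_ASSISTANT: {"demo_offer", "response_ending", "follow_ups"},
}


@dataclass(frozen=True)
class RoutingState:
    surface: Surface
    message: str
    # Set when an AI Section button fired; names the action that triggered it.
    trigger_action: Optional[str] = None


def route(state: RoutingState) -> list[str]:
    """Deterministic system-routing step: no extra language-model call is made."""
    blocked = SUPPRESSED.get(state.surface, set())
    return [skill for skill in ALL_SKILLS if skill not in blocked]


form_turn = RoutingState(Surface.FORM_ASSISTANT, "What goes in this field?")
print(route(form_turn))  # ['answer']
```

Because the suppression table is plain data, adding a surface in this sketch means adding one entry instead of enlarging every prompt.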
Results, proof, and next step
The positive learning was that the interaction surface must be an explicit part of routing state: AI Section buttons, search bars, form-assistant turns, and ordinary chat cannot be flattened into one generic message path without losing context. The negative learning was that UI mode alone is insufficient. Form-assistant mode required explicit suppression of demo-offer and response-ending skills, which shows that behavioral routing must happen below the visual layer. February produced qualitative latency improvements, but product-grade time-to-first-token observability was not yet in place, so the claims stay conservative until the later latency metrics land. Next step: define, in the following month, a formal architecture boundary between model routing and host-page routing, because the February fixes showed that routing decisions were spread across language-model calls, widget state, and host-page lifecycle.
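For readers unfamiliar with the metric, time-to-first-token can be captured by timing the gap between starting to consume a response stream and receiving its first token. The wrapper below is a hedged sketch of that idea only. `TimedStream` and its `clock` parameter are hypothetical names, and nothing here describes the observability that was later built.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class TimedStream:
    """Wraps a token stream and records time-to-first-token (TTFT) in seconds."""

    def __init__(
        self,
        tokens: Iterable[str],
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._tokens = tokens
        self._clock = clock
        self.ttft: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        # The timer starts when the consumer begins iterating. With a lazy
        # generator as the source, that is also when the request is issued.
        start = self._clock()
        for token in self._tokens:
            if self.ttft is None:
                self.ttft = self._clock() - start
            yield token


stream = TimedStream(iter(["Hel", "lo"]))
text = "".join(stream)
print(text, stream.ttft is not None)
```

Injecting the clock keeps the measurement testable without real delays, which matters if the value is later used as an evaluation signal.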
Closed-Loop Agent Evaluation and Optimization
Project and lock
The objective was to create repeatable ways to judge whether prompt and routing changes improve behavior, rather than relying on anecdotal client feedback. Language-model agent changes are hard to validate because success depends at once on relevance, format, routing, safety, and conversion behavior; February's lock was the absence of a structured evaluation path that respected the configured agentic system.
This month's work
The hypothesis was that routing and answer-quality changes can be made safer if evaluation output includes the routing decision and separates answer correctness from formatting quality. A command-line evaluation tool and a dataset-creation workflow were built to make those judgments repeatable. The evaluator was made to respect the configured agentic system and to surface the routing decision alongside the answer, so a wrong answer caused by the wrong route is visible rather than hidden. Judge scoring was then split, with answer correctness separated from format quality, and the scoring formula was refined after format regressions proved hard to detect under a single blended score. Analytics events for AI Sections were added, distinguishing button interactions from search-bar interactions, to feed the evaluation with the interaction context it needs.
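The shape of a routing-aware evaluation record can be sketched as follows. This is an illustrative assumption, not the evaluator's actual schema or scoring formula: the `EvalResult` fields, the `diagnose` labels, the route names, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    expected_route: str
    actual_route: str
    correctness: float  # judge score in [0, 1] for answer content
    format_quality: float  # judge score in [0, 1] for formatting

    @property
    def route_ok(self) -> bool:
        return self.expected_route == self.actual_route


def diagnose(result: EvalResult, threshold: float = 0.7) -> str:
    """Name the first failing layer so a routing error is not reported as a bad answer."""
    if not result.route_ok:
        return "wrong_route"
    if result.correctness < threshold:
        return "wrong_answer"
    if result.format_quality < threshold:
        return "format_regression"
    return "pass"


misrouted = EvalResult("case-1", "form_assistant", "chat", 0.2, 0.9)
print(diagnose(misrouted))  # wrong_route
```

Checking the route first is the design choice that matters here: a low correctness score on a misrouted case is attributed to routing, not to the answering prompt.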
Results, proof, and next step
The first learning was that evaluation must include routing metadata: if a bad answer comes from the wrong route, scoring only the final text hides the cause. The second was that format quality needed separate scoring, because prompt changes that improve content while damaging formatting are otherwise hard to detect. February did not yet establish statistically reliable production outcome measurement — it created evaluation scaffolding rather than final proof. Next step: run the harness against larger datasets and carry its routing-aware, format-separated scoring into the production-rollout gate.
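A small worked example shows why a single blended score can hide a format regression. The numbers are invented for illustration and are not February results. The equal-weight blend, the `summarize` and `regressions` helpers, and the 0.05 tolerance are assumptions, not the refined scoring formula mentioned above.

```python
from statistics import mean


def summarize(scores):
    """scores: list of (correctness, format_quality) pairs from one evaluation run."""
    correctness = mean(c for c, _ in scores)
    format_quality = mean(f for _, f in scores)
    return {
        "correctness": correctness,
        "format_quality": format_quality,
        "blended": (correctness + format_quality) / 2,  # assumed equal weighting
    }


def regressions(baseline, candidate, tolerance=0.05):
    """Return the dimensions where the candidate run dropped by more than tolerance."""
    return [key for key in baseline if baseline[key] - candidate[key] > tolerance]


# Invented scores: the candidate prompt improves content but damages formatting.
baseline = summarize([(0.5, 1.0), (0.5, 1.0)])
candidate = summarize([(1.0, 0.5), (1.0, 0.5)])

print(baseline["blended"], candidate["blended"])  # 0.75 0.75
print(regressions(baseline, candidate))  # ['format_quality']
```

The blended score is identical for both runs, so only the separated format dimension reveals the drop.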
Non-R&D / Productization Context
The February customer changelog includes visible improvements such as AI Sections management, form assistant mode, self-service settings, copilot layout, CSP documentation, and widget bug fixes. These are important product outputs but are not automatically R&D. They are retained here only when they support an unresolved technical lock around routing, context propagation, latency, or evaluation.
Not retained as R&D:
- Logo/branding updates.
- Desktop focus, scrollbar, and layout polish.
- Routine dependency bumps and CI/release notification changes.
- Client-specific prompt/skill tuning unless it exposed reusable routing or evaluation learning.
- HubSpot integration skeleton work in February, which was primarily standard integration/product foundation.
Research Outcome
February advanced two R&D lines from January: interaction routing under black-box LLM constraints and evaluation scaffolding for agent behavior. The key learning was that Rose needed explicit routing state for interaction source, not just a larger prompt or more UI flags.