January 2026 - Literature Review: Multi-Agent Orchestration Strategies¶
Context¶
Following the successful proof-of-concept of the intent router and multi-agent architecture in December 2025, January's R&D efforts focused on a comprehensive literature review to inform the next phase of development. The goal was to identify optimal orchestration strategies that balance latency, security, and computational efficiency.
The December 2025 implementation validated the core hypothesis that intent-based routing to specialized agents is architecturally feasible. However, several technical uncertainties remained regarding the optimal approach for production-scale deployment:
- How to minimize Time to First Token (TTFT) while maintaining routing intelligence?
- How to secure the system against indirect prompt injection in retrieved documents?
- How to achieve streaming concurrency without sacrificing response quality?
Active clients during this period included AB Tasty, Pennylane, Skaleet, Skello, and the Rose website.
Technical Challenge: Orchestration Strategy Selection¶
The core technical uncertainty addressed by this literature review concerns the selection of an optimal orchestration architecture for multi-agent LLM systems. The challenge involves balancing multiple competing objectives:
Time to First Token (TTFT): The latency before the first response token is generated. This metric is critical for user experience, as perceived responsiveness depends heavily on initial response time rather than total generation time.
Streaming Concurrency: The ability to perform useful work (tool calls, retrieval, reasoning) while the LLM is generating tokens, rather than waiting for complete responses.
Security: Protection against prompt injection attacks, particularly indirect injection through retrieved documents or tool outputs that may contain malicious instructions.
Computational Efficiency: Token usage and LLM call overhead, which directly impact operational costs and scalability.
Response Quality: The amount of context provided to the LLM directly impacts response quality. Research demonstrates that:
- Reduced context leads to fewer hallucinations: When LLMs process less irrelevant information, they are less likely to generate confabulated content mixing unrelated concepts.
- Smaller prompts improve instruction adherence: LLMs demonstrate reduced instruction-following accuracy when presented with long, complex prompts containing multiple conditional branches. Focused, task-specific prompts yield better compliance with stated requirements.
- Cognitive load parallel: Similar to human cognition, LLMs perform better when given clear, focused instructions rather than comprehensive but overwhelming context.
Scientific Question: Given the tradeoffs between monolithic prompts, dedicated agents, dynamic skill injection, and streaming approaches, which orchestration strategy or combination of strategies is optimal for a production multi-tenant conversational AI system?
State of the Art Gap: While individual techniques have been studied in isolation, no comprehensive framework exists for selecting and combining orchestration strategies based on specific system requirements. The December 2025 implementation used dedicated agents with LLM-based intent routing, but this represents only one point in the design space.
Literature Review: Orchestration Strategy Comparison¶
Strategy Comparison Matrix¶
The following table summarizes the key orchestration strategies identified in the literature, evaluated against the critical metrics for our system:
| Strategy | TTFT Impact | Total Latency | Best Use Case | Key Trade-off |
|---|---|---|---|---|
| Monolithic "Mega-Prompt" | Poor (O(n²) prefill) | High | Simple tasks; small token count | Simplicity vs. scalability |
| Dedicated Agents (One per Task) | Good (if router is fast) | Low | Disparate tasks with no overlap | Specialization vs. routing overhead |
| Skill System (Dynamic Injection) | Moderate | Low | Compositional tasks; high reuse | Flexibility vs. complexity |
| GhostShell (Streaming Function Calls) | Excellent | Optimal | Real-time tasks; embodied AI | Speed vs. implementation complexity |
| Statistical Routing (NormStat/VecStat) | Optimal | Minimal | High-throughput systems | Speed vs. classification accuracy |
December 2025 - January 2026 Implementation (Current Work)¶
The current implementation explores the Dedicated Agents strategy with LLM-based intent classification. Key characteristics:
- Intent classifier designed to run in parallel with other analysis nodes (knowledge retrieval, interest-signal detection)
- Action router performs deterministic priority-based routing
- Multi-agent parallel architecture with 5 nodes in first superstep, convergence at action router (implemented January 2026)
Status: POC IN PROGRESS. The architecture design was validated in December 2025. The multi-agent parallel implementation is ongoing in January 2026. Dedicated agents are not yet fully implemented—this remains a hypothesis to test against the skill system alternative. Full A/B testing with statistical significance analysis is planned for Q1 2026.
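To make the parallel fan-out concrete, here is a minimal LangGraph sketch of the pattern described above. Node names, the state schema, and the node bodies are illustrative placeholders (the production graph fans out to 5 analysis nodes; only 3 are shown), not the actual Rose implementation.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class TurnState(TypedDict, total=False):
    user_message: str
    intent: str            # filled by the intent classifier
    documents: list[str]   # filled by knowledge retrieval
    signals: list[str]     # filled by interest-signal detection
    action: str            # decided by the action router


def classify_intent(state: TurnState) -> dict:
    # Placeholder: would call a fast LLM (e.g., Claude Haiku) in production.
    return {"intent": "qualifier"}

def retrieve_knowledge(state: TurnState) -> dict:
    return {"documents": ["..."]}

def detect_signals(state: TurnState) -> dict:
    return {"signals": ["pricing_interest"]}

def route_action(state: TurnState) -> dict:
    # Deterministic priority-based routing over the analysis results.
    return {"action": state.get("intent", "other")}


builder = StateGraph(TurnState)
builder.add_node("intent_classifier", classify_intent)
builder.add_node("knowledge_retrieval", retrieve_knowledge)
builder.add_node("signal_detection", detect_signals)
builder.add_node("action_router", route_action)

# Superstep 1: all analysis nodes start from START and run in parallel.
for node in ("intent_classifier", "knowledge_retrieval", "signal_detection"):
    builder.add_edge(START, node)

# Convergence: the action router waits for all analysis nodes before running.
builder.add_edge(
    ["intent_classifier", "knowledge_retrieval", "signal_detection"],
    "action_router",
)
builder.add_edge("action_router", END)

graph = builder.compile()
```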
Research Area 1: Low-Latency Routing¶
Training-Free Statistical Intent Classification (Literature Review)¶
The literature identifies a promising approach to eliminate the "double TTFT" problem (waiting for a router LLM call, then waiting for the main agent). Methods such as NormStat and VecStat analyze the internal hidden features of the LLM during the initial forward pass.
NormStat applies a Gaussian KL-divergence test to the activation patterns W_ℓ z_{ℓ,t} (layer ℓ, token position t) to categorize user intent with negligible overhead. VecStat uses cosine distance in the same representation space for similar classification.
Key Insight: Instead of a second LLM call for routing, these methods piggyback on the prefill phase of the main model, providing classification "for free" in terms of latency.
Label Space Reduction: When many skills or agents exist, statistical routing can reduce the set of potential instructions provided to the LLM. This reduces the tokens the LLM must process, improving both performance and accuracy by removing irrelevant context.
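For intuition only, the toy sketch below shows how a VecStat-style classifier would operate if hidden states were available: intent centroids are precomputed offline, and the prefill-time hidden state is matched to the nearest centroid by cosine similarity. As the next subsection explains, the hidden-state vector assumed here is not exposed by the black-box APIs Rose uses, so this is illustrative, not implementable in our architecture.

```python
import numpy as np

# Hypothetical per-intent centroids, computed offline from hidden states of
# labeled examples (not obtainable through a black-box API).
intent_centroids = {
    "support": np.random.randn(4096),
    "qualifier": np.random.randn(4096),
    "off_topic": np.random.randn(4096),
}

def classify_from_hidden_state(h: np.ndarray) -> str:
    """VecStat-style routing: nearest intent centroid by cosine similarity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(intent_centroids, key=lambda k: cosine(h, intent_centroids[k]))

# h would be a hidden-layer activation captured during the prefill pass of the
# main model; hosted APIs (Azure OpenAI, OpenRouter) do not expose this tensor.
```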
Limitation: Black-Box API Constraint¶
Critical Finding: NormStat and VecStat methods cannot be implemented in our architecture. These approaches require access to the model's internal representations and hidden layer activations during the prefill phase. Since Rose uses black-box LLM APIs (Azure OpenAI, OpenRouter), we have no access to these internal features.
Alternative Approaches for Black-Box APIs¶
Given the black-box constraint, the literature suggests two alternative approaches for fast intent classification:
1. Semantic Matching with Embedding Models
Use a separate embedding model (such as Sentence-BERT or similar) to match user queries against skill/agent descriptions through vector similarity. This approach:
- Adds minimal latency (~10-50ms for embedding generation)
- Requires no access to main LLM internals
- Can run fully in parallel with other operations
- Scales well with many skills/agents
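A minimal sketch of this embedding-matching approach, assuming the sentence-transformers library; the agent descriptions, model name, and threshold are illustrative values, not tuned production settings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

agent_descriptions = {
    "educator": "Explains product concepts and answers informational questions.",
    "qualifier": "Asks qualifying questions about budget, team size, and needs.",
    "cta": "Guides the visitor toward booking a demo or contacting sales.",
    "support": "Handles troubleshooting and account or billing issues.",
}

# Embed descriptions once at startup; only the query is embedded per turn.
names = list(agent_descriptions)
desc_vecs = model.encode(list(agent_descriptions.values()), normalize_embeddings=True)

def route(query: str, threshold: float = 0.35) -> str:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = desc_vecs @ q  # cosine similarity (vectors are unit-normalized)
    best = int(np.argmax(scores))
    # Below threshold, fall back to the LLM-based classifier.
    return names[best] if scores[best] >= threshold else "llm_fallback"

print(route("How much does the enterprise plan cost?"))
```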
2. Lightweight External Intent Classifier
Use a fine-tuned BERT-class model or a very small SLM (Small Language Model) specifically trained for intent classification. This approach:
- Provides higher accuracy than embedding similarity for complex intents
- Adds ~50-200ms latency depending on model size
- Can be fine-tuned on domain-specific conversation data
- Runs independently of the main LLM
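A sketch of the lightweight-classifier alternative using the Hugging Face transformers pipeline; the checkpoint path is a placeholder for a BERT-class model fine-tuned on labeled Rose conversation turns, and the confidence threshold is an assumption.

```python
from transformers import pipeline

# Hypothetical local checkpoint fine-tuned for intent classification.
intent_classifier = pipeline(
    "text-classification",
    model="path/to/fine-tuned-intent-bert",
)

def classify(utterance: str, min_confidence: float = 0.7) -> str:
    result = intent_classifier(utterance)[0]  # {'label': ..., 'score': ...}
    # Escalate to the LLM-based classifier when the small model is unsure.
    return result["label"] if result["score"] >= min_confidence else "llm_fallback"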
Applicability to Rose¶
Our current implementation uses an LLM-based intent classifier (Claude Haiku) that runs in parallel with other nodes. For further optimization, the black-box alternatives offer viable paths:
- Semantic matching could replace LLM routing for simple intent categories
- Fine-tuned classifier could handle nuanced intents with lower cost than LLM calls
However, both alternatives require training/fine-tuning on labeled conversation data, representing a different tradeoff than the training-free statistical methods described in the literature.
Research Area 2: Modular Prompting & Structural Tagging¶
Modular Prompt Architectures¶
The literature describes Modular Prompt Architectures, in which prompts are assembled from reusable snippets and structured, HTML-like tags delineate the different components:
Atomic Decomposition: Breaking goals into 3-4 distinct sub-tasks (atomic units) that can be addressed by specialized prompt modules.
Tagging System: Using tags like <role>, <input_data>, <requirements>, and <output_format> to clearly separate trusted instructions from untrusted user data. This ensures the model treats injected snippets as high-priority constraints rather than suggestions.
Prompt Chaining: Using the output of one module (e.g., "Identify Intent") as the input for the next skill module, enabling compositional prompt construction.
Quality Benefits: Modular prompts with reduced context directly address the response quality factors identified above—fewer hallucinations and better instruction adherence through focused, task-specific instructions.
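An illustrative assembly of such a tagged prompt; the tag names follow the convention above, while the helper function and placeholder content are assumptions for the sketch, not a prescribed implementation.

```python
def build_tagged_prompt(role: str, user_input: str, requirements: list[str],
                        output_format: str) -> str:
    """Assemble a modular prompt with explicit boundaries between trusted
    instructions (<role>, <requirements>, <output_format>) and untrusted
    user data (<input_data>)."""
    reqs = "\n".join(f"- {r}" for r in requirements)
    return (
        f"<role>\n{role}\n</role>\n"
        f"<requirements>\n{reqs}\n</requirements>\n"
        f"<output_format>\n{output_format}\n</output_format>\n"
        f"<input_data>\n{user_input}\n</input_data>"
    )

prompt = build_tagged_prompt(
    role="You are a B2B product educator for the client's website visitors.",
    user_input="visitor: can you integrate with Salesforce??",
    requirements=["Answer in the visitor's language", "Never invent features"],
    output_format="Two short paragraphs, no markdown headers.",
)
```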
Rose ADR: Prompt Modularization (December 2025)¶
The ADR 2025-12-19: Prompt Modularization defines our planned approach to modular prompts (not yet fully implemented):
- 3-Level Hierarchy: Meta-template → Agent template → Client instructions
- Specialized Agents: 6 agent types (Educator, Qualifier, CTA, Support, Off-topic, Other)
- Variable Naming Convention: {source}_{scope}_{name} (e.g., lf_agent_instructions, db_client_company, rt_session_language)
- Embedded vs Injected: Role, tone, formatting, guardrails embedded directly; agent-specific and client-specific content injected
This architecture represents the Dedicated Agents strategy from the comparison matrix, with modular prompt assembly. However, it remains a hypothesis—the skill system alternative may prove more scalable.
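A sketch of how the 3-level hierarchy and naming convention could translate into template assembly. The template text, agent instructions, and configuration fields are illustrative placeholders, not the ADR's actual content.

```python
from string import Template

# Level 1: meta-template shared by all agents (role, tone, guardrails embedded).
META_TEMPLATE = Template(
    "You are Rose, a conversational assistant for $db_client_company.\n"
    "Always answer in $rt_session_language.\n"
    "$lf_agent_instructions\n"
    "Client-specific notes:\n$db_client_instructions"
)

# Level 2: one template per agent type (6 in the ADR); two shown here.
AGENT_INSTRUCTIONS = {
    "qualifier": "Ask at most one qualifying question per turn (team size, budget, timeline).",
    "educator": "Explain features factually; cite the knowledge base when possible.",
}

def assemble_prompt(agent: str, client_config: dict, session: dict) -> str:
    # Variable naming convention: {source}_{scope}_{name}; the source prefix
    # identifies where each value is loaded from (Level 3 = client instructions).
    return META_TEMPLATE.substitute(
        lf_agent_instructions=AGENT_INSTRUCTIONS[agent],
        db_client_company=client_config["company_name"],
        db_client_instructions=client_config.get("instructions", ""),
        rt_session_language=session["language"],
    )

print(assemble_prompt("qualifier",
                      {"company_name": "Acme", "instructions": "Tutoiement allowed."},
                      {"language": "French"}))
```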
Key Research Question: Dedicated Agents vs. Skill System¶
The December 2025 ADR chose dedicated agents (one per intent category). However, the literature review raises an important question for further investigation:
The Combinatorial Problem: When behavioral dimensions multiply (response style × domain expertise × client customization × conversation phase), dedicated agents create a combinatorial explosion. For example:
- 4 response styles × 3 domain specializations × 10 clients = 120 potential agent configurations
- Each combination would require a separate prompt, defeating the modularization goal
Alternative: Skill System (Dynamic Injection)
Systems like Claude Code use a skill-based architecture where:
- Skills are composable snippets: Small, focused instruction modules that can be combined dynamically
- Runtime composition: Skills are injected based on detected context, not routed to separate agents
- Dimensional independence: Response style skills, domain skills, and client skills can be combined orthogonally
- Single agent with dynamic capabilities: One agent dynamically equipped with relevant skills
Why Skill Systems May Be Better Suited to Complex Multi-Tenant Systems:
- Avoids "Instruction Collision": Massive monolithic prompts are brittle; the model often gets confused or ignores instructions when they are crammed together
- Scalability: As complexity grows, new skill snippets are added to a library rather than performing "complex surgery" on a single mega-prompt
- Contextual Reliability: Modular systems provide a "contract" with the model using structured tags, ensuring it understands which part of the prompt is a trusted instruction versus raw input data
Trade-off Analysis:
| Approach | Pros | Cons |
|---|---|---|
| Dedicated Agents | Clear separation, predictable behavior, easier testing | Combinatorial explosion, duplication, routing overhead |
| Skill System | Composable, scales with dimensions, single routing decision | Complexity in skill composition, potential instruction conflicts |
Optimized Skill System: Two-Layer Hybrid Routing¶
Since we use black-box APIs (no internal model access), the literature recommends a two-layer hybrid routing strategy to maintain fast TTFT:
Layer 1: Lightweight Intent Routing
To avoid the "double TTFT" of using a premium LLM to route to skills:
- Keyword/Regex Filters: High-confidence terms trigger specific skills instantly with zero LLM overhead
- Small Language Model (SLM) Router: Fine-tuned BERT or embedding similarity search (Route0x) handles ~98% of intent detection with minimal latency
- Escalation Only When Needed: Premium model called only for execution, not routing
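A minimal sketch of the Layer 1 filter: hypothetical high-confidence trigger patterns are checked first, and anything unmatched falls through to the SLM or embedding router (see Research Area 1). Patterns and skill names are illustrative.

```python
import re

# High-confidence triggers (hypothetical examples): a match selects the skill
# with zero LLM overhead.
KEYWORD_TRIGGERS = [
    (re.compile(r"\b(unsubscribe|rgpd|gdpr|delete my data)\b", re.I), "privacy_skill"),
    (re.compile(r"\b(demo|book a call|talk to sales)\b", re.I), "cta_skill"),
    (re.compile(r"\b(bug|error|not working|broken)\b", re.I), "support_skill"),
]

def layer1_route(message: str) -> str | None:
    for pattern, skill in KEYWORD_TRIGGERS:
        if pattern.search(message):
            return skill
    return None  # no confident match: fall through to the SLM / embedding router

skill = layer1_route("Hi, the widget is not working on mobile")
if skill is None:
    skill = "..."  # escalate to the embedding or SLM router here
```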
Layer 2: Decision Token Pattern (MediaTek Research)
As the first step in the injected prompt, instruct the LLM to output a Decision Token (e.g., <|use_tool|> or <|answer|>):
- Forces the model to decide if injected skills are relevant before generating reasoning text
- If the user asks a simple question not requiring a skill, the model outputs <|answer|> and starts answering immediately
- Results in the lowest possible TTFT for simple queries
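A sketch of how the decision token could be consumed on a streaming response: the first tokens are buffered until the decision is known, and the remaining stream is then either forwarded as the answer or redirected to skill execution. The token strings follow the pattern above; the streaming client is a placeholder.

```python
def read_decision(stream) -> tuple[str, str]:
    """Consume streamed chunks until a decision token is identified.
    Returns (decision, buffered_text). The caller keeps iterating `stream`
    for the rest of the generation."""
    buffer = ""
    for chunk in stream:
        buffer += chunk
        for token in ("<|use_tool|>", "<|answer|>"):
            if token in buffer:
                return token, buffer.split(token, 1)[1]
        if len(buffer) > 32:
            # Model ignored the instruction; treat output as a plain answer.
            return "<|answer|>", buffer
    return "<|answer|>", buffer

# Usage sketch (stream_tokens is a placeholder for the LLM streaming call):
# decision, head = read_decision(stream_tokens(prompt))
# if decision == "<|use_tool|>": run the selected skill before answering.
```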
BUTTON Pipeline: Bottom-Up Then Top-Down¶
The literature describes a structured approach for building skill-based systems:
- Bottom-Up: Build interactions from atomic tasks (small, focused skills)
- Top-Down: Use a supervisor agent to orchestrate skill sequences
- Parallelization: Instruct the model to call multiple skills in a single turn if independent, reducing total end-to-end latency
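A sketch of the bottom-up/top-down split with parallel execution of independent skills via asyncio; the skill functions and the supervisor's plan are placeholders rather than a real orchestration policy.

```python
import asyncio

# Bottom-up: atomic skills, each small and independently testable.
async def lookup_pricing(query: str) -> str:
    return "pricing info ..."

async def detect_language(query: str) -> str:
    return "fr"

async def fetch_client_facts(query: str) -> str:
    return "client facts ..."

# Top-down: a supervisor decides which skills apply, then runs the independent
# ones concurrently in a single turn to cut end-to-end latency.
async def supervisor(query: str) -> dict:
    plan = [lookup_pricing, detect_language, fetch_client_facts]  # placeholder plan
    results = await asyncio.gather(*(skill(query) for skill in plan))
    return dict(zip((s.__name__ for s in plan), results))

print(asyncio.run(supervisor("Combien coûte l'offre entreprise ?")))
```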
Latency Comparison (Updated with Skill System)¶
| Strategy | TTFT Performance | Reasoning |
|---|---|---|
| Monolithic Prompt | Poor | Prefill cost is O(n²); large prompt delays first token significantly |
| Dedicated Agents | Moderate | Requires "Router" call first; if router is LLM, TTFT is paid twice |
| Skill System (RAG-based) | Good | Injects only Top-K relevant snippets; prefill stays small |
| Skill System + Decision Token | Very Good | Decision token allows bypassing skill injection for simple queries |
| GhostShell (Streaming) | Optimal | Parses function tokens while streaming, starts action before generation ends |
Analogy: A Dedicated Agent system is like a hospital where you wait at the front desk (router) before being sent to a specialist. A Skill System is like a trauma surgeon with a modular toolkit at their side. The Decision Token is the surgeon's immediate glance: they decide instantly whether they need a scalpel (a skill) or can simply apply pressure (internal knowledge). By not leaving the room, treatment starts faster.
Open Question for Q1 2026: Should we evolve from dedicated agents toward a skill system to better handle multi-dimensional customization? This would require:
- Defining skill boundaries and composition rules
- Implementing the two-layer hybrid routing (SLM + Decision Token)
- Building the BUTTON pipeline for skill orchestration
- Handling potential conflicts between composed skills
- Measuring quality impact vs. dedicated agents
Applicability to Rose¶
The dedicated agents approach (ADR approved, not yet implemented) targets the current scope of 6 intent categories. However, as client customization requirements grow more complex (industry-specific behaviors, regional variations, product-specific knowledge), a skill system may offer better scalability. The two-layer hybrid routing addresses the "double TTFT" concern that would otherwise make skill systems slower than dedicated agents.
This is the key research question for Q1 2026: Should we implement the dedicated agents as designed in the ADR, or pivot to a skill system before full implementation?
Research Area 3: Streaming Concurrency (GhostShell Methodology)¶
Reasoning-While-Acting¶
The most significant finding for latency optimization is the shift from sequential execution to Reasoning-While-Acting. The GhostShell methodology demonstrates that function calls can be issued incrementally as tokens stream from the LLM.
Implementation: By using an XML function token parser, tool calls can be triggered the moment a closing tag is detected, rather than waiting for full response generation. This approach achieves up to 66x faster response times than native function calling APIs.
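A minimal sketch of this incremental-parsing idea (not GhostShell's actual implementation): tool calls are dispatched the moment a closing tag appears in the token stream, while generation continues. The tag name and dispatch hook are assumptions.

```python
import re
from typing import Callable, Iterable

CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def stream_and_dispatch(tokens: Iterable[str], dispatch: Callable[[str], None]) -> str:
    """Scan the token stream as it arrives; each time a </tool_call> closing
    tag completes, dispatch that call immediately instead of waiting for the
    full response. Returns the non-tool text of the response."""
    buffer, answer = "", []
    for chunk in tokens:
        buffer += chunk
        while (match := CALL_PATTERN.search(buffer)):
            dispatch(match.group(1).strip())       # start the tool call now
            answer.append(buffer[:match.start()])  # keep surrounding text
            buffer = buffer[match.end():]
    answer.append(buffer)
    return "".join(answer)

# Usage sketch: dispatch could push calls onto an asyncio queue so tools run
# concurrently with the remainder of the generation.
```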
Parallelization & Operator Fusion: Independent subtasks can execute simultaneously across different instances. "Fusing" operators instructs the model to apply multiple logic steps and output a single JSON object in one pass, reducing repeated initialization costs.
Speculative Execution: Starting a "generation" stage using partial search results while the primary retrieval process is still finishing.
Implementation Status¶
Speculative Document Retrieval: We have implemented speculative document retrieval in the retrieval node. The system speculatively starts retrieval operations and uses the results only if needed, reducing wait time on the critical path.
Speculative Generation: Not yet implemented. This would involve starting response generation with partial context, then incorporating full retrieval results as they become available.
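A sketch of the speculative retrieval pattern: the retrieval task is started optimistically and awaited (or cancelled) only once routing has decided whether documents are needed. Function names and timings are placeholders, not the production node code.

```python
import asyncio

async def retrieve_documents(query: str) -> list[str]:
    await asyncio.sleep(0.3)          # stands in for the vector-store round trip
    return ["doc snippet ..."]

async def decide_if_docs_needed(query: str) -> bool:
    await asyncio.sleep(0.1)          # stands in for intent/action routing
    return True

async def handle_turn(query: str) -> list[str]:
    # Speculative start: retrieval runs while routing is still deciding.
    retrieval_task = asyncio.create_task(retrieve_documents(query))
    if await decide_if_docs_needed(query):
        return await retrieval_task    # already partially (or fully) done
    retrieval_task.cancel()            # documents not needed: drop the result
    return []

print(asyncio.run(handle_turn("What integrations do you support?")))
```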
Implementation Status Summary¶
| Technique | Status | Notes |
|---|---|---|
| Intent Router (LLM-based) | POC (Dec 2025) | Design validated, not yet in production |
| Multi-Agent Parallel Architecture | In Progress (Jan 2026) | 5 parallel nodes in superstep 1 |
| Dedicated Agents (6 types) | Hypothesis | ADR approved, not yet implemented; comparing vs skill system |
| Speculative Document Retrieval | Implemented (Jan 2026) | In retrieval node |
| Statistical Routing (NormStat/VecStat) | Not Applicable | Requires internal model access; black-box API constraint |
| Embedding-Based Routing | Research only | Hypothesis A - alternative for black-box APIs |
| Skill System (Dynamic Injection) | POC (Jan 2026) | Hypothesis B - single-turn multi-skill injection |
| GhostShell Streaming | Research only | Hypothesis C for Q1 2026 |
| Context Reduction Quality Impact | Research only | Hypothesis D for Q1 2026 |
Hypotheses for Q1 2026 Testing¶
Based on this literature review, the following hypotheses are proposed for experimental validation in Q1 2026:
Hypothesis A: Embedding-Based Routing for Latency Optimization¶
"Semantic matching using embedding models (Sentence-BERT or similar) for intent classification will reduce routing latency and cost compared to LLM-based intent classification, while maintaining acceptable classification accuracy."
Technical Uncertainty: Can embedding similarity achieve sufficient accuracy for production intent classification in conversational AI, given that we cannot use internal-access methods like NormStat/VecStat with black-box APIs?
Validation Method: Implement embedding-based routing and compare latency, cost, and classification accuracy against current LLM-based approach (Claude Haiku) using production traffic.
Based on: "Intent Detection in the Age of LLMs", Deepchecks on AI Agent Routers
Hypothesis B: Skill System vs. Dedicated Agents vs. Monolithic Prompt¶
"A skill-based architecture with dynamic injection will scale better than both dedicated agents and monolithic prompts when handling multi-dimensional customization (response style × domain × client), while maintaining equivalent or better response quality."
Technical Uncertainty: Can dynamically composed skills maintain coherent behavior without instruction conflicts, and what is the quality impact compared to dedicated agent prompts and monolithic prompts?
Validation Method:
- Design skill taxonomy and composition rules
- Implement prototype skill injection system
- Compare response quality, customization effort, and scalability metrics against both dedicated agents and current monolithic prompts
Based on: Claude Code skill architecture, "Modular and Hybrid Frameworks for LLM-Based Agents in Slay the Spire", OptizenApp Modular Prompting framework
Reference: ADR 2025-12-19: Prompt Modularization
Hypothesis C: Streaming Function Calls (GhostShell-inspired)¶
"Issuing tool calls incrementally as tokens stream will reduce total response latency compared to waiting for complete response generation."
Technical Uncertainty: Can streaming function call parsing be reliably implemented with production LLM APIs, and what is the actual latency improvement in conversational (vs. embodied) AI contexts?
Validation Method: Implement XML-based streaming function parser and measure total response latency compared to current sequential execution.
Based on: "GhostShell: Streaming LLM Function Calls for Concurrent Embodied Programming" (reported up to 66x improvement)
Hypothesis D: Context Reduction for Quality Improvement¶
"Reducing prompt context by routing to focused, task-specific prompts will measurably improve instruction adherence and reduce hallucination rates compared to monolithic prompts."
Technical Uncertainty: What is the quantifiable relationship between prompt size and response quality metrics (hallucination rate, instruction adherence) in production conversational AI?
Validation Method: Compare response quality metrics between monolithic prompts (~2000 tokens) and focused agent prompts (~400 tokens) across matched conversation samples.
Based on: "Agentic CBR in Action" (hallucination reduction through anchoring), Research on instruction-following degradation in long prompts
Research Area 4: Prompt Modularization Implementation¶
This research area documents the implementation work to evolve beyond the monolithic prompt. Two architectural approaches were developed and compared.
Work Done: Two Approaches Implemented¶
1. Dedicated Agents with 3-Level Template Hierarchy
Described in ADR 2025-12-19:
- Level 1 - Meta-template: Common structure shared by all agents (role, tone, formatting, guardrails)
- Level 2 - Agent template: Specialized instructions for each of 6 agent types (Educator, Qualifier, CTA, Support, Off-topic, Other)
- Level 3 - Client instructions: Per-client customization injected into the agent template
The intent classifier routes each conversation turn to the appropriate agent. One agent handles each turn.
2. Skill System with Single-Context Aggregation
An alternative approach where:
- Skills as files: Each skill is a SKILL.md file with YAML frontmatter (name, description, category, conditions, dependencies) and markdown instructions
- LLM-based selection: A fast LLM selects which skills are relevant for the current context
- Heuristics refinement: Deterministic rules improve the LLM selection (exclusivity, frequency limits, auto-injection of system skills)
- Single-context aggregation: ALL selected skills are combined and injected into ONE LLM call
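A sketch of the aggregation step, assuming PyYAML for the frontmatter; the directory layout, frontmatter fields, and hardcoded skill selection mirror the description above but are illustrative (the real selection comes from the fast LLM plus the heuristics layer).

```python
from pathlib import Path
import yaml

def load_skill(path: Path) -> dict:
    """Parse a SKILL.md file: YAML frontmatter (name, description, category,
    conditions, dependencies) followed by markdown instructions."""
    raw = path.read_text(encoding="utf-8")
    _, frontmatter, body = raw.split("---", 2)
    skill = yaml.safe_load(frontmatter)
    skill["instructions"] = body.strip()
    return skill

def aggregate_skills(skill_dir: Path, selected_names: list[str]) -> str:
    """Single-context aggregation: every selected skill is concatenated into
    one prompt section and sent in a single LLM call."""
    skills = [load_skill(p) for p in sorted(skill_dir.glob("*/SKILL.md"))]
    chosen = [s for s in skills if s["name"] in selected_names]
    return "\n\n".join(
        f"## Skill: {s['name']}\n{s['instructions']}" for s in chosen
    )

# selected_names would come from the LLM selection + heuristics refinement
# described above; a hardcoded example is used here.
print(aggregate_skills(Path("skills"), ["qualification", "pricing_faq"]))
```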
Technical Uncertainty¶
The core question: How to handle multi-dimensional customization (response style × domain expertise × client customization × conversation phase) without combinatorial explosion?
Dedicated agents challenge: N agents × M clients × P contexts = potentially hundreds of prompt configurations to maintain.
Skill system challenge: Can a single LLM call process multiple skills simultaneously while maintaining coherent reasoning?
Latency constraint: Both approaches must meet TTFT requirements. Skill selection adds a routing step that must be fast (addressed with Claude Haiku + heuristics).
Hypothesis¶
"A skill-based architecture with dynamic injection into a single LLM context will scale better than dedicated agents when handling multi-dimensional customization, while achieving equivalent or better response quality."
How This Differs from Literature¶
The literature (Research Area 2) describes approaches like HTML-like structural tagging (<role>, <requirements>) and multi-agent sequential orchestration (one LLM call per agent/skill).
Our skill system differs in one key aspect: All selected skills are aggregated into a single prompt and processed in ONE LLM call. This enables "compound intelligence" where skills can inform each other within the same reasoning context, rather than being isolated in separate calls.
Results¶
[PLACEHOLDER - TO COMPLETE AFTER A/B TESTING]
The architecture has been implemented and deployed. Quantitative validation is pending A/B testing.
Metrics to measure:
- Response latency comparison vs. monolithic prompt (target: equivalent or lower)
- Instruction adherence rate via LLM-as-judge evaluation (target: >90%)
- Skill activation accuracy—correct skills selected for context (target: >85%)
- User satisfaction metrics from PostHog analytics
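A sketch of the instruction-adherence check via LLM-as-judge. call_llm is a placeholder for whichever API client is used (Azure OpenAI / OpenRouter), and the rubric and JSON schema are assumptions for illustration.

```python
import json

JUDGE_PROMPT = """You are grading a chatbot reply against its instructions.
Instructions given to the chatbot:
{instructions}

Chatbot reply:
{reply}

Return JSON: {{"adherent": true/false, "violations": ["..."]}}"""

def judge_adherence(instructions: str, reply: str, call_llm) -> dict:
    """call_llm(prompt: str) -> str is a placeholder for the actual client;
    the judge model should differ from the model that produced the reply."""
    raw = call_llm(JUDGE_PROMPT.format(instructions=instructions, reply=reply))
    return json.loads(raw)

# Adherence rate over a sample = mean of judge_adherence(...)["adherent"].
```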
Next Steps¶
- A/B testing of skill system vs. legacy monolithic prompt with statistical significance analysis
- Development of measurement framework for instruction adherence using LLM-as-judge
- Expansion of skill library for additional use cases (industry-specific, product-specific)
- Fine-tuning of heuristics layer based on production performance data
R&D Activities Summary¶
The following R&D activities were conducted during January 2026:
- Literature Review: Comprehensive analysis of 31 sources (21 academic papers, 10 industry guides) on LLM orchestration strategies
- Strategy Comparison: Systematic evaluation of 5 orchestration approaches against TTFT, latency, security, and efficiency criteria
- Architectural Analysis: Deep dive into Dedicated Agents vs. Skill System trade-offs, including two-layer hybrid routing strategy
- Black-Box API Constraints: Identified that statistical routing (NormStat/VecStat) is not applicable; documented embedding-based alternatives
- Hypothesis Formulation: Development of 4 testable hypotheses for Q1 2026 experimental work
- Multi-Agent Parallel Implementation: Ongoing implementation of 5-node parallel architecture in LangGraph
- Speculative Retrieval Implementation: Implementation of speculative document retrieval in the retrieval node
- Skill System POC: Implementation of single-turn multi-skill injection architecture to validate Hypothesis B (Research Area 4)
Other Development (Non-R&D)¶
The following activities represent standard software development work:
- Documentation updates
- Bug fixes and maintenance
- Client configuration updates
Sources¶
Academic Papers¶
- "Fast Intent Classification for LLM Routing via Statistical Analysis of Representations" - Introduces NormStat and VecStat methods for training-free intent classification during the prefill phase with negligible latency overhead.
- "GhostShell: Streaming LLM Function Calls for Concurrent Embodied Programming" - Presents methodology for issuing function calls incrementally as tokens stream, achieving up to 66x faster response times than native APIs.
- "Agentic CBR in Action: Empowering Loan Approvals Through Interactive, Counterfactual Explanations" - Explores Case-Based Reasoning (CBR) to anchor LLM responses and reduce hallucinations in financial decision-making.
- "Architecting Large Action Models for Human-in-the-Loop Intelligent Robots" - Details modular neuro-symbolic architecture for grounding LLM reasoning in physical robotic actions.
- "ComplexFuncBench & ComplexEval" - Benchmark and framework for evaluating multi-step and constrained function calling under long-context scenarios.
- "Continuous Prompts (CPs)" - Framework for LLM-augmented stream processing, extending RAG to continuous, stateful semantic operators.
- "Exploring Multimodal Collaborative Storytelling with Pepper" - Study on zero-shot LLMs for interactive Human-Robot Interaction (HRI).
- "Granite-Function Calling Model" - Research on training models on granular sub-tasks (parameter detection, function name detection) to improve generalizability.
- "HedraRAG" - Co-designed system for coordinating LLM generation and database retrieval to minimize pipeline stalls in heterogeneous RAG serving.
- "Intent Detection in the Age of LLMs" - Evaluates SOTA LLMs for intent detection, proposes hybrid systems using uncertainty-based routing.
- "Intent Recognition and Out-of-Scope Detection in Multi-party Conversations" - Uses BERT and LLMs in zero-shot settings to reduce label space for classification.
- "Kairos" - Multi-agent serving system optimizing end-to-end latency through workflow-aware priority scheduling and memory-aware dispatching.
- "MINT" - Benchmark evaluating LLMs in multi-turn interactions with tools and natural language feedback.
- "Modular and Hybrid Frameworks for LLM-Based Agents in Slay the Spire" - Demonstrates task decomposition and specialized prompts outperform monolithic agents in complex strategy games.
- "Self-Organizing Agent Network (SOAN)" - Proposes structure-driven orchestration for automating deeply nested business workflows.
- "StreamingRAG" - Framework for real-time video understanding using evolving knowledge graphs and lightweight models.
- "Tool Learning with Large Language Models: A Survey" - Comprehensive review of tool learning implementation in LLMs.
- "Towards Resource-Efficient Multimodal Intelligence" - Investigates learned routing among specialized experts to balance cost and quality.
- Route0x - Embedding similarity-based routing system that handles ~98% of intent detection tasks with minimal latency, avoiding LLM routing overhead.
- MediaTek Research: Decision Token Pattern - Technique for forcing the LLM to output a decision token (<|use_tool|> or <|answer|>) as the first step, enabling immediate response for simple queries that don't require tool/skill invocation.
- BUTTON Pipeline (Bottom-Up Then Top-Down) - Methodology for building complex LLM systems from atomic tasks (Bottom-Up) orchestrated by a supervisor agent (Top-Down), with parallelization of independent skill calls.
Technical Industry Documentation¶
- Adaptive Live - Documents 5 essential design patterns: Chaining, Routing, Parallelization, Orchestrator-Worker, Evaluator-Optimizer.
- BentoML & NVIDIA Technical Documentation - Defines key performance metrics: TTFT (Time to First Token), TPOT (Time per Output Token), ITL (Inter-token Latency).
- Deepchecks on AI Agent Routers - Analyzes intent classification, semantic matching, and context-aware routing techniques.
- Evidently AI & Witness AI - Detailed guides on prompt injection risks (direct vs. indirect) and layered security strategies.
- Gladia on Concurrent Pipelines - Lessons from live deployment of real-time voice AI, emphasizing reasoning-while-acting.
- Hivenet Practical Guide - Guide for LLM inference in production, highlighting hardware choices and serving architectures.
- Microsoft MSRC - Details Microsoft's defense-in-depth approach against indirect prompt injection, including Spotlighting and Prompt Shields.
- NVIDIA on Securing the KV Cache - Examines how caching optimizations can be exploited as timing side-channels and mitigation strategies.
- OptizenApp on Modular Prompting - Practical framework for building scalable systems using HTML-like tags (<role>, <requirements>) to delineate instructions.
- DAIR.AI Prompt Engineering Guide - Comprehensive guide on agent components, memory formats, and tool usage frameworks.
Next Steps (Q1 2026)¶
- Prioritize Hypotheses: Evaluate implementation complexity and expected impact to determine testing order for the 4 hypotheses
- Key Architectural Decision: Skill System vs. Dedicated Agents vs. Monolithic Prompt
  - Design skill taxonomy if pursuing skill system approach
  - Prototype skill composition and injection mechanism
  - Compare scalability against both dedicated agents and current monolithic prompts
- Embedding-Based Routing POC: Implement semantic matching with embedding models as alternative to LLM-based routing (black-box compatible)
- Quality Measurement Framework: Establish metrics and evaluation methodology for measuring hallucination rates and instruction adherence
- Complete A/B Testing: Finalize A/B test for December's intent router implementation
Appendix: Literature Search Template for Skill Injection Research¶
The following search queries can be used to identify relevant academic and industry literature for the skill injection approach documented in Research Area 4:
Academic Search Queries¶
Dynamic Prompt/Skill Injection:
- "dynamic prompt injection large language models"
- "skill composition LLM agents"
- "modular prompt engineering transformer"
- "instruction injection neural language models"
Multi-Agent Orchestration:
- "multi-agent LLM orchestration latency"
- "agent coordination language models"
- "sequential vs parallel agent execution LLM"
- "compound AI systems orchestration"
Context Optimization:
- "instruction following long context LLM"
- "prompt length instruction adherence"
- "cognitive load large language models"
- "context window optimization transformer"
Few-Shot Learning & Dynamic Prompting:
- "few-shot prompting composition"
- "dynamic in-context learning"
- "retrieval-augmented prompting"
- "adaptive prompt selection"
Industry/Technical Sources¶
- Anthropic Claude documentation (skill/tool usage patterns)
- OpenAI function calling best practices
- LangChain/LangGraph agent documentation
- Hugging Face agent tutorials and benchmarks
Key Conferences to Monitor¶
- NeurIPS (Neural Information Processing Systems)
- ACL (Association for Computational Linguistics)
- EMNLP (Empirical Methods in NLP)
- ICLR (International Conference on Learning Representations)
- NAACL (North American Chapter of ACL)