
Research Papers

Internal R&D papers built from production Rose data using the PaperOrchestra 5-agent pipeline. Sources, BibTeX, telemetry dumps, and intermediate pipeline artefacts live at the repository root under research-papers/.

Compiled PDFs are gitignored (regenerable in ≈30 s via tectonic); the canonical source is each paper's paper.tex. Run the following from research-papers/:

just build <slug>          # paper.tex → paper.pdf
just open <slug>           # open PDF
just build-all             # build every paper

Catalogue

multi-tenant-skill-architecture

Two-Step Skill Loading with Hard-Rule Enforcement and Per-Tenant Overrides: A Skill-Centric Agent Architecture for Multi-Tenant B2B Chat

Position / systems paper describing the six co-designed mechanisms behind Rose's deployed skill system: description-only LLM selector, deterministic rule applier (load / unload / swap), turn-gated availability, post-LLM dependency and exclusion resolution, per-tenant skill overrides, and runtime description enrichment. Grounded in production code (backend/packages/ixchat + backend/packages/ixskills) and Langfuse + PostHog telemetry around the 2026-03-02 rollout.

Key findings: the selector's input is ≈ 18× smaller than the answer-writer call; answer-writer latency fell 8.3 % on average and 6.9 % at p95 across the rollout; the applier modifies the selector's output in only 1 % of turns, so the rules act as a guarantee layer rather than a correction layer. The system comprises 13 global SKILL.md files plus 84 client-specific ones across 20+ tenants.

Tracks IX-3055 for the requires / blocks metadata wiring.
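
To make the post-LLM dependency and exclusion resolution concrete, here is a minimal sketch of how a deterministic pass over selector output could work. It is not the code in ixchat or ixskills: the SkillMeta fields, the min_turn gate, the skill names, and the "earlier selection wins a conflict" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillMeta:
    """Hypothetical per-skill metadata; field names are illustrative only."""
    name: str
    requires: frozenset = frozenset()  # skills that must be loaded alongside
    blocks: frozenset = frozenset()    # skills that must not be co-loaded
    min_turn: int = 0                  # assumed form of turn-gated availability


def closure(name, catalog):
    """Return `name` plus everything it transitively requires."""
    seen, stack = [], [name]
    while stack:
        current = stack.pop()
        if current in seen or current not in catalog:
            continue  # already handled, or unknown to the catalog
        seen.append(current)
        stack.extend(sorted(catalog[current].requires, reverse=True))
    return seen


def conflicts(a, b, catalog):
    """True if either skill lists the other in its `blocks` set."""
    return b in catalog[a].blocks or a in catalog[b].blocks


def resolve(selected, catalog, turn):
    """Apply gating, dependency, and exclusion rules to selector output.

    `selected` is in priority order; on a conflict the earlier skill wins
    (an assumed policy). Gating is checked only for selected skills, not
    for the skills they pull in through `requires`.
    """
    kept = []
    for name in selected:
        if name not in catalog or catalog[name].min_turn > turn:
            continue  # unknown skill, or not yet available at this turn
        unit = [s for s in closure(name, catalog) if s not in kept]
        if any(conflicts(s, k, catalog) for s in unit for k in kept):
            continue  # drop the skill together with its dependencies
        kept.extend(unit)
    return kept


# Hypothetical catalog, for illustration only.
catalog = {
    "pricing": SkillMeta("pricing", requires=frozenset({"product-facts"})),
    "product-facts": SkillMeta("product-facts"),
    "handoff": SkillMeta("handoff", blocks=frozenset({"pricing"})),
    "demo-booking": SkillMeta("demo-booking", min_turn=2),
}

print(resolve(["pricing", "handoff", "demo-booking"], catalog, turn=1))
```

Because the pass is a pure function of the selector output, the catalog, and the turn number, the same inputs always yield the same loaded set, which is the property a guarantee layer needs.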

Status: Drafted (iter0, gates passed). Author: Benoit Pothier.

deferred-execution

LangGraph deferred-execution pattern: pre/post natural experiment on production chat latency.

Measures the LangGraph deferred-execution pattern's effect on end-to-end chat latency using a pre/post natural experiment around the rollout.

Status: Drafted (iter3, score 72.8).
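
A pre/post comparison of this kind reduces to summarising two latency samples and reporting the relative change. The sketch below shows one way to do that; the nearest-rank p95 definition, the function names, and the sample values are assumptions, and the numbers are synthetic rather than taken from the paper's telemetry.

```python
from statistics import mean


def p95(samples):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def relative_change_pct(pre, post):
    """Signed percentage change from pre to post; negative means faster."""
    return (post - pre) / pre * 100.0


def pre_post_summary(pre_ms, post_ms):
    """Compare mean and p95 latency before and after a rollout."""
    return {
        "avg_pct": relative_change_pct(mean(pre_ms), mean(post_ms)),
        "p95_pct": relative_change_pct(p95(pre_ms), p95(post_ms)),
    }


# Synthetic latencies in milliseconds, for illustration only.
pre_ms = [1000] * 19 + [2000]
post_ms = [900] * 19 + [1800]
summary = pre_post_summary(pre_ms, post_ms)
print(summary)
```

A pre/post design attributes the whole difference to the rollout, so anything else that changed in the same window (traffic mix, model version, infrastructure) is a potential confound to rule out separately.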

skill-selector-distillation

Distilling the LLM-based skill-selector into a TF-IDF / MiniLM classifier; tests whether lexical features suffice for production routing.

  • Hypothesis (a): a linear multi-label classifier trained on LLM-produced pseudo-labels matches the LLM's skill-recall on a human-labeled held-out set, cutting per-turn selector latency by three orders of magnitude.
  • Hypothesis (b): TF-IDF lexical features are sufficient — adding a pretrained MiniLM sentence encoder does not improve human-labeled recall.
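
Both hypotheses are stated in terms of recall against human labels. The sketch below assumes "skill-recall" means micro-averaged recall over per-turn skill sets; the paper may define it differently, and the function name and labels are made up for illustration.

```python
def micro_recall(gold, predicted):
    """Micro-averaged recall over multi-label predictions.

    gold, predicted: equal-length lists of sets of skill names, one per turn.
    Returns the fraction of gold skills that were also predicted.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same turns")
    hits = sum(len(g & p) for g, p in zip(gold, predicted))
    total = sum(len(g) for g in gold)
    return hits / total if total else 0.0


# Hypothetical human labels and two systems' outputs on three turns.
human = [{"pricing"}, {"pricing", "handoff"}, {"demo-booking"}]
llm_selector = [{"pricing"}, {"pricing"}, {"demo-booking", "handoff"}]
classifier = [{"pricing"}, {"pricing", "handoff"}, set()]

llm_recall = micro_recall(human, llm_selector)
clf_recall = micro_recall(human, classifier)
print(llm_recall, clf_recall)
```

Recall ignores extra predicted skills, so a comparison like this would normally be read together with precision or the average number of skills loaded per turn.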

Status: Drafted (v2 revision; pipeline v1).

Layout convention

Each paper subdirectory follows the same structure:

research-papers/<slug>/
├── README.md                   Paper claim, layout, reproduction steps.
├── paper.tex                   Canonical LaTeX source.
├── refs.bib                    BibTeX entries.
├── figures/                    PNG figures + captions.json.
├── inputs/                     PaperOrchestra inputs: idea, log, template, guidelines.
├── pipeline/                   Intermediate artefacts: outline, citation pool,
│                               provenance, drafts, refinement snapshots.
└── telemetry/ or data/         Production data dumps grounding the empirical claims.
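
The convention can be checked mechanically. The helper below is a hypothetical sketch, not a script that exists in the repository; it assumes every entry in the tree above is mandatory and that either telemetry/ or data/ satisfies the last line.

```python
from pathlib import Path

# Entries taken from the layout tree; treating all of them as
# mandatory is an assumption.
REQUIRED_FILES = ("README.md", "paper.tex", "refs.bib")
REQUIRED_DIRS = ("figures", "inputs", "pipeline")
DATA_DIRS = ("telemetry", "data")  # at least one of these must exist


def missing_entries(paper_dir):
    """Return the layout entries absent from a paper subdirectory."""
    root = Path(paper_dir)
    missing = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    missing += [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if not any((root / d).is_dir() for d in DATA_DIRS):
        missing.append("telemetry/ or data/")
    return missing
```

Calling missing_entries("research-papers/<slug>") with a real slug returns an empty list when the directory matches the convention.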

See research-papers/README.md for the workflow to add a new paper.