rose-eval (CLI)

The rose-eval CLI tool runs evaluations against Langfuse datasets. It invokes the ixevaluation evaluators directly, with Langfuse tracing enabled by default and explicit control over all parameters.

Installation

The CLI tool is available after activating the Poetry shell in the backend directory:

cd backend
poetry shell

Once in the Poetry shell, the rose-eval command is available.

Required Environment Variables

Ensure these environment variables are set (typically via .env.test):

| Variable | Description |
| --- | --- |
| LANGFUSE_SECRET_KEY | Langfuse API secret key |
| LANGFUSE_PUBLIC_KEY | Langfuse API public key |
| LANGFUSE_HOST | Langfuse API host URL |
| AZURE_OPENAI_* | Azure OpenAI configuration (for LLM calls) |
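Before a run, you can sanity-check that the Langfuse variables are present. A minimal sketch (check_env is a helper written for this page, not part of rose-eval; the deployment-specific AZURE_OPENAI_* settings are not checked):

```shell
# Report any unset Langfuse variable; returns non-zero if one is missing.
check_env() {
  rc=0
  for var in LANGFUSE_SECRET_KEY LANGFUSE_PUBLIC_KEY LANGFUSE_HOST; do
    eval "val=\${$var:-}"          # indirect lookup of the variable's value
    if [ -z "$val" ]; then
      echo "missing: $var"
      rc=1
    fi
  done
  return $rc
}

check_env || echo "load .env.test before running rose-eval"
```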

Usage

rose-eval <command> <dataset> [OPTIONS]

Commands

| Command | Description | Metric | Default Threshold |
| --- | --- | --- | --- |
| e2e | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| intent-classifier | Intent classification accuracy | Macro F1 | 0.80 |
| skill-selector | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |
| suggestion | Suggestion quality evaluation | Mean quality | 0.60 |
| features | List or run feature-* eval datasets | Per-dataset | Per-dataset |

Shared Options

These options are available on all commands:

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| dataset | Yes | - | Langfuse dataset name (positional argument) |
| --env | No | test | Environment for credentials: test, development, staging, production |
| --trace | No | yes | Enable Langfuse tracing: yes or no |
| --local-prompt | No | Auto | Use local prompt files (auto: true for test/development envs) |
| --item | No | - | Filter items by ID prefix or query substring |
| --threshold | No | Per type | Override pass/fail threshold (0.0-1.0) |
| --sample-size | No | - | Run on N random items for quick testing |
| --verbose | No | false | Show detailed per-item results |
| --json | No | false | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| --concurrency | No | 3 | Number of items to process concurrently |
| --debug | No | false | Enable debug logging to console |

Command-Specific Options

e2e

| Option | Default | Description |
| --- | --- | --- |
| --agentic-system | new | Override agentic system routing: legacy or new |

skill-selector

| Option | Default | Description |
| --- | --- | --- |
| --micro-f1-threshold | 0.75 | Override micro-F1 threshold |

suggestion

| Option | Default | Description |
| --- | --- | --- |
| --suggestion-type | - | Filter by type: follow-up or answer |
| --regenerate-model | - | Re-generate suggestions with this model before judging (e.g., gpt-4.1-mini) |
| --regenerate-provider | openai | Provider for re-generation: azure, openai, google, cerebras, anthropic, groq, groq_direct |

features

The features command has two sub-subcommands:

| Sub-command | Description |
| --- | --- |
| list | List all feature-* evaluation datasets |
| run [name] | Run feature evaluation(s). Omit name to run all. |

Examples

E2E evaluation (uses "new" agentic system by default)

rose-eval e2e mayday.fr-v2

E2E evaluation with explicit agentic system override

rose-eval e2e mayday.fr-v2 --agentic-system legacy

Quick smoke test on 5 random items

rose-eval e2e mayday.fr-v2 --sample-size 5

Intent classifier evaluation

rose-eval intent-classifier intent-classifier

Skill selector evaluation with custom threshold

rose-eval skill-selector skill-selector --threshold 0.85

Suggestion quality evaluation

rose-eval suggestion suggestion-quality
rose-eval suggestion suggestion-quality --suggestion-type follow-up
rose-eval suggestion suggestion-quality --regenerate-model gpt-4.1-mini

Feature eval datasets

rose-eval features list                    # List all feature-* datasets
rose-eval features run content --verbose   # Run feature-content
rose-eval features run                     # Run all feature evals

Filter to specific items

rose-eval e2e mayday.fr-v2 --item "Pricing" --verbose

Disable tracing for fast iteration

rose-eval e2e mayday.fr-v2 --trace no

JSON output for CI

rose-eval e2e main-dataset --json

Datasets

Available Datasets

| Dataset | Command | Description |
| --- | --- | --- |
| main-dataset | e2e | Main evaluation dataset |
| mayday.fr-v2 | e2e | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| intent-classifier | intent-classifier | Intent classification dataset |
| skill-selector | skill-selector | Skill selection dataset |
| suggestion-quality | suggestion | Suggestion quality dataset |

Creating / Populating Datasets

Use rose-chat with the --trace --add-to-dataset flags to add conversations to a dataset:

# Add an E2E conversation to a dataset
rose-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
rose-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier

Each conversation is traced via Langfuse and the trace is added to the dataset in evaluator-compatible format ({"query": "..."} / {"response": "..."}).

See CLI Chat for full details on dataset population.


Justfile Wrapper

A backward-compatible just eval wrapper is available for common targets:

just eval e2e-api                # → rose-eval e2e main-dataset --agentic-system legacy
just eval mayday                 # → rose-eval e2e mayday.fr-v2 --agentic-system legacy
just eval intent-classifier      # → rose-eval intent-classifier intent-classifier
just eval skill-selector         # → rose-eval skill-selector skill-selector

Prefer rose-eval directly

The just eval wrapper defaults to --trace no for backward compatibility. Use rose-eval directly for full control over tracing, agentic system selection, and all other options.


Architecture

flowchart TD
    A[rose-eval] --> B[Setup Environment]
    B --> C{Command}
    C -->|e2e| D[E2EAPIEvaluator]
    C -->|intent-classifier| E[ClassificationEvaluator]
    C -->|skill-selector| F[MultiLabelEvaluator]
    C -->|suggestion| G[SuggestionEvaluator]
    C -->|features| H[FeatureEvaluator]
    D --> I[E2EAPIClient]
    I --> J[In-Process ASGI Transport]
    J --> K[FastAPI / IXChat]
    K --> L[LangGraph Chatbot]
    E --> M[IntentClassifier]
    F --> N[SkillSelectorClassifier]
    G --> O[SuggestionGenerator]
    H --> P[FeatureChecker + E2EAPIClient]
    L --> Q[Langfuse Trace]
    D --> R[LLM-as-a-Judge]
    R --> S{Score vs Threshold}
    E --> S
    F --> S
    G --> S
    H --> S
    S -->|Pass| T[Exit 0]
    S -->|Fail| U[Exit 1]

Key design decisions:

  • In-process ASGI transport: E2E evaluations run the full API stack in-process via httpx.ASGITransport, avoiding the need for a running server.
  • Agentic system override: The --agentic-system flag (e2e only) flows through the API payload as forceNewAgenticSystem. When omitted, defaults to new.
  • LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
  • Classifiers in ixevaluation: E2EAPIClient, IntentClassifier, and SkillSelectorClassifier live in the ixevaluation package alongside their base classes/protocols.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Evaluation passed (metrics above threshold) |
| 1 | Evaluation failed (metrics below threshold) |
| 2 | Error (configuration, runtime, or unknown eval type) |
| 130 | Cancelled by user (Ctrl+C) |
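A CI step can branch on these codes. A minimal sketch, with run_eval standing in for an actual rose-eval invocation (here it simulates a below-threshold run):

```shell
# Stand-in for "rose-eval e2e main-dataset --json"; returns exit code 1
# to simulate metrics falling below the threshold.
run_eval() { return 1; }

run_eval
case $? in
  0)   echo "eval passed" ;;
  1)   echo "eval failed: metrics below threshold" ;;
  2)   echo "eval error: configuration or runtime problem" ;;
  130) echo "eval cancelled" ;;
esac
# prints: eval failed: metrics below threshold
```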

Log Files

All evaluations are logged to backend/logs/:

  • eval-{timestamp}.log — Full log for each run
  • eval-latest.log — Symlink to the most recent log file
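The symlink behaviour can be pictured with a small sketch (run in a temp directory rather than backend/logs/; the timestamp value is a stand-in, and the real filename format may differ):

```shell
logdir=$(mktemp -d)
ts=20240101-120000                                # stand-in timestamp
touch "$logdir/eval-$ts.log"                      # per-run log file
ln -sf "eval-$ts.log" "$logdir/eval-latest.log"   # repoint the symlink
readlink "$logdir/eval-latest.log"                # prints: eval-20240101-120000.log
```

Because eval-latest.log is repointed on every run, `tail -f backend/logs/eval-latest.log` in another terminal always follows the most recent evaluation.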

The log file path is displayed at the start of each run.