rose-eval (CLI)

The rose-eval CLI tool runs evaluations against Langfuse datasets. It invokes the ixevaluation evaluators directly, with Langfuse tracing enabled by default and explicit control over all parameters.

Installation

The CLI tool is available after activating the Poetry shell in the backend directory:

cd backend
poetry shell

Once in the Poetry shell, the rose-eval command is available.

Required Environment Variables

Ensure these environment variables are set (typically via .env.test):

| Variable | Description |
| --- | --- |
| LANGFUSE_SECRET_KEY | Langfuse API secret key |
| LANGFUSE_PUBLIC_KEY | Langfuse API public key |
| LANGFUSE_HOST | Langfuse API host URL |
| AZURE_OPENAI_* | Azure OpenAI configuration (for LLM calls) |
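Before a run, you can sanity-check that the Langfuse variables are present. A minimal sketch (check_env is a helper written for this page, not part of rose-eval; the deployment-specific AZURE_OPENAI_* settings are not checked):

```shell
# Report any unset Langfuse variable; returns non-zero if one is missing.
check_env() {
  rc=0
  for var in LANGFUSE_SECRET_KEY LANGFUSE_PUBLIC_KEY LANGFUSE_HOST; do
    eval "val=\${$var:-}"          # indirect lookup of the variable's value
    if [ -z "$val" ]; then
      echo "missing: $var"
      rc=1
    fi
  done
  return $rc
}

check_env || echo "load .env.test before running rose-eval"
```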

Usage

rose-eval <command> <dataset> [OPTIONS]

Commands

| Command | Description | Metric | Default Threshold |
| --- | --- | --- | --- |
| e2e | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| intent-classifier | Intent classification accuracy | Macro F1 | 0.80 |
| skill-selector | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |
| suggestion | Suggestion quality evaluation | Mean quality | 0.60 |
| features | List or run feature-* eval datasets | Per-dataset | Per-dataset |

Shared Options

These options are available on all commands:

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| dataset | Yes | - | Langfuse dataset name (positional argument) |
| --env | No | test | Environment for credentials: test, development, staging, production |
| --trace | No | yes | Enable Langfuse tracing: yes or no |
| --local-prompt | No | Auto | Use local prompt files (auto: true for test/development envs) |
| --item | No | - | Filter items by ID prefix or query substring |
| --threshold | No | Per type | Override pass/fail threshold (0.0-1.0) |
| --sample-size | No | - | Run on N random items for quick testing |
| --verbose | No | false | Show detailed per-item results |
| --json | No | false | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| --concurrency | No | 3 | Number of items to process concurrently |
| --debug | No | false | Enable debug logging to console |

Command-Specific Options

e2e

| Option | Default | Description |
| --- | --- | --- |
| --agentic-system | new | Override agentic system routing: legacy or new |

skill-selector

| Option | Default | Description |
| --- | --- | --- |
| --micro-f1-threshold | 0.75 | Override micro-F1 threshold |

suggestion

| Option | Default | Description |
| --- | --- | --- |
| --suggestion-type | - | Filter by type: follow-up or answer |
| --regenerate-model | - | Re-generate suggestions with this model before judging (e.g., gpt-4.1-mini) |
| --regenerate-provider | openai | Provider for re-generation: azure, openai, google, cerebras, anthropic, groq, groq_direct |

features

The features command has two sub-subcommands:

| Sub-command | Description |
| --- | --- |
| list | List all feature-* evaluation datasets |
| run [name] | Run feature evaluation(s). Omit name to run all. |

Examples

E2E evaluation (uses "new" agentic system by default)

rose-eval e2e mayday.fr-v2

E2E evaluation with explicit agentic system override

rose-eval e2e mayday.fr-v2 --agentic-system legacy

Quick smoke test on 5 random items

rose-eval e2e mayday.fr-v2 --sample-size 5

Intent classifier evaluation

rose-eval intent-classifier intent-classifier

Skill selector evaluation with custom threshold

rose-eval skill-selector skill-selector --threshold 0.85

Suggestion quality evaluation

rose-eval suggestion suggestion-quality
rose-eval suggestion suggestion-quality --suggestion-type follow-up
rose-eval suggestion suggestion-quality --regenerate-model gpt-4.1-mini

Feature eval datasets

rose-eval features list                    # List all feature-* datasets
rose-eval features run content --verbose   # Run feature-content
rose-eval features run                     # Run all feature evals

Filter to specific items

rose-eval e2e mayday.fr-v2 --item "Pricing" --verbose

Disable tracing for fast iteration

rose-eval e2e mayday.fr-v2 --trace no

JSON output for CI

rose-eval e2e main-dataset --json

Datasets

Available Datasets

| Dataset | Command | Description |
| --- | --- | --- |
| main-dataset | e2e | Main evaluation dataset |
| mayday.fr-v2 | e2e | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| intent-classifier | intent-classifier | Intent classification dataset |
| skill-selector | skill-selector | Skill selection dataset |
| suggestion-quality | suggestion | Suggestion quality dataset |

Creating / Populating Datasets

Use rose-chat with the --trace --add-to-dataset flags to add conversations to a dataset:

# Add an E2E conversation to a dataset
rose-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
rose-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier

Each conversation is traced via Langfuse and the trace is added to the dataset in evaluator-compatible format ({"query": "..."} / {"response": "..."}).

See CLI Chat for full details on dataset population.


Justfile Wrapper

A backward-compatible just eval wrapper is available for common targets:

just eval e2e-api                # → rose-eval e2e main-dataset --agentic-system legacy
just eval mayday                 # → rose-eval e2e mayday.fr-v2 --agentic-system legacy
just eval intent-classifier      # → rose-eval intent-classifier intent-classifier
just eval skill-selector         # → rose-eval skill-selector skill-selector

Prefer rose-eval directly

The just eval wrapper defaults to --trace no for backward compatibility. Use rose-eval directly for full control over tracing, agentic system selection, and all other options.


Architecture

flowchart TD
    A[rose-eval] --> B[Setup Environment]
    B --> C{Command}
    C -->|e2e| D[E2EAPIEvaluator]
    C -->|intent-classifier| E[ClassificationEvaluator]
    C -->|skill-selector| F[MultiLabelEvaluator]
    C -->|suggestion| G[SuggestionEvaluator]
    C -->|features| H[FeatureEvaluator]
    D --> I[E2EAPIClient]
    I --> J[In-Process ASGI Transport]
    J --> K[FastAPI / IXChat]
    K --> L[LangGraph Chatbot]
    E --> M[IntentClassifier]
    F --> N[SkillSelectorClassifier]
    G --> O[SuggestionGenerator]
    H --> P[FeatureChecker + E2EAPIClient]
    L --> Q[Langfuse Trace]
    D --> R[LLM-as-a-Judge]
    R --> S{Score vs Threshold}
    E --> S
    F --> S
    G --> S
    H --> S
    S -->|Pass| T[Exit 0]
    S -->|Fail| U[Exit 1]

Key design decisions:

  • In-process ASGI transport: E2E evaluations run the full API stack in-process via httpx.ASGITransport, avoiding the need for a running server.
  • Agentic system override: The --agentic-system flag (e2e only) flows through the API payload as forceNewAgenticSystem. When omitted, defaults to new.
  • LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
  • Classifiers in ixevaluation: E2EAPIClient, IntentClassifier, and SkillSelectorClassifier live in the ixevaluation package alongside their base classes/protocols.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Evaluation passed (metrics above threshold) |
| 1 | Evaluation failed (metrics below threshold) |
| 2 | Error (configuration, runtime, or unknown eval type) |
| 130 | Cancelled by user (Ctrl+C) |
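A CI step can branch on these codes. A minimal sketch, with run_eval standing in for an actual rose-eval invocation (here it simulates a below-threshold run):

```shell
# Stand-in for "rose-eval e2e main-dataset --json"; returns exit code 1
# to simulate metrics falling below the threshold.
run_eval() { return 1; }

run_eval
case $? in
  0)   echo "eval passed" ;;
  1)   echo "eval failed: metrics below threshold" ;;
  2)   echo "eval error: configuration or runtime problem" ;;
  130) echo "eval cancelled" ;;
esac
# prints: eval failed: metrics below threshold
```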

Log Files

All evaluations are logged to backend/logs/:

  • eval-{timestamp}.log — Full log for each run
  • eval-latest.log — Symlink to the most recent log file
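The symlink behaviour can be pictured with a small sketch (run in a temp directory rather than backend/logs/; the timestamp value is a stand-in, and the real filename format may differ):

```shell
logdir=$(mktemp -d)
ts=20240101-120000                                # stand-in timestamp
touch "$logdir/eval-$ts.log"                      # per-run log file
ln -sf "eval-$ts.log" "$logdir/eval-latest.log"   # repoint the symlink
readlink "$logdir/eval-latest.log"                # prints: eval-20240101-120000.log
```

Because eval-latest.log is repointed on every run, `tail -f backend/logs/eval-latest.log` in another terminal always follows the most recent evaluation.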

The log file path is displayed at the start of each run.