rose-eval (CLI)¶
The `rose-eval` CLI tool runs evaluations against Langfuse datasets. It directly invokes `ixevaluation` evaluators, with Langfuse tracing enabled by default and explicit control over all parameters.
Installation¶
The CLI tool is available after activating the Poetry shell in the backend directory:
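A minimal activation sketch, assuming a standard Poetry setup in `backend/`:

```bash
cd backend
poetry shell
```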
Once in the Poetry shell, the rose-eval command is available.
Required Environment Variables¶
Ensure these environment variables are set (typically via .env.test):
| Variable | Description |
|---|---|
| `LANGFUSE_SECRET_KEY` | Langfuse API secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse API public key |
| `LANGFUSE_HOST` | Langfuse API host URL |
| `AZURE_OPENAI_*` | Azure OpenAI configuration (for LLM calls) |
Usage¶
Commands¶
| Command | Description | Metric | Default Threshold |
|---|---|---|---|
| `e2e` | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| `intent-classifier` | Intent classification accuracy | Macro F1 | 0.80 |
| `skill-selector` | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |
| `suggestion` | Suggestion quality evaluation | Mean quality | 0.60 |
| `features` | List or run `feature-*` eval datasets | Per-dataset | Per-dataset |
Shared Options¶
These options are available on all commands:
| Option | Required | Default | Description |
|---|---|---|---|
| `dataset` | Yes | - | Langfuse dataset name (positional argument) |
| `--env` | No | `test` | Environment for credentials: `test`, `development`, `staging`, `production` |
| `--trace` | No | `yes` | Enable Langfuse tracing: `yes` or `no` |
| `--local-prompt` | No | Auto | Use local prompt files (auto: true for `test`/`development` envs) |
| `--item` | No | - | Filter items by ID prefix or query substring |
| `--threshold` | No | Per type | Override pass/fail threshold (0.0-1.0) |
| `--sample-size` | No | - | Run on N random items for quick testing |
| `--verbose` | No | `false` | Show detailed per-item results |
| `--json` | No | `false` | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| `--concurrency` | No | `3` | Number of items to process concurrently |
| `--debug` | No | `false` | Enable debug logging to console |
Command-Specific Options¶
e2e¶
| Option | Default | Description |
|---|---|---|
| `--agentic-system` | `new` | Override agentic system routing: `legacy` or `new` |
skill-selector¶
| Option | Default | Description |
|---|---|---|
| `--micro-f1-threshold` | `0.75` | Override micro-F1 threshold |
suggestion¶
| Option | Default | Description |
|---|---|---|
| `--suggestion-type` | - | Filter by type: `follow-up` or `answer` |
| `--regenerate-model` | - | Re-generate suggestions with this model before judging (e.g., `gpt-4.1-mini`) |
| `--regenerate-provider` | `openai` | Provider for re-generation: `azure`, `openai`, `google`, `cerebras`, `anthropic`, `groq`, `groq_direct` |
features¶
The features command has two sub-subcommands:
| Sub-command | Description |
|---|---|
| `list` | List all `feature-*` evaluation datasets |
| `run [name]` | Run feature evaluation(s). Omit `name` to run all. |
Examples¶
E2E evaluation (uses "new" agentic system by default)¶
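Assuming the `main-dataset` dataset from the Datasets section:

```bash
rose-eval e2e main-dataset
```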
E2E evaluation with explicit agentic system override¶
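For example, forcing `legacy` routing on the Mayday baseline:

```bash
rose-eval e2e mayday.fr-v2 --agentic-system legacy
```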
Quick smoke test on 5 random items¶
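Using `--sample-size` for a quick pass (dataset name assumed from the Datasets section):

```bash
rose-eval e2e main-dataset --sample-size 5
```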
Intent classifier evaluation¶
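Both the command and its dataset are named `intent-classifier`:

```bash
rose-eval intent-classifier intent-classifier
```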
Skill selector evaluation with custom threshold¶
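With an illustrative `--threshold` override (the 0.85 value is an example, not a recommended setting):

```bash
rose-eval skill-selector skill-selector --threshold 0.85
```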
Suggestion quality evaluation¶
```bash
rose-eval suggestion suggestion-quality
rose-eval suggestion suggestion-quality --suggestion-type follow-up
rose-eval suggestion suggestion-quality --regenerate-model gpt-4.1-mini
```
Feature eval datasets¶
```bash
rose-eval features list                  # List all feature-* datasets
rose-eval features run content --verbose # Run feature-content
rose-eval features run                   # Run all feature evals
```
Filter to specific items¶
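Using `--item` with an illustrative substring (matches by item ID prefix or query text):

```bash
rose-eval e2e main-dataset --item billing
```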
Disable tracing for fast iteration¶
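Passing `--trace no` skips Langfuse tracing, e.g.:

```bash
rose-eval e2e main-dataset --trace no
```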
JSON output for CI¶
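`--json` emits machine-readable results (and also activates automatically when stdout is piped), e.g.:

```bash
rose-eval e2e main-dataset --json > results.json
```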
Datasets¶
Available Datasets¶
| Dataset | Command | Description |
|---|---|---|
| `main-dataset` | `e2e` | Main evaluation dataset |
| `mayday.fr-v2` | `e2e` | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| `intent-classifier` | `intent-classifier` | Intent classification dataset |
| `skill-selector` | `skill-selector` | Skill selection dataset |
| `suggestion-quality` | `suggestion` | Suggestion quality dataset |
Creating / Populating Datasets¶
Use `rose-chat` with the `--trace --add-to-dataset` flags to add conversations to a dataset:

```bash
# Add an E2E conversation to a dataset
rose-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
rose-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier
```
Each conversation is traced via Langfuse, and the trace is added to the dataset in evaluator-compatible format (`{"query": "..."}` / `{"response": "..."}`).
See CLI Chat for full details on dataset population.
Justfile Wrapper¶
A backward-compatible `just eval` wrapper is available for common targets:

```bash
just eval e2e-api           # → rose-eval e2e main-dataset --agentic-system legacy
just eval mayday            # → rose-eval e2e mayday.fr-v2 --agentic-system legacy
just eval intent-classifier # → rose-eval intent-classifier intent-classifier
just eval skill-selector    # → rose-eval skill-selector skill-selector
```
**Prefer rose-eval directly.** The `just eval` wrapper defaults to `--trace no` for backward compatibility. Use `rose-eval` directly for full control over tracing, agentic system selection, and all other options.
Architecture¶
Key design decisions:
- In-process ASGI transport: E2E evaluations run the full API stack in-process via `httpx.ASGITransport`, avoiding the need for a running server.
- Agentic system override: The `--agentic-system` flag (e2e only) flows through the API payload as `forceNewAgenticSystem`. When omitted, it defaults to `new`.
- LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
- Classifiers in `ixevaluation`: `E2EAPIClient`, `IntentClassifier`, and `SkillSelectorClassifier` live in the `ixevaluation` package alongside their base classes/protocols.
Exit Codes¶
| Code | Meaning |
|---|---|
| 0 | Evaluation passed (metrics above threshold) |
| 1 | Evaluation failed (metrics below threshold) |
| 2 | Error (configuration, runtime, or unknown eval type) |
| 130 | Cancelled by user (Ctrl+C) |
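In CI, these exit codes can drive the pipeline directly. A minimal sketch of a helper that maps the documented codes to labels (the `describe_exit` name is illustrative, not part of the tool):

```bash
# Map rose-eval's documented exit codes to short labels.
describe_exit() {
  case "$1" in
    0)   echo "passed" ;;
    1)   echo "failed" ;;
    2)   echo "error" ;;
    130) echo "cancelled" ;;
    *)   echo "unknown" ;;
  esac
}

# Example CI usage (rose-eval assumed on PATH):
#   rose-eval e2e main-dataset --json > results.json
#   describe_exit $?
```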
Log Files¶
All evaluations are logged to `backend/logs/`:

- `eval-{timestamp}.log` — Full log for each run
- `eval-latest.log` — Symlink to the most recent log file
The log file path is displayed at the start of each run.
Related Documentation¶
- CLI Chat — Interactive chatbot testing and dataset population
- CLI Langfuse — Manage Langfuse prompts, datasets, and traces
- Backend Setup — Environment configuration
- IXChat Package — Chatbot and graph structure details
- Skill System — Skill definitions and registry