# CLI Eval

The `cli-eval` command runs evaluations against Langfuse datasets. It directly invokes the `ixevaluation` evaluators, with Langfuse tracing enabled by default and explicit control over all parameters.
## Installation
The CLI tool is available after activating the Poetry shell in the backend directory:
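```bash
# Typical activation; exact commands may differ depending on your Poetry setup
cd backend
poetry shell
```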
Once in the Poetry shell, the `cli-eval` command is available.
## Required Environment Variables

Ensure these environment variables are set (typically via `.env.test`):

| Variable | Description |
|---|---|
| `LANGFUSE_SECRET_KEY` | Langfuse API secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse API public key |
| `LANGFUSE_HOST` | Langfuse API host URL |
| `AZURE_OPENAI_*` | Azure OpenAI configuration (for LLM calls) |
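A minimal `.env.test` sketch. The Langfuse values are placeholders, and the specific `AZURE_OPENAI_*` variable names shown here are illustrative; use whichever your Azure OpenAI configuration requires:

```bash
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
# Illustrative Azure OpenAI variables; exact names depend on your configuration
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
```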
## Usage
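At minimum, an evaluation needs a type and a dataset; all other options listed below are optional:

```bash
cli-eval --type <e2e|intent-classifier|skill-selector> --dataset <dataset-name> [options]
```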
### Evaluation Types

| Type | Description | Metric | Default Threshold |
|---|---|---|---|
| `e2e` | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| `intent-classifier` | Intent classification accuracy | Macro F1 | 0.80 |
| `skill-selector` | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |
### Options

| Option | Required | Default | Description |
|---|---|---|---|
| `--type` | Yes | - | Evaluation type: `e2e`, `intent-classifier`, `skill-selector` |
| `--dataset` | Yes | - | Langfuse dataset name |
| `--agentic-system` | No | Site config | Override agentic system routing: `legacy` or `new`. When omitted, uses the per-site `custom_config` setting. |
| `--env` | No | `test` | Environment for credentials: `test`, `development`, `staging`, `production` |
| `--trace` | No | `yes` | Enable Langfuse tracing: `yes` or `no` |
| `--local-prompt` | No | Auto | Use local prompt files (auto: true for `test`/`development` envs) |
| `--item` | No | - | Filter items by ID prefix or query substring |
| `--threshold` | No | Per type | Override pass/fail threshold (0.0-1.0) |
| `--micro-f1-threshold` | No | `0.75` | Override micro-F1 threshold for `skill-selector` |
| `--sample-size` | No | - | Run on N random items for quick testing |
| `--verbose` | No | `false` | Show detailed per-item results |
| `--json` | No | `false` | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| `--concurrency` | No | `3` | Number of items to process concurrently |
| `--debug` | No | `false` | Enable debug logging to console |
### Examples
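The commands below are illustrative sketches built from the options above; adjust dataset names, thresholds, and filter values to your setup.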
#### E2E evaluation (uses site's configured agentic system)
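```bash
cli-eval --type e2e --dataset mayday.fr-v2
```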
#### E2E evaluation with explicit agentic system override
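```bash
cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system new
```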
#### Quick smoke test on 5 random items
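```bash
cli-eval --type e2e --dataset mayday.fr-v2 --sample-size 5
```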
#### Intent classifier evaluation
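```bash
cli-eval --type intent-classifier --dataset intent-classifier
```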
#### Skill selector evaluation with custom threshold
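```bash
# The threshold values here are illustrative overrides of the defaults (0.90 recall, 0.75 micro F1)
cli-eval --type skill-selector --dataset skill-selector --threshold 0.85 --micro-f1-threshold 0.80
```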
#### Filter to specific items
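```bash
# "refund" is an illustrative filter; any item ID prefix or query substring works
cli-eval --type e2e --dataset mayday.fr-v2 --item "refund"
```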
#### Disable tracing for fast iteration
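```bash
cli-eval --type e2e --dataset mayday.fr-v2 --trace no
```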
#### JSON output for CI
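```bash
cli-eval --type e2e --dataset main-dataset --json
```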
## Datasets

### Available Datasets

| Dataset | Type | Description |
|---|---|---|
| `main-dataset` | e2e | Main evaluation dataset |
| `mayday.fr-v2` | e2e | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| `intent-classifier` | classification | Intent classification dataset |
| `skill-selector` | multi-label | Skill selection dataset |
### Creating / Populating Datasets

Use `cli-chat` with the `--trace --add-to-dataset` flags to add conversations to a dataset:

```bash
# Add an E2E conversation to a dataset
cli-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
cli-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier
```

Each conversation is traced via Langfuse and the trace is added to the dataset in evaluator-compatible format (`{"query": "..."}` / `{"response": "..."}`).
See CLI Chat for full details on dataset population.
## Justfile Wrapper

A backward-compatible `just eval` wrapper is available for common targets:

```bash
just eval e2e-api           # → cli-eval --type e2e --dataset main-dataset --agentic-system legacy
just eval mayday            # → cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system legacy
just eval intent-classifier # → cli-eval --type intent-classifier --dataset intent-classifier
just eval skill-selector    # → cli-eval --type skill-selector --dataset skill-selector
```

**Prefer `cli-eval` directly.** The `just eval` wrapper defaults to `--trace no` for backward compatibility. Use `cli-eval` directly for full control over tracing, agentic system selection, and all other options.
## Architecture
Key design decisions:
- In-process ASGI transport: E2E evaluations run the full API stack in-process via `httpx.ASGITransport`, avoiding the need for a running server.
- Agentic system override: The `--agentic-system` flag flows through the API payload as `forceNewAgenticSystem`. When omitted (`None`), the per-site `custom_config` decides the routing in `graph_structure.py`.
- LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
- Classifiers in `ixevaluation`: `E2EAPIClient`, `IntentClassifier`, and `SkillSelectorClassifier` live in the `ixevaluation` package alongside their base classes/protocols.
## Exit Codes

| Code | Meaning |
|---|---|
| `0` | Evaluation passed (metrics above threshold) |
| `1` | Evaluation failed (metrics below threshold) |
| `2` | Error (configuration, runtime, or unknown eval type) |
| `130` | Cancelled by user (Ctrl+C) |
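Because the exit code encodes the overall result, a CI job can gate on it directly. A minimal sketch (the report file name is illustrative):

```bash
# Run the evaluation, keep the JSON report, and propagate a failure to CI
cli-eval --type e2e --dataset main-dataset --json > eval-results.json
status=$?
if [ "$status" -ne 0 ]; then
  echo "Evaluation did not pass (exit code $status)"
  exit "$status"
fi
```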
## Log Files

All evaluations are logged to `backend/logs/`:

- `eval-{timestamp}.log` — Full log for each run
- `eval-latest.log` — Symlink to the most recent log file
The log file path is displayed at the start of each run.
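To follow a run as it executes, for example:

```bash
tail -f backend/logs/eval-latest.log
```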
## Related Documentation
- CLI Chat — Interactive chatbot testing and dataset population
- CLI Langfuse — Manage Langfuse prompts, datasets, and traces
- Backend Setup — Environment configuration
- IXChat Package — Chatbot and graph structure details
- Skill System — Skill definitions and registry