# CLI Eval

The `cli-eval` command runs evaluations against Langfuse datasets. It directly invokes the `ixevaluation` evaluators, with Langfuse tracing enabled by default and explicit control over all parameters.
## Installation
The CLI tool is available after activating the Poetry shell in the backend directory:
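```bash
# Typical activation; exact commands may differ depending on your Poetry setup
cd backend
poetry shell
```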
Once in the Poetry shell, the `cli-eval` command is available.
## Required Environment Variables

Ensure these environment variables are set (typically via `.env.test`):

| Variable | Description |
|---|---|
| `LANGFUSE_SECRET_KEY` | Langfuse API secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse API public key |
| `LANGFUSE_HOST` | Langfuse API host URL |
| `AZURE_OPENAI_*` | Azure OpenAI configuration (for LLM calls) |
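A minimal `.env.test` sketch. The Langfuse values are placeholders, and the specific `AZURE_OPENAI_*` variable names shown here are illustrative; use whichever your Azure OpenAI configuration requires:

```bash
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
# Illustrative Azure OpenAI variables; exact names depend on your configuration
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
```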
## Usage
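At minimum, an evaluation needs a type and a dataset; all other options listed below are optional:

```bash
cli-eval --type <e2e|intent-classifier|skill-selector> --dataset <dataset-name> [options]
```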
### Evaluation Types

| Type | Description | Metric | Default Threshold |
|---|---|---|---|
| `e2e` | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| `intent-classifier` | Intent classification accuracy | Macro F1 | 0.80 |
| `skill-selector` | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |
### Options

| Option | Required | Default | Description |
|---|---|---|---|
| `--type` | Yes | - | Evaluation type: `e2e`, `intent-classifier`, `skill-selector` |
| `--dataset` | Yes | - | Langfuse dataset name |
| `--agentic-system` | No | Site config | Override agentic system routing: `legacy` or `new`. When omitted, uses the per-site `custom_config` setting. |
| `--env` | No | `test` | Environment for credentials: `test`, `development`, `staging`, `production` |
| `--trace` | No | `yes` | Enable Langfuse tracing: `yes` or `no` |
| `--local-prompt` | No | Auto | Use local prompt files (auto: true for `test`/`development` envs) |
| `--item` | No | - | Filter items by ID prefix or query substring |
| `--threshold` | No | Per type | Override pass/fail threshold (0.0-1.0) |
| `--micro-f1-threshold` | No | `0.75` | Override micro-F1 threshold for `skill-selector` |
| `--sample-size` | No | - | Run on N random items for quick testing |
| `--verbose` | No | `false` | Show detailed per-item results |
| `--json` | No | `false` | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| `--concurrency` | No | `3` | Number of items to process concurrently |
| `--debug` | No | `false` | Enable debug logging to console |
### Examples
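The commands below are illustrative sketches built from the options above; adjust dataset names, thresholds, and filter values to your setup.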
#### E2E evaluation (uses site's configured agentic system)
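```bash
cli-eval --type e2e --dataset mayday.fr-v2
```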
#### E2E evaluation with explicit agentic system override
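```bash
cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system new
```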
#### Quick smoke test on 5 random items
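```bash
cli-eval --type e2e --dataset mayday.fr-v2 --sample-size 5
```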
#### Intent classifier evaluation
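```bash
cli-eval --type intent-classifier --dataset intent-classifier
```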
#### Skill selector evaluation with custom threshold
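```bash
# The threshold values here are illustrative overrides of the defaults (0.90 recall, 0.75 micro F1)
cli-eval --type skill-selector --dataset skill-selector --threshold 0.85 --micro-f1-threshold 0.80
```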
#### Filter to specific items
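```bash
# "refund" is an illustrative filter; any item ID prefix or query substring works
cli-eval --type e2e --dataset mayday.fr-v2 --item "refund"
```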
#### Disable tracing for fast iteration
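```bash
cli-eval --type e2e --dataset mayday.fr-v2 --trace no
```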
#### JSON output for CI
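```bash
cli-eval --type e2e --dataset main-dataset --json
```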
## Datasets

### Available Datasets

| Dataset | Type | Description |
|---|---|---|
| `main-dataset` | e2e | Main evaluation dataset |
| `mayday.fr-v2` | e2e | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| `intent-classifier` | classification | Intent classification dataset |
| `skill-selector` | multi-label | Skill selection dataset |
### Creating / Populating Datasets

Use `cli-chat` with the `--trace --add-to-dataset` flags to add conversations to a dataset:

```bash
# Add an E2E conversation to a dataset
cli-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
cli-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier
```

Each conversation is traced via Langfuse and the trace is added to the dataset in evaluator-compatible format (`{"query": "..."}` / `{"response": "..."}`).
See CLI Chat for full details on dataset population.
## Justfile Wrapper

A backward-compatible `just eval` wrapper is available for common targets:

```bash
just eval e2e-api           # → cli-eval --type e2e --dataset main-dataset --agentic-system legacy
just eval mayday            # → cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system legacy
just eval intent-classifier # → cli-eval --type intent-classifier --dataset intent-classifier
just eval skill-selector    # → cli-eval --type skill-selector --dataset skill-selector
```

**Prefer `cli-eval` directly.** The `just eval` wrapper defaults to `--trace no` for backward compatibility. Use `cli-eval` directly for full control over tracing, agentic system selection, and all other options.
## Architecture
Key design decisions:
- In-process ASGI transport: E2E evaluations run the full API stack in-process via `httpx.ASGITransport`, avoiding the need for a running server.
- Agentic system override: The `--agentic-system` flag flows through the API payload as `forceNewAgenticSystem`. When omitted (`None`), the per-site `custom_config` decides the routing in `graph_structure.py`.
- LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
- Classifiers in `ixevaluation`: `E2EAPIClient`, `IntentClassifier`, and `SkillSelectorClassifier` live in the `ixevaluation` package alongside their base classes/protocols.
## Exit Codes

| Code | Meaning |
|---|---|
| `0` | Evaluation passed (metrics above threshold) |
| `1` | Evaluation failed (metrics below threshold) |
| `2` | Error (configuration, runtime, or unknown eval type) |
| `130` | Cancelled by user (Ctrl+C) |
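Because the exit code encodes the overall result, a CI job can gate on it directly. A minimal sketch (the report file name is illustrative):

```bash
# Run the evaluation, keep the JSON report, and propagate a failure to CI
cli-eval --type e2e --dataset main-dataset --json > eval-results.json
status=$?
if [ "$status" -ne 0 ]; then
  echo "Evaluation did not pass (exit code $status)"
  exit "$status"
fi
```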
## Log Files

All evaluations are logged to `backend/logs/`:

- `eval-{timestamp}.log` — Full log for each run
- `eval-latest.log` — Symlink to the most recent log file
The log file path is displayed at the start of each run.
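To follow a run as it executes, for example:

```bash
tail -f backend/logs/eval-latest.log
```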
## Related Documentation
- CLI Chat — Interactive chatbot testing and dataset population
- CLI Langfuse — Manage Langfuse prompts, datasets, and traces
- Backend Setup — Environment configuration
- IXChat Package — Chatbot and graph structure details
- Skill System — Skill definitions and registry