
CLI Eval

The cli-eval command runs evaluations against Langfuse datasets. It directly invokes ixevaluation evaluators, with Langfuse tracing enabled by default and explicit control over all parameters.

Installation

The CLI tool is available after activating the Poetry shell in the backend directory:

cd backend
poetry shell

Once in the Poetry shell, the cli-eval command is available.
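
If you prefer not to activate a shell, the same command can be invoked through poetry run (this assumes cli-eval is registered as a Poetry script entry point, which its availability inside the shell implies):

cd backend
poetry run cli-eval --type e2e --dataset mayday.fr-v2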

Required Environment Variables

Ensure these environment variables are set (typically via .env.test):

| Variable | Description |
| --- | --- |
| LANGFUSE_SECRET_KEY | Langfuse API secret key |
| LANGFUSE_PUBLIC_KEY | Langfuse API public key |
| LANGFUSE_HOST | Langfuse API host URL |
| AZURE_OPENAI_* | Azure OpenAI configuration (for LLM calls) |
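
A minimal .env.test sketch with placeholder values (the expanded AZURE_OPENAI_* names are illustrative; use whatever your Azure OpenAI setup actually requires):

# .env.test (placeholder values only)
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
# Illustrative Azure OpenAI names; the actual variables depend on your configuration
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/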

Usage

cli-eval --type <type> --dataset <name> [OPTIONS]

Evaluation Types

| Type | Description | Metric | Default Threshold |
| --- | --- | --- | --- |
| e2e | End-to-end API response accuracy (LLM-as-a-Judge) | Mean accuracy | 0.70 |
| intent-classifier | Intent classification accuracy | Macro F1 | 0.80 |
| skill-selector | Multi-label skill selection | Skill recall + Micro F1 | 0.90 (recall), 0.75 (micro F1) |

Options

| Option | Required | Default | Description |
| --- | --- | --- | --- |
| --type | Yes | - | Evaluation type: e2e, intent-classifier, skill-selector |
| --dataset | Yes | - | Langfuse dataset name |
| --agentic-system | No | Site config | Override agentic system routing: legacy or new. When omitted, uses the per-site custom_config setting. |
| --env | No | test | Environment for credentials: test, development, staging, production |
| --trace | No | yes | Enable Langfuse tracing: yes or no |
| --local-prompt | No | Auto | Use local prompt files (auto: true for test/development environments) |
| --item | No | - | Filter items by ID prefix or query substring |
| --threshold | No | Per type | Override pass/fail threshold (0.0-1.0) |
| --micro-f1-threshold | No | 0.75 | Override micro-F1 threshold for skill-selector |
| --sample-size | No | - | Run on N random items for quick testing |
| --verbose | No | false | Show detailed per-item results |
| --json | No | false | Output results as JSON (for CI; also auto-activates when stdout is piped) |
| --concurrency | No | 3 | Number of items to process concurrently |
| --debug | No | false | Enable debug logging to console |

Examples

E2E evaluation (uses site's configured agentic system)

cli-eval --type e2e --dataset mayday.fr-v2

E2E evaluation with explicit agentic system override

cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system new

Quick smoke test on 5 random items

cli-eval --type e2e --dataset mayday.fr-v2 --sample-size 5

Intent classifier evaluation

cli-eval --type intent-classifier --dataset intent-classifier

Skill selector evaluation with custom threshold

cli-eval --type skill-selector --dataset skill-selector --threshold 0.85

Filter to specific items

cli-eval --type e2e --dataset mayday.fr-v2 --item "Pricing" --verbose

Disable tracing for fast iteration

cli-eval --type e2e --dataset mayday.fr-v2 --trace no

JSON output for CI

cli-eval --type e2e --dataset main-dataset --json

Datasets

Available Datasets

| Dataset | Type | Description |
| --- | --- | --- |
| main-dataset | e2e | Main evaluation dataset |
| mayday.fr-v2 | e2e | Mayday.fr baseline (40 single-turn + 25 multi-turn) |
| intent-classifier | classification | Intent classification dataset |
| skill-selector | multi-label | Skill selection dataset |

Creating / Populating Datasets

Use cli-chat with the --trace and --add-to-dataset flags to add conversations to a dataset:

# Add an E2E conversation to a dataset
cli-chat "What does your product do?" --site mayday.fr --trace --add-to-dataset mayday.fr-v2

# Add to a specific dataset type
cli-chat "I need help" --site mayday.fr --trace --add-to-dataset intent-classifier --dataset-type intent-classifier

Each conversation is traced via Langfuse, and the trace is added to the dataset in an evaluator-compatible format ({"query": "..."} / {"response": "..."}).
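
For reference, an item in this shape can also be created directly with the Langfuse Python SDK; a minimal sketch, assuming the installed SDK version exposes create_dataset_item (cli-chat remains the intended workflow, and the field values below are illustrative):

from langfuse import Langfuse

# Reads LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY / LANGFUSE_HOST from the environment
langfuse = Langfuse()

# Dataset item in the evaluator-compatible shape described above
langfuse.create_dataset_item(
    dataset_name="mayday.fr-v2",
    input={"query": "What does your product do?"},
    expected_output={"response": "Expected baseline answer for this query."},
)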

See CLI Chat for full details on dataset population.


Justfile Wrapper

A backward-compatible just eval wrapper is available for common targets:

just eval e2e-api                # → cli-eval --type e2e --dataset main-dataset --agentic-system legacy
just eval mayday                 # → cli-eval --type e2e --dataset mayday.fr-v2 --agentic-system legacy
just eval intent-classifier      # → cli-eval --type intent-classifier --dataset intent-classifier
just eval skill-selector         # → cli-eval --type skill-selector --dataset skill-selector

Prefer cli-eval directly

The just eval wrapper defaults to --trace no for backward compatibility. Use cli-eval directly for full control over tracing, agentic system selection, and all other options.
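
For example, to run the same target as just eval e2e-api directly, with tracing left at its default (yes) and the new agentic system selected:

cli-eval --type e2e --dataset main-dataset --agentic-system new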


Architecture

flowchart TD
    A[cli-eval] --> B[Setup Environment]
    B --> C{Evaluation Type}
    C -->|e2e| D[E2EAPIEvaluator]
    C -->|intent-classifier| E[ClassificationEvaluator]
    C -->|skill-selector| F[MultiLabelEvaluator]
    D --> G[E2EAPIClient]
    G --> H[In-Process ASGI Transport]
    H --> I[FastAPI / IXChat]
    I --> J[LangGraph Chatbot]
    E --> K[IntentClassifier]
    F --> L[SkillSelectorClassifier]
    J --> M[Langfuse Trace]
    D --> N[LLM-as-a-Judge]
    N --> O{Score vs Threshold}
    E --> O
    F --> O
    O -->|Pass| P[Exit 0]
    O -->|Fail| Q[Exit 1]

Key design decisions:

  • In-process ASGI transport: E2E evaluations run the full API stack in-process via httpx.ASGITransport, avoiding the need for a running server (see the sketch after this list).
  • Agentic system override: The --agentic-system flag flows through the API payload as forceNewAgenticSystem. When omitted (None), the per-site custom_config decides the routing in graph_structure.py.
  • LLM-as-a-Judge: E2E evaluation uses an LLM to compare the actual response against the expected baseline from the dataset.
  • Classifiers in ixevaluation: E2EAPIClient, IntentClassifier, and SkillSelectorClassifier live in the ixevaluation package alongside their base classes/protocols.
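
A minimal sketch of the in-process pattern; the app import path, endpoint path, and payload fields other than forceNewAgenticSystem are illustrative rather than the project's actual names:

import asyncio

import httpx

from app.main import app  # hypothetical import path for the FastAPI application

async def call_api_in_process() -> dict:
    # httpx routes the request straight into the ASGI app; no server process is started
    transport = httpx.ASGITransport(app=app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        response = await client.post(
            "/chat",  # illustrative endpoint path
            json={
                "query": "What does your product do?",
                # Sent only when --agentic-system is given; omitted otherwise so the
                # per-site custom_config decides the routing
                "forceNewAgenticSystem": True,
            },
        )
        response.raise_for_status()
        return response.json()

result = asyncio.run(call_api_in_process())
print(result)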

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Evaluation passed (metrics above threshold) |
| 1 | Evaluation failed (metrics below threshold) |
| 2 | Error (configuration, runtime, or unknown eval type) |
| 130 | Cancelled by user (Ctrl+C) |
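
Because pass/fail is encoded in the exit code, a CI step can gate on the command directly; an illustrative shell snippet:

if cli-eval --type e2e --dataset main-dataset --json > eval-results.json; then
    echo "Evaluation passed"
else
    echo "Evaluation failed or errored (exit code $?)"
fi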

Log Files

All evaluations are logged to backend/logs/:

  • eval-{timestamp}.log — Full log for each run
  • eval-latest.log — Symlink to the most recent log file

The log file path is displayed at the start of each run.
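
To follow a run while it executes, tail the symlink:

tail -f backend/logs/eval-latest.log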