
E2E Evaluation: History Reconstruction

Overview

The E2E evaluator tests multi-turn conversations by replaying history and scoring the bot's final response. The challenge: LangGraph nodes depend on graph state (profiling progress, interest signals, dialog markers) that is normally built incrementally across turns. Simply injecting message history isn't enough — the evaluator must also reconstruct the state that those messages would have produced.

This document explains how E2EAPIClient reconstructs full conversation context for deterministic multi-turn evaluation.

Architecture

flowchart TD
    subgraph Langfuse
        DS[Dataset Item]
        DS --> Q[query]
        DS --> H[history]
        DS --> E[expected_output]
    end
    subgraph "E2EAPIClient"
        Q --> FH[_filter_history_for_injection]
        H --> FH
        FH --> IH[_inject_history_into_session]
        IH --> M[Convert to LangChain messages]
        IH --> DSS[Extract DialogSupervisionState\nfrom emoji markers]
        IH --> PS[_extract_profiling_from_history\nfrom Q&A pairs]
        IH --> QP{Last assistant\nhas 👉 unanswered?}
        QP -->|Yes| SET[Set last_extraction_turn\n= assistant_turn_count]
        QP -->|No| SKIP[Default last_extraction_turn=-1]
        M --> UP[graph.aupdate_state]
        DSS --> UP
        PS --> UP
        SET --> UP
        SKIP --> UP
        UP --> SQ[_send_query with final query]
    end
    subgraph "LangGraph"
        SQ --> G[Graph runs with injected checkpoint]
        G --> ISD[interest_signals_detector\ncomputes scores at runtime]
        G --> EPE[early_profile_extractor\nGuard 3: last_extraction_turn check]
        ISD --> R[Response]
        EPE --> R
    end
    R --> SC[AnswerQualityMetric\nLLM-as-Judge scoring]

Why History Reconstruction Exists

In normal runtime, the graph state is built incrementally:

  1. Turn 1 → profile_extractor populates profiling_state.collected_values
  2. Turn 2 → interest_signals_detector accumulates cumulative_score
  3. Turn 3 → dialog_state_extractor tracks emoji markers in dialog_supervision_state

But in evaluation, we send only the final query through the API. All prior context must be reconstructed from the dataset's history field and injected into the LangGraph checkpoint via graph.aupdate_state().

Without reconstruction, multi-turn test items fail because:

  • profiling_state is empty → CTA stays blocked even when qualification is complete
  • interest_signals_state is never built → interest_signals_detector needs the injected history to score naturally
  • dialog_supervision_state has no emoji markers → content gating state machine doesn't advance

Dataset Item Format

Each evaluation item in Langfuse has:

{
  "input": {
    "query": "Technical"
  },
  "expected_output": {
    "response": "The bot should propose the CTA using the 👇 emoji..."
  },
  "metadata": {
    "site_name": "hexa.com",
    "history": [
      {"role": "user", "content": "I want to apply"},
      {"role": "assistant", "content": "👉 What stage is your company at?"},
      {"role": "user", "content": "I'm exploring an idea"},
      {"role": "assistant", "content": "👉 Business or technical profile?"}
    ]
  }
}
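
The evaluator reads three things from each item: the final query, the prior history, and the expected output. A minimal access sketch, assuming the item is exposed as a plain dict (the real Langfuse SDK wraps items in objects with the same field names):

# Hypothetical helper; the key layout mirrors the JSON above.
def unpack_item(item: dict) -> tuple[str, list[dict], str]:
    final_query = item["input"]["query"]
    history = item.get("metadata", {}).get("history", [])
    expected = item["expected_output"]["response"]
    return final_query, history, expected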

History Reconstruction Pipeline

Step 1: Filter History

Method: _filter_history_for_injection(history, final_query)

The dataset's history may include the turn being tested. The filter excludes it (see the sketch after this list):

  • Sorts by message_index
  • Stops before the user message matching final_query
  • Skips empty user messages (widget reconnect artifacts)
  • Removes duplicate assistant responses
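
A minimal sketch of this filtering, assuming history turns are dicts with role, content, and message_index keys (the real _filter_history_for_injection lives in ixevaluation/e2e/client.py and may differ in detail):

def filter_history_for_injection(history: list[dict], final_query: str) -> list[dict]:
    """Sketch: drop the turn under test, empty user turns, and duplicate bot turns."""
    filtered: list[dict] = []
    seen_assistant: set[str] = set()
    for turn in sorted(history, key=lambda t: t.get("message_index", 0)):
        content = turn["content"].strip()
        if turn["role"] == "user":
            if content == final_query.strip():
                break  # stop before the turn being tested
            if not content:
                continue  # widget reconnect artifact
        elif turn["role"] == "assistant":
            if content in seen_assistant:
                continue  # duplicate assistant response
            seen_assistant.add(content)
        filtered.append(turn)
    return filtered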

Step 2: Convert to LangChain Messages

Each turn is converted:

"user"       HumanMessage(content=...)
"assistant"  AIMessage(content=...)

The turn_number is set to the count of assistant messages. This is critical — skill_selector uses turn_number to compute cooldowns and multi-step flow sequencing.
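
A sketch of the conversion and turn counting, using the standard langchain_core message classes:

from langchain_core.messages import AIMessage, HumanMessage

def to_langchain_messages(history: list[dict]) -> tuple[list, int]:
    """Sketch: convert role/content dicts and count assistant turns."""
    messages = []
    assistant_turn_count = 0
    for turn in history:
        if turn["role"] == "user":
            messages.append(HumanMessage(content=turn["content"]))
        else:
            messages.append(AIMessage(content=turn["content"]))
            assistant_turn_count += 1
    return messages, assistant_turn_count  # turn_number = assistant_turn_count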

Step 3: Extract Dialog Supervision State

Scans the last assistant message for emoji markers (a sketch follows the table):

Marker   State Field                         Effect
💌       email_asked_last_turn=True          Triggers content gating Step 2 (capture email)
👇       demo_was_proposed_last_turn=True    Affects score reset and demo re-proposal
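
A sketch of the marker scan; the flag names follow the table above and DialogSupervisionState, though building the state as a plain dict is a simplification:

from langchain_core.messages import AIMessage

def extract_dialog_supervision(messages: list) -> dict:
    """Sketch: derive supervision flags from the last assistant message."""
    last_ai = next((m for m in reversed(messages) if isinstance(m, AIMessage)), None)
    state: dict = {}
    if last_ai is None:
        return state
    if "💌" in last_ai.content:
        state["email_asked_last_turn"] = True        # content gating Step 2
    if "👇" in last_ai.content:
        state["demo_was_proposed_last_turn"] = True  # demo re-proposal logic
    return state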

Turn number alignment: The evaluator calculates content_gating_ask_turn so that skill_selector sees turns_since_ask == 1 on the next turn:

num_messages = len(messages)
turn_number_at_skill_selector = num_messages // 2 + 1
emoji_turn = turn_number_at_skill_selector - 1
# skill_selector sees: turns_since_ask = turn_number - emoji_turn = 1
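
Worked through with the four-message history from the dataset item above:

# num_messages = 4  →  turn_number_at_skill_selector = 4 // 2 + 1 = 3
# emoji_turn = 3 - 1 = 2
# next turn: turns_since_ask = 3 - 2 = 1  →  content gating advances on schedule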

Step 4: Extract Profiling State

Method: _extract_profiling_from_history(messages, site_name)

Reconstructs profiling_state.collected_values from 👉 Q&A pairs, as sketched below:

  1. Load the site's resolved config via resolve_config_for_domain(site_name)
  2. Extract field_ids from resolver.qualification.profiling.forms[*].field_groups[*].fields[*]
  3. Scan messages for patterns: AIMessage with 👉 followed by HumanMessage
  4. Map each pair to the corresponding field_id in order
Assistant: "👉 What stage is your company at?"   →  field_ids[0]
User:      "I'm exploring"                       →  collected["company_stage"] = "I'm exploring"
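
A sketch of the pairing logic; loading field_ids from the resolved config (step 1-2 above) is elided:

from langchain_core.messages import AIMessage, HumanMessage

def extract_profiling(messages: list, field_ids: list[str]) -> dict:
    """Sketch: map each answered 👉 question to the next configured field_id."""
    collected: dict[str, str] = {}
    idx = 0
    for prev, curr in zip(messages, messages[1:]):
        if (isinstance(prev, AIMessage) and "👉" in prev.content
                and isinstance(curr, HumanMessage) and idx < len(field_ids)):
            collected[field_ids[idx]] = curr.content
            idx += 1
    return collected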

Step 4b: Detect Pending Qualification

When the last assistant message contains 👉 but no user answer follows in the history (i.e., the final query IS the answer), the evaluator sets:

  • profiling_state.last_extraction_turn = assistant_turn_count — so early_profile_extractor Guard 3 passes and the node fires to extract the answer from the final query

Without this, the early extractor skips (Guard 3 fails with default last_extraction_turn=-1) and the agent re-asks the qualification question.
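
A sketch of the check, reusing collected and assistant_turn_count from the sketches above (profiling_state as a plain dict is again a simplification of ProfilingState):

from langchain_core.messages import AIMessage

def pending_profiling_state(messages: list, collected: dict, assistant_turn_count: int) -> dict:
    """Sketch: let early_profile_extractor Guard 3 pass when a 👉 question is still open."""
    last = messages[-1] if messages else None
    pending = isinstance(last, AIMessage) and "👉" in last.content
    return {
        "collected_values": collected,
        "last_extraction_turn": assistant_turn_count if pending else -1,  # -1: Guard 3 fails
    }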

Step 5: What Is NOT Injected

interest_signals_state is never injected. Interest scores are computed by interest_signals_detector at runtime based on the actual conversation. Injecting them would bypass the node's scoring logic and produce false-positive CTA triggers.

The interest_signals_detector runs in parallel with other analysis nodes and evaluates the injected history + final query to produce a natural cumulative score. The CTA only fires if the score genuinely reaches the threshold.

Step 6: Apply State to Graph

All reconstructed state is written to the LangGraph checkpoint in a single call:

state_update = {
    "messages": messages,
    "turn_number": assistant_turn_count,
    "dialog_supervision_state": ...,    # if emoji markers found
    "profiling_state": ...,             # if 👉 Q&A pairs found or qualification pending
}
await graph.aupdate_state(config, state_update)

The next _send_query() call sends only the final query. The graph reads the injected checkpoint and runs as if the full conversation had occurred naturally. Nodes like interest_signals_detector run on the injected history and compute their state naturally.
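
One detail the snippet above omits is checkpoint addressing. A sketch assuming the standard LangGraph configurable/thread_id convention; session_id is a hypothetical name for the evaluator's session identifier:

config = {"configurable": {"thread_id": session_id}}  # session_id: hypothetical
await graph.aupdate_state(config, state_update)
# _send_query(final_query) then hits the API with the same session, so the
# graph resumes from this checkpoint rather than an empty one.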

State Summary

State                      Source                       Reconstructed From                     Purpose
messages                   History turns                HumanMessage / AIMessage conversion    Conversation context
turn_number                Assistant message count      Number of assistant turns in history   Multi-step flow sequencing
dialog_supervision_state   Last assistant message       💌 and 👇 emoji detection              Content gating state machine
profiling_state            👉 Q&A pairs + site config   Matching emoji pairs to field IDs      CTA blocking gate

Not injected: interest_signals_state — computed at runtime by interest_signals_detector.

Example: Full Multi-Turn Qualification Test

Dataset item: Final query = "Technical", history = 2 qualification turns

History:
  User: "I want to apply"
  Assistant: "👉 What stage is your company at?"   ← qualification Q1
  User: "I'm exploring an idea"                    ← answer to Q1
  Assistant: "👉 Business or technical profile?"    ← qualification Q2

Final query: "Technical"                            ← answer to Q2

Reconstructed state:

Field                                  Value
messages                               4 messages (2 user + 2 assistant)
turn_number                            2 (2 assistant turns)
profiling_state.collected_values       {"company_stage": "I'm exploring an idea"}
profiling_state.last_extraction_turn   2 (qualification pending: user answering Q2)

Result: early_profile_extractor fires, extracts answer from final query. interest_signals_detector scores naturally. Bot behavior depends on actual interest level.

Seeding Datasets

Use backend/scripts/seed_hexa_dataset.py as a reference for creating evaluation datasets:

cd backend && poetry run python scripts/seed_hexa_dataset.py               # seed (or re-seed) the dataset
cd backend && poetry run python scripts/seed_hexa_dataset.py --purge-only  # delete existing items without reseeding

The script creates items using LangfuseDatasetClient.create_item() with proper input, expected_output, and metadata (including history for multi-turn items).
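
A sketch of item creation using the public Langfuse SDK directly; the internal LangfuseDatasetClient.create_item() presumably wraps something like this, but its exact signature is an assumption:

from langfuse import Langfuse

langfuse = Langfuse()  # credentials from LANGFUSE_* environment variables
langfuse.create_dataset_item(
    dataset_name="hexa.com-v1",
    input={"query": "Technical"},
    expected_output={"response": "The bot should propose the CTA using the 👇 emoji..."},
    metadata={
        "site_name": "hexa.com",
        "history": [
            {"role": "user", "content": "I want to apply"},
            {"role": "assistant", "content": "👉 What stage is your company at?"},
            # ... remaining turns as in the dataset item format above
        ],
    },
)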

Running Evaluations

# Full dataset
rose-eval e2e hexa.com-v1 --agentic-system new --verbose --local-prompt

# Single item by query substring
rose-eval e2e hexa.com-v1 --item "Technical" --verbose --local-prompt --trace no

# Quick smoke test
rose-eval e2e hexa.com-v1 --sample-size 5 --trace no

Key Files

File                                                Purpose
ixevaluation/e2e/client.py                          E2EAPIClient — history reconstruction and API calls
ixevaluation/e2e/evaluator.py                       E2EAPIEvaluator — orchestrates dataset processing
ixevaluation/e2e/base.py                            Protocols, result models
ixevaluation/metrics/answer_quality.py              LLM-as-Judge scoring (correctness, completeness, flow, format)
ixchat/nodes/dialog_state_extractor.py              Runtime emoji detection (what the evaluator simulates)
ixchat/nodes/skill_selector.py                      Consumes injected state for skill selection
ixchat/utils/agent_config.py                        ConfigResolver helper functions — CTA blocking checks
ixchat/pydantic_models/profiling.py                 ProfilingState, ProfilingField
ixchat/pydantic_models/interest_signals_state.py    InterestSignalsState
ixchat/pydantic_models/dialog_supervision_state.py  DialogSupervisionState

Design Decisions

Why not replay through the API?

Replaying each history turn through the full API would be accurate but:

  • Slow: Each turn involves 3-4 LLM calls, so a 3-turn history would need roughly 9-12 LLM calls before the test query even runs.
  • Non-deterministic: LLM responses would differ from the history, causing state divergence.
  • Expensive: Token costs scale linearly with history depth.

State injection via aupdate_state() is instant and deterministic.

Why emoji-based state detection?

Emoji markers (👉, 👇, 💌) are emitted by the LLM because the skill instructions tell it to include them. They serve as a lightweight protocol between LLM output and deterministic graph nodes:

  • The LLM doesn't need to output structured JSON
  • The frontend strips them before rendering
  • The evaluator can detect them in history without running an LLM

Why field_id order matching?

The profiling extraction maps 👉 Q&A pairs to the configured field_ids by index order (first 👉 = first field, second 👉 = second field). This works because field_groups[*].fields[*] preserve config order, and the qualification rules present questions in that same order.