
E2E Evaluation: History Reconstruction

Overview

The E2E evaluator tests multi-turn conversations by replaying history and scoring the bot's final response. The challenge: LangGraph nodes depend on graph state (profiling progress, interest signals, dialog markers) that is normally built incrementally across turns. Simply injecting message history isn't enough — the evaluator must also reconstruct the state that those messages would have produced.

This document explains how E2EAPIClient reconstructs full conversation context for deterministic multi-turn evaluation.

Architecture

flowchart TD
    subgraph Langfuse
        DS[Dataset Item]
        DS --> Q[query]
        DS --> H[history]
        DS --> E[expected_output]
    end
    subgraph "E2EAPIClient"
        Q --> FH[_filter_history_for_injection]
        H --> FH
        FH --> IH[_inject_history_into_session]
        IH --> M[Convert to LangChain messages]
        IH --> DSS[Extract DialogSupervisionState\nfrom emoji markers]
        IH --> PS[_extract_profiling_from_history\nfrom Q&A pairs]
        IH --> IS[Inject InterestSignalsState\nif qualification complete]
        M --> UP[graph.aupdate_state]
        DSS --> UP
        PS --> UP
        IS --> UP
        UP --> SQ[_send_query with final query]
    end
    subgraph "LangGraph"
        SQ --> G[Graph runs with injected checkpoint]
        G --> R[Response]
    end
    R --> SC[AnswerQualityMetric\nLLM-as-Judge scoring]

Why History Reconstruction Exists

In normal runtime, the graph state is built incrementally:

  1. Turn 1 → profile_extractor populates profiling_state.collected_values
  2. Turn 2 → interest_signals_detector accumulates cumulative_score
  3. Turn 3 → dialog_state_extractor tracks emoji markers in dialog_supervision_state

But in evaluation, we send only the final query through the API. All prior context must be reconstructed from the dataset's history field and injected into the LangGraph checkpoint via graph.aupdate_state().

Without reconstruction, multi-turn test items fail because:

  • profiling_state is empty → CTA stays blocked even when qualification is complete
  • interest_signals_state has cumulative_score=0 and should_propose_demo=False → no CTA
  • dialog_supervision_state has no emoji markers → content gating state machine doesn't advance

Dataset Item Format

Each evaluation item in Langfuse has:

{
  "input": {
    "query": "Technical"
  },
  "expected_output": {
    "response": "The bot should propose the CTA using the 👇 emoji..."
  },
  "metadata": {
    "site_name": "hexa.com",
    "history": [
      {"role": "user", "content": "I want to apply"},
      {"role": "assistant", "content": "👉 What stage is your company at?"},
      {"role": "user", "content": "I'm exploring an idea"},
      {"role": "assistant", "content": "👉 Business or technical profile?"}
    ]
  }
}

History Reconstruction Pipeline

Step 1: Filter History

Method: _filter_history_for_injection(history, final_query)

The dataset's history may include the turn being tested. The filter excludes it:

  • Sorts by message_index
  • Stops before the user message matching final_query
  • Skips empty user messages (widget reconnect artifacts)
  • Removes duplicate assistant responses
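The filtering rules above can be sketched as follows. This is a minimal illustration, not the real `_filter_history_for_injection` (which lives in `ixevaluation/e2e/client.py`); the function name and the `message_index` key mirror the document's description.

```python
def filter_history_for_injection(history: list[dict], final_query: str) -> list[dict]:
    """Sketch: drop the turn under test, empty user messages, and duplicate assistant replies."""
    filtered: list[dict] = []
    seen_assistant: set[str] = set()
    # Sort by message_index when present; sorted() is stable, so items
    # without an index keep their original order.
    ordered = sorted(history, key=lambda m: m.get("message_index", 0))
    for msg in ordered:
        content = (msg.get("content") or "").strip()
        if msg["role"] == "user":
            # Stop before the user message matching the final query.
            if content == final_query.strip():
                break
            # Skip empty user messages (widget reconnect artifacts).
            if not content:
                continue
        elif msg["role"] == "assistant":
            # Remove duplicate assistant responses.
            if content in seen_assistant:
                continue
            seen_assistant.add(content)
        filtered.append(msg)
    return filtered
```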

Step 2: Convert to LangChain Messages

Each turn is converted:

"user"       HumanMessage(content=...)
"assistant"  AIMessage(content=...)

The turn_number is set to the count of assistant messages. This is critical — skill_selector uses turn_number to compute cooldowns and multi-step flow sequencing.
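A sketch of this conversion step, including the assistant-message count that becomes turn_number. The dataclasses below are stand-ins for langchain_core.messages.HumanMessage / AIMessage so the snippet runs without LangChain installed; the real client uses the actual classes.

```python
from dataclasses import dataclass


@dataclass
class HumanMessage:  # stand-in for langchain_core.messages.HumanMessage
    content: str


@dataclass
class AIMessage:  # stand-in for langchain_core.messages.AIMessage
    content: str


def to_langchain_messages(history: list[dict]) -> tuple[list, int]:
    """Convert filtered history turns to message objects and count assistant turns."""
    messages: list = []
    assistant_turns = 0
    for msg in history:
        if msg["role"] == "user":
            messages.append(HumanMessage(content=msg["content"]))
        elif msg["role"] == "assistant":
            messages.append(AIMessage(content=msg["content"]))
            assistant_turns += 1
    # turn_number = count of assistant messages, consumed by skill_selector.
    return messages, assistant_turns
```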

Step 3: Extract Dialog Supervision State

Scans the last assistant message for emoji markers:

| Marker | State Field | Effect |
|--------|-------------|--------|
| 💌 | email_asked_last_turn=True | Triggers content gating Step 2 (capture email) |
| 👇 | demo_was_proposed_last_turn=True | Affects score reset and demo re-proposal |
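The scan itself is a simple substring check over the last assistant message. A minimal sketch, returning plain booleans rather than the real DialogSupervisionState model (ixchat/pydantic_models/dialog_supervision_state.py):

```python
def extract_dialog_markers(last_assistant_content: str) -> dict:
    """Detect content-gating emoji markers in the last assistant message."""
    return {
        "email_asked_last_turn": "💌" in last_assistant_content,
        "demo_was_proposed_last_turn": "👇" in last_assistant_content,
    }
```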

Turn number alignment: The evaluator calculates content_gating_ask_turn so that skill_selector sees turns_since_ask == 1 on the next turn:

num_messages = len(messages)
turn_number_at_skill_selector = num_messages // 2 + 1
emoji_turn = turn_number_at_skill_selector - 1
# skill_selector sees: turns_since_ask = turn_number - emoji_turn = 1

Step 4: Extract Profiling State

Method: _extract_profiling_from_history(messages, site_name)

Reconstructs profiling_state.collected_values from 👉 Q&A pairs:

  1. Load the site's profiling config via AgentConfigResolver.for_site(site_name)
  2. Extract field_ids from config.qualification.profiling.forms[*].fields[*]
  3. Scan messages for patterns: AIMessage with 👉 followed by HumanMessage
  4. Map each pair to the corresponding field_id in order
Assistant: "👉 What stage is your company at?"   →  field_ids[0]
User:      "I'm exploring"                       →  collected["what_stage_is_your_company_at"] = "I'm exploring"
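The pairing logic can be sketched like this. It is an illustration of the index-order mapping, not the real `_extract_profiling_from_history`; here `field_ids` is passed in as a plain list, whereas the real code resolves it from the site config via AgentConfigResolver.

```python
def extract_profiling(messages: list[dict], field_ids: list[str]) -> dict:
    """Map each 👉 assistant question + user answer pair to the next field_id in config order."""
    collected: dict[str, str] = {}
    pair_index = 0
    for i, msg in enumerate(messages):
        is_question = msg["role"] == "assistant" and "👉" in msg["content"]
        has_answer = i + 1 < len(messages) and messages[i + 1]["role"] == "user"
        if is_question and has_answer and pair_index < len(field_ids):
            collected[field_ids[pair_index]] = messages[i + 1]["content"]
            pair_index += 1
    return collected
```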

Step 5: Inject Interest Signals State

After profiling extraction, the evaluator checks whether all CTA-blocking fields are now collected:

if not resolver.has_unresolved_cta_blocking_fields(profiling_state):
    state_update["interest_signals_state"] = InterestSignalsState(
        cumulative_score=resolver.interest_threshold,
        should_propose_demo=True,
        should_propose_demo_phrase=DEMO_PROPOSE_PHRASE,
        interest_threshold=resolver.interest_threshold,
    )

This ensures demo_offer triggers immediately after qualification completes. Without this, should_propose_demo defaults to False and the bot never proposes the CTA.

Step 6: Apply State to Graph

All reconstructed state is written to the LangGraph checkpoint in a single call:

state_update = {
    "messages": messages,
    "turn_number": assistant_turn_count,
    "dialog_supervision_state": ...,    # if emoji markers found
    "profiling_state": ...,             # if 👉 Q&A pairs found
    "interest_signals_state": ...,      # if qualification complete
}
await graph.aupdate_state(config, state_update)

The next _send_query() call sends only the final query. The graph reads the injected checkpoint and runs as if the full conversation had occurred naturally.

State Summary

| State | Source | Reconstructed From | Purpose |
|-------|--------|--------------------|---------|
| messages | History turns | HumanMessage / AIMessage conversion | Conversation context |
| turn_number | Assistant message count | Number of assistant turns in history | Multi-step flow sequencing |
| dialog_supervision_state | Last assistant message | 💌 and 👇 emoji detection | Content gating state machine |
| profiling_state | 👉 Q&A pairs + site config | Matching emoji pairs to field IDs | CTA blocking gate |
| interest_signals_state | Qualification completeness check | Set to threshold when all blocks_cta fields collected | Demo offer trigger |

Example: Full Multi-Turn Qualification Test

Dataset item: Final query = "Technical", history = 2 qualification turns

History:
  User: "I want to apply"
  Assistant: "👉 What stage is your company at?"   ← qualification Q1
  User: "I'm exploring an idea"                    ← answer to Q1
  Assistant: "👉 Business or technical profile?"    ← qualification Q2

Final query: "Technical"                            ← answer to Q2

Reconstructed state:

| Field | Value |
|-------|-------|
| messages | 4 messages (2 user + 2 assistant) |
| turn_number | 2 (2 assistant turns) |
| profiling_state.collected_values | {"what_stage_is_your_company_at": "I'm exploring an idea", "are_you_a_business_or_technical_profile": ...} |
| interest_signals_state.should_propose_demo | True (all blocks_cta fields collected) |
| interest_signals_state.cumulative_score | 7 (= threshold) |

Result: Bot proposes CTA with 👇 → evaluator scores 1.0

Seeding Datasets

Use backend/scripts/seed_hexa_dataset.py as a reference for creating evaluation datasets:

# Seed (or re-seed) the dataset
cd backend && poetry run python scripts/seed_hexa_dataset.py

# Purge existing items without re-seeding
cd backend && poetry run python scripts/seed_hexa_dataset.py --purge-only

The script creates items using LangfuseDatasetClient.create_item() with proper input, expected_output, and metadata (including history for multi-turn items).

Running Evaluations

# Full dataset
rose-eval --type e2e --dataset hexa.com-v1 --agentic-system new --verbose --local-prompt

# Single item by query substring
rose-eval --type e2e --dataset hexa.com-v1 --item "Technical" --verbose --local-prompt --trace no

# Quick smoke test
rose-eval --type e2e --dataset hexa.com-v1 --sample-size 5 --trace no

Key Files

| File | Purpose |
|------|---------|
| ixevaluation/e2e/client.py | E2EAPIClient — history reconstruction and API calls |
| ixevaluation/e2e/evaluator.py | E2EAPIEvaluator — orchestrates dataset processing |
| ixevaluation/e2e/base.py | Protocols, result models |
| ixevaluation/metrics/answer_quality.py | LLM-as-Judge scoring (correctness, completeness, flow, format) |
| ixchat/nodes/dialog_state_extractor.py | Runtime emoji detection (what the evaluator simulates) |
| ixchat/nodes/skill_selector.py | Consumes injected state for skill selection |
| ixchat/utils/agent_config.py | AgentConfigResolver — CTA blocking checks |
| ixchat/pydantic_models/profiling.py | ProfilingState, ProfilingField |
| ixchat/pydantic_models/interest_signals_state.py | InterestSignalsState |
| ixchat/pydantic_models/dialog_supervision_state.py | DialogSupervisionState |

Design Decisions

Why not replay through the API?

Replaying each history turn through the full API would be accurate but:

  • Slow: Each turn involves 3-4 LLM calls. A 3-turn history would need 12+ LLM calls before the test query.
  • Non-deterministic: LLM responses would differ from the history, causing state divergence.
  • Expensive: Token costs scale linearly with history depth.

State injection via aupdate_state() is instant and deterministic.

Why emoji-based state detection?

Emoji markers (👉, 👇, 💌) are emitted by the LLM because the skill instructions tell it to include them. They serve as a protocol between the LLM output and deterministic graph nodes:

  • The LLM doesn't need to output structured JSON
  • The frontend strips them before rendering
  • The evaluator can detect them in history without running an LLM
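For illustration, the frontend-side stripping mentioned above could look like the sketch below. This is a hypothetical helper, not the actual frontend code; only the marker set comes from the document.

```python
# Protocol emojis named in this document; assumed to be the full marker set.
MARKERS = ("👉", "👇", "💌")


def strip_markers(text: str) -> str:
    """Remove protocol emojis and collapse the leftover whitespace before rendering."""
    for marker in MARKERS:
        text = text.replace(marker, "")
    return " ".join(text.split())
```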

Why field_id order matching?

The profiling extraction maps 👉 Q&A pairs to field_ids by index order (first 👉 = first field, second 👉 = second field). This works because the qualification rules present fields in config order, and the LLM follows that order.