IXChat Package¶

LangGraph-based chatbot with retrieval-augmented generation (RAG), conversation memory, and intelligent visitor enrichment.

Overview¶

The ixchat package provides the core chatbot functionality for the Rose platform. It uses LangGraph to orchestrate a complex workflow of specialized nodes that handle:

Document Retrieval: Fetches relevant context from LightRAG
Visitor Enrichment: Identifies companies from IP addresses
Response Generation: Produces contextual answers with LLM
Suggestion Generation: Creates follow-up questions or answer options
Dialog Supervision: Tracks conversation state and signals

Architecture Diagram¶

The following diagram shows the LangGraph structure. It is auto-generated from the graph definition during just build or just dev.

---
config:
  flowchart:
    curve: linear
---
graph TD;
    __start__([<p>__start__</p>]):::first
    retrieval_task_starter(retrieval_task_starter)
    retrieval_awaiter(retrieval_awaiter)
    third_party_enricher(third_party_enricher)
    visitor_profiler(visitor_profiler)
    interest_signals_detector(interest_signals_detector)
    intent_classifier(intent_classifier)
    action_router(action_router)
    skill_selector(skill_selector)
    dialog_state_extractor(dialog_state_extractor)
    form_field_extractor(form_field_extractor)
    answer_writer(answer_writer)
    legacy_answer_writer(legacy_answer_writer)
    follow_up_suggester(follow_up_suggester)
    answer_suggester(answer_suggester)
    finalize(finalize)
    redirect_handler(redirect_handler)
    booking_handler(booking_handler)
    __end__([<p>__end__</p>]):::last
    __start__ --> intent_classifier;
    __start__ --> interest_signals_detector;
    __start__ --> retrieval_task_starter;
    __start__ --> third_party_enricher;
    __start__ --> visitor_profiler;
    action_router -. booking .-> booking_handler;
    action_router -. redirect .-> redirect_handler;
    action_router -. answer .-> retrieval_awaiter;
    answer_suggester --> finalize;
    answer_writer -.-> answer_suggester;
    answer_writer --> dialog_state_extractor;
    answer_writer -. skip_suggestions .-> finalize;
    answer_writer -.-> follow_up_suggester;
    answer_writer --> form_field_extractor;
    booking_handler --> finalize;
    follow_up_suggester --> finalize;
    form_field_extractor --> finalize;
    intent_classifier --> skill_selector;
    interest_signals_detector --> skill_selector;
    legacy_answer_writer -.-> answer_suggester;
    legacy_answer_writer --> dialog_state_extractor;
    legacy_answer_writer -. skip_suggestions .-> finalize;
    legacy_answer_writer -.-> follow_up_suggester;
    legacy_answer_writer --> form_field_extractor;
    redirect_handler --> finalize;
    retrieval_awaiter -. new_writer .-> answer_writer;
    retrieval_awaiter -. legacy_writer .-> legacy_answer_writer;
    skill_selector -. new_system .-> action_router;
    skill_selector -. legacy_system .-> retrieval_awaiter;
    dialog_state_extractor --> __end__;
    finalize --> __end__;
    retrieval_task_starter --> __end__;
    third_party_enricher --> __end__;
    visitor_profiler --> __end__;
    classDef default fill:#f2f0ff,line-height:1.2
    classDef first fill-opacity:0
    classDef last fill:#bfb6fc

Multi-Agent Router Architecture (WIP)¶

Partial Implementation

This architecture is partially implemented. Only the redirect handler agent is active, and only in test/development environments. Production still uses the legacy_answer_writer node with monolithic prompts.

See ADR: Prompt Modularization for the full design.

Overview¶

The multi-agent router architecture replaces the monolithic prompt approach with specialized agents for different visitor intents. Instead of one large prompt handling all scenarios, the system:

Classifies intent using a fast LLM (gpt-4.1-nano)
Routes to specialized agents based on intent + interest signals
Uses a 3-level prompt hierarchy for each agent (meta-template → agent template → client instructions)

Intent Classification¶

The intent_classifier node classifies each message into one of 5 visitor intents:

Intent	Description	Example
`LEARN`	Product questions, feature inquiries	"How does your A/B testing work?"
`CONTEXT`	User sharing business context	"We have 50k monthly visitors"
`SUPPORT`	Existing customer issues	"I can't log into my dashboard"
`OFFTOPIC`	Unrelated to product	"What's the weather today?"
`OTHER`	Job inquiries, press, partnerships	"Are you hiring?"

The classifier runs in parallel with other background nodes from START, using the Langfuse prompt rose-internal-intent-router.

Action Routing¶

The action_router node uses deterministic logic (no LLM) to decide the next action based on:

Current visitor intent
Cumulative interest score (from interest_signals_detector)
Site-specific interest threshold (from agent_config table)

Action	Trigger	Handler
`EDUCATE`	LEARN intent	`legacy_answer_writer` (planned: educator agent)
`QUALIFY`	CONTEXT intent	`legacy_answer_writer` (planned: qualifier agent)
`PROPOSE_DEMO`	Qualified + buying signals	`legacy_answer_writer` (planned: CTA agent)
`HANDLE_SUPPORT`	SUPPORT intent	`redirect_handler` ✅
`HANDLE_OFFTOPIC`	OFFTOPIC intent	`redirect_handler` ✅
`HANDLE_OTHER`	OTHER intent	`redirect_handler` ✅
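
The routing table above can be sketched as a small pure function. This is a minimal illustration, not the code in `ixchat/nodes/action_router.py`: the function name `decide_next_action` is hypothetical, and "qualified + buying signals" is modeled here, as an assumption, by the cumulative interest score reaching the site threshold.

```python
from enum import Enum


class VisitorIntent(str, Enum):
    LEARN = "LEARN"
    CONTEXT = "CONTEXT"
    SUPPORT = "SUPPORT"
    OFFTOPIC = "OFFTOPIC"
    OTHER = "OTHER"


class NextAction(str, Enum):
    EDUCATE = "EDUCATE"
    QUALIFY = "QUALIFY"
    PROPOSE_DEMO = "PROPOSE_DEMO"
    HANDLE_SUPPORT = "HANDLE_SUPPORT"
    HANDLE_OFFTOPIC = "HANDLE_OFFTOPIC"
    HANDLE_OTHER = "HANDLE_OTHER"


# Intents that map straight to a redirect action, whatever the interest score.
_REDIRECT_ACTIONS = {
    VisitorIntent.SUPPORT: NextAction.HANDLE_SUPPORT,
    VisitorIntent.OFFTOPIC: NextAction.HANDLE_OFFTOPIC,
    VisitorIntent.OTHER: NextAction.HANDLE_OTHER,
}


def decide_next_action(
    intent: VisitorIntent, interest_score: float, interest_threshold: float
) -> NextAction:
    """Deterministic routing: no LLM call, only intent and score checks."""
    if intent in _REDIRECT_ACTIONS:
        return _REDIRECT_ACTIONS[intent]
    # Assumption: a score at or above the threshold stands in for
    # "qualified + buying signals".
    if interest_score >= interest_threshold:
        return NextAction.PROPOSE_DEMO
    if intent is VisitorIntent.CONTEXT:
        return NextAction.QUALIFY
    return NextAction.EDUCATE
```

Keeping the router free of LLM calls makes each decision reproducible and cheap to unit test.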

Redirect Handler¶

The redirect_handler is the only specialized agent currently implemented. It handles support, off-topic, and other requests by redirecting users to appropriate resources.

3-Level Prompt Hierarchy:

rose-internal/response-agents/meta-template (Level 1 - shared)
  └── {{lf_agent_instructions}} ← Agent template inserted
        └── rose-internal/response-agents/redirect/template (Level 2)
              └── {{lf_client_agent_instructions}} ← Client instructions inserted
                    └── rose-internal/response-agents/redirect/instructions/{domain} (Level 3)
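
To make the nesting concrete, the sketch below composes three strings the same way: client instructions are substituted into the agent template, and the result is substituted into the meta-template. Only the two placeholder names come from the hierarchy above; the template texts and the `compose_prompt` helper are hypothetical, and the real prompts are fetched from Langfuse rather than hard-coded.

```python
# Hypothetical template texts; only the {{...}} placeholder names are real.
META_TEMPLATE = "You are a website assistant.\n{{lf_agent_instructions}}"
AGENT_TEMPLATE = "Redirect the visitor politely.\n{{lf_client_agent_instructions}}"
CLIENT_INSTRUCTIONS = "Send support requests to the help center."


def compose_prompt(meta_template: str, agent_template: str, client_instructions: str) -> str:
    """Resolve the hierarchy inside-out: Level 3 into Level 2, then into Level 1."""
    agent_prompt = agent_template.replace(
        "{{lf_client_agent_instructions}}", client_instructions
    )
    return meta_template.replace("{{lf_agent_instructions}}", agent_prompt)


prompt = compose_prompt(META_TEMPLATE, AGENT_TEMPLATE, CLIENT_INSTRUCTIONS)
print(prompt)
```

Resolving inside-out means the shared meta-template never needs to know about client-level placeholders.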

Key features:

Skips RAG retrieval: The redirect handler does not need knowledge base content, so it cancels the in-flight retrieval task for faster responses. This saves 10-30% of retrieval costs on redirect cases.
Uses gpt-4.1-nano: Optimized for fast, lightweight redirect responses.

Environment-Based Routing¶

Environment	Support/Offtopic/Other	Educate/Qualify/Demo
Production	`legacy_answer_writer`	`legacy_answer_writer`
Test/Development	`redirect_handler` ✅	`legacy_answer_writer`
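
The table reduces to a single check, sketched below. The function name `select_handler` and the environment strings `"test"`, `"development"`, and `"production"` are assumptions for illustration; the legacy handler is written as `legacy_answer_writer`, the name used in the node table.

```python
# Actions the redirect handler can serve (from the Action Routing table).
REDIRECT_ACTIONS = {"HANDLE_SUPPORT", "HANDLE_OFFTOPIC", "HANDLE_OTHER"}
# Assumption: environment names as they might appear in IX_ENVIRONMENT.
REDIRECT_ENVIRONMENTS = {"test", "development"}


def select_handler(action: str, environment: str) -> str:
    """Return the node that should handle `action` in `environment`."""
    if action in REDIRECT_ACTIONS and environment in REDIRECT_ENVIRONMENTS:
        return "redirect_handler"
    # Everything else, including all production traffic, stays on the legacy writer.
    return "legacy_answer_writer"
```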

Current Limitations¶

Educator agent: Not implemented (routes to legacy_answer_writer)
Qualifier agent: Not implemented (routes to legacy_answer_writer)
CTA/Demo agent: Not implemented (routes to legacy_answer_writer)
A/B testing: No framework for comparing router vs monolithic performance

Deferred Retrieval Pattern¶

The graph uses a deferred retrieval pattern to optimize performance and reduce costs:

retrieval_task_starter fires the retrieval task at START without waiting (fire-and-forget)
action_router decides the path based on intent + signals (doesn't need retrieval results)
Answer path: retrieval_awaiter awaits the deferred task before legacy_answer_writer
Redirect path: redirect_handler cancels the retrieval task (saves 10-30% of retrieval costs)

This pattern ensures retrieval is only awaited when it is needed (answer path). On redirects the in-flight task is cancelled, which avoids wasted work.
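
The deferred pattern can be illustrated with plain `asyncio` tasks. This is a sketch under simplifying assumptions: `retrieve` is a stand-in for the real LightRAG call, `handle_turn` is a hypothetical name, and the real graph keeps the task in a retrieval task store rather than a local variable.

```python
import asyncio


async def retrieve(query: str) -> str:
    # Stand-in for the real retrieval call (hypothetical).
    await asyncio.sleep(0.05)
    return f"docs for {query!r}"


async def handle_turn(query: str, needs_answer: bool) -> str:
    # Start retrieval immediately, without awaiting it.
    task = asyncio.create_task(retrieve(query))

    # Routing would run here while retrieval proceeds in the background.
    await asyncio.sleep(0)

    if needs_answer:
        # Answer path: only now do we wait for the retrieved context.
        docs = await task
        return f"answer using {docs}"

    # Redirect path: the context is not needed, so cancel the in-flight task.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return "redirect"


print(asyncio.run(handle_turn("pricing", needs_answer=True)))
print(asyncio.run(handle_turn("weather", needs_answer=False)))
```

Awaiting the cancelled task before returning lets the cancellation finish cleanly instead of leaving a pending task behind.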

Parallel Execution and Race Conditions¶

Five nodes start in parallel from START:

START ──┬── retrieval_task_starter ──→ END (fires async task, doesn't wait)
        ├── third_party_enricher ──→ END (background)
        ├── visitor_profiler ──→ END (background)
        ├── interest_signals_detector ──→ END (background)
        └── intent_classifier ──→ END (background)

After background nodes complete, action_router routes to the appropriate handler:

action_router ──┬── retrieval_awaiter ──→ legacy_answer_writer ──→ suggestions (answer path)
                └── redirect_handler ──→ finalize (redirect path, cancels retrieval)

Race condition behavior:

action_router waits ONLY for intent_classifier and interest_signals_detector (doesn't wait for retrieval)
Background nodes (third_party_enricher, visitor_profiler, interest_signals_detector, intent_classifier) run in parallel
If background nodes complete before action_router starts → their data IS available in state
If background nodes are still running → action_router proceeds WITHOUT waiting (uses default intent)

This means enrichment data (company name, sector, interest signals, intent) is available to the response handler on a best-effort basis. Any data not ready in time is persisted for the next conversation turn.

Streaming Architecture¶

The chatbot uses a two-phase streaming approach:

Phase 1: Token Streaming (response handler)¶

Client ← token ← token ← token ← ... ← legacy_answer_writer or redirect_handler node

Uses LangGraph's astream_events API to capture on_chat_model_stream events
Only streams from response handler nodes (legacy_answer_writer, redirect_handler) - other nodes are filtered out to avoid streaming JSON
Tokens sent as Server-Sent Events: {"type": "token", "content": "..."}
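
As a rough sketch of the wire format, a token payload can be wrapped in a standard Server-Sent Events `data:` frame as shown below. The helper names `format_sse` and `token_event` are hypothetical, and the actual API layer may frame events differently.

```python
import json


def format_sse(payload: dict) -> str:
    """Serialize a payload as one SSE message (a data line plus a blank line)."""
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"


def token_event(content: str) -> str:
    """Build the token event described above."""
    return format_sse({"type": "token", "content": content})


print(token_event("Hello"), end="")
```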

Phase 2: Completion Event (after finalize)¶

Client ← complete event ← API fetches final state from graph

After streaming completes, the API layer:

Waits for all graph nodes to complete (including background nodes)
Fetches final state: chatbot.graph.aget_state(config_dict)
Extracts from state:
- suggested_follow_ups - Follow-up questions from follow_up_suggester
- suggested_answers - Answer options from answer_suggester
- cta_url_overrides - Dynamic CTA URLs from form_field_extractor
- visitor_profile - Enriched company data
Sends completion event: {"type": "complete", "metadata": {...}}
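
The extraction step can be sketched as a projection of the final state onto the four fields listed above. The helper name `build_completion_event` is hypothetical, and a plain dict stands in for `state.values`.

```python
# Fields copied from the final graph state into the completion metadata.
METADATA_KEYS = (
    "suggested_follow_ups",
    "suggested_answers",
    "cta_url_overrides",
    "visitor_profile",
)


def build_completion_event(state_values: dict) -> dict:
    """Keep only the metadata fields that are present in the final state."""
    metadata = {key: state_values[key] for key in METADATA_KEYS if key in state_values}
    return {"type": "complete", "metadata": metadata}


event = build_completion_event(
    {"response": "Sure!", "suggested_follow_ups": ["How is pricing structured?"]}
)
print(event)
```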

Why This Design?¶

Fast time-to-first-token: User sees response immediately from the response handler
Best-effort enrichment: Background data used if ready, otherwise next turn
Complete data at end: Suggestions and metadata require all nodes to finish

Code Flow¶

# chatbot.py - Streams only response handler tokens
async for event in self.graph.astream_events(...):
    if event["event"] != "on_chat_model_stream":
        continue
    node_name = event.get("metadata", {}).get("langgraph_node", "")
    if node_name not in ("legacy_answer_writer", "redirect_handler"):  # Skip other nodes
        continue
    chunk = event["data"]["chunk"]  # Message chunk emitted by the chat model
    yield chunk.content  # Stream token to client

# chat.py (API) - Fetches final state after streaming
state = await chatbot.graph.aget_state(config_dict)
suggested_follow_ups = state.values.get("suggested_follow_ups", [])
suggested_answers = state.values.get("suggested_answers", [])
# ... send completion event with metadata

Execution Flow¶

START: Five nodes launch in parallel
- retrieval_task_starter - Fires retrieval task asynchronously (doesn't wait)
- third_party_enricher - Enriches visitor profile from IP
- visitor_profiler - Infers company/sector from conversation
- interest_signals_detector - Detects buying signals
- intent_classifier - Classifies visitor intent using LLM (gpt-4.1-nano)
Action Routing: The action_router decides based on intent + signals (doesn't wait for retrieval):
- Support/Offtopic/Other intents (test/dev) → redirect_handler (cancels retrieval task)
- All other cases → retrieval_awaiter → legacy_answer_writer
Answer Path: Sequential response generation with retrieval
- retrieval_awaiter - Awaits deferred retrieval task
- legacy_answer_writer - Generates LLM response with RAG context
- suggestion_router - Routes to appropriate suggester
Redirect Path: Fast response without retrieval
- redirect_handler - Cancels retrieval task, generates redirect response (gpt-4.1-nano)
- Goes directly to finalize (no suggestions needed)
Conditional Routing: After legacy_answer_writer, the suggestion_router decides:
- Response contains 👉 → answer_suggester (generate answer options)
- Response contains 💌 or URLs → skip_suggestions (go to finalize)
- Default → follow_up_suggester (generate follow-up questions)
Background Nodes: Run in parallel without blocking response
- dialog_state_extractor - Extracts emoji markers and emails
- form_field_extractor - Extracts form field values for CTA URLs
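
The conditional routing step can be sketched as a pure function over the response text. This is an illustration only: the function name `route_suggestions`, the URL regex, and the order in which the markers are checked are assumptions, not the actual `suggestion_router` implementation.

```python
import re

# Assumption: a simple http(s) matcher is enough to detect URLs in a response.
URL_PATTERN = re.compile(r"https?://\S+")


def route_suggestions(response: str) -> str:
    """Pick the next node based on markers found in the generated response."""
    if "👉" in response:
        return "answer_suggester"  # the bot asked a question: offer answer options
    if "💌" in response or URL_PATTERN.search(response):
        return "skip_suggestions"  # go straight to finalize
    return "follow_up_suggester"  # default: propose follow-up questions
```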

Graph Nodes¶

Node	Type	Blocking	Purpose
`retrieval_task_starter`	async	No	Fires retrieval task asynchronously at START (deferred pattern)
`retrieval_awaiter`	async	Yes	Awaits deferred retrieval task on answer path only
`third_party_enricher`	async	No	Enriches visitor profile from IP address
`visitor_profiler`	async	No	Infers company/sector from conversation
`interest_signals_detector`	async	No	Detects buying signals (engagement, pricing interest)
`intent_classifier`	async	No	Classifies visitor intent using LLM (gpt-4.1-nano)
`action_router`	sync	N/A	Determines next action based on intent + signals
`legacy_answer_writer`	async	Yes	Generates LLM response with RAG context (formerly `legacy_answer`)
`redirect_handler`	async	Yes	Handles support/offtopic/other redirects, cancels retrieval (test/dev only, uses gpt-4.1-nano)
`suggestion_router`	sync	N/A	Routes to appropriate suggester based on response
`answer_suggester`	async	Yes	Generates suggested answers (when bot asks questions)
`follow_up_suggester`	async	Yes	Generates follow-up questions
`dialog_state_extractor`	async	No	Extracts emoji markers and captured emails
`form_field_extractor`	async	No	Extracts form field values for CTA URLs
`finalize`	sync	Yes	Assembles final response for client

State Model¶

The graph uses RoseChatState (TypedDict) with custom reducers for parallel updates:

Core Fields¶

Field	Type	Description
`messages`	`list[BaseMessage]`	Conversation history
`input`	`str`	Current user input
`response`	`str`	Generated LLM response
`retrieved_docs`	`str`	Context from LightRAG
`site_name`	`str`	Client site identifier
`session_id`	`str`	Conversation session ID
`turn_number`	`int`	Current conversation turn (0-indexed)

Profile & Signals¶

Field	Type	Reducer
`visitor_profile`	`VisitorProfile`	`merge_visitor_profiles`
`dialog_supervision_state`	`DialogSupervisionState`	`merge_dialog_supervision_states`
`interest_signals_state`	`InterestSignalsState`	`merge_interest_signals_states`
`form_collection_state`	`FormCollectionState`	`merge_form_collection_states`

Intent & Action Router State¶

Field	Type	Description
`intent_classification_state`	`IntentClassificationState`	Current intent + history
`action_router_state`	`ActionRouterState`	Next action + reasoning

VisitorIntent enum values: LEARN, CONTEXT, SUPPORT, OFFTOPIC, OTHER

NextAction enum values: EDUCATE, QUALIFY, PROPOSE_DEMO, HANDLE_SUPPORT, HANDLE_OFFTOPIC, HANDLE_OTHER, CONTINUE

Custom Reducers¶

Parallel nodes update state using custom merge functions:

merge_visitor_profiles: Merges enrichment results, preferring non-"unknown" values
merge_dialog_supervision_states: Cumulative "ever" flags + latest turn flags
merge_interest_signals_states: Simple replacement
merge_form_collection_states: Merges collected values, tracks max turn
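
As an illustration of the first reducer, the sketch below merges two profile updates while keeping known values over `"unknown"` ones, and attaches the reducer to a state field with `Annotated`, which is how LangGraph associates reducers with `TypedDict` keys. Plain dicts stand in for the real `VisitorProfile` model, and `SketchState` is a hypothetical name.

```python
from typing import Annotated, TypedDict


def merge_visitor_profiles(current: dict, update: dict) -> dict:
    """Merge two profile updates, never overwriting a known value with 'unknown'."""
    merged = dict(current)
    for key, value in update.items():
        incoming_is_empty = value in (None, "unknown")
        existing_is_known = merged.get(key) not in (None, "unknown")
        if incoming_is_empty and existing_is_known:
            continue  # keep the value we already have
        merged[key] = value
    return merged


class SketchState(TypedDict):
    # The reducer is called when parallel nodes write to the same key.
    visitor_profile: Annotated[dict, merge_visitor_profiles]


from_ip = {"company_name": "Acme", "sector": "unknown"}
from_chat = {"company_name": "unknown", "sector": "SaaS"}
print(merge_visitor_profiles(from_ip, from_chat))
```

With this rule the merged profile is the same regardless of which node finishes first, as long as the nodes do not report conflicting known values.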

Memory Management¶

Session state is persisted using LangGraph checkpointers:

Mode	Backend	Use Case
Redis	`AsyncRedisSaver`	Production (distributed)
Memory	`MemorySaver`	Development/Testing

Configuration:

TTL: Configurable session timeout
Keepalive: TCP socket keepalive enabled
Health checks: 30-second pings prevent idle disconnection

# Memory manager initialization
memory_manager = IXChatMemoryManager()
checkpointer = await memory_manager.get_checkpointer()
graph = graph_builder.compile(checkpointer=checkpointer)

Enrichment System¶

Multi-source visitor enrichment pipeline with priority-based fallbacks:

Priority	Source	Description
1	Redis Cache	Fast, short-lived cache
2	Supabase Lookup	IP hash lookup for returning visitors
3	Browser Reveal	Client-side data (`window.reveal`)
4	Snitcher Radar	Session UUID identification
5	Enrich.so	Server-side API fallback

Once a source returns "completed" status, remaining sources are skipped.
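
The fallback chain can be sketched as a loop over sources in priority order that stops at the first `"completed"` result. The `enrich` function, the source callables, and the `"not_found"` fallback status are hypothetical; only the priority order and the `"completed"` status come from the description above.

```python
import asyncio
from typing import Awaitable, Callable

Source = Callable[[str], Awaitable[dict]]


async def enrich(ip_address: str, sources: list[tuple[str, Source]]) -> dict:
    """Try each source in priority order and stop at the first completed result."""
    for name, fetch in sources:
        result = await fetch(ip_address)
        if result.get("status") == "completed":
            return {**result, "source": name}
    # Assumption: a placeholder status when no source can identify the visitor.
    return {"status": "not_found", "source": None}


async def cache_miss(ip_address: str) -> dict:
    return {"status": "pending"}


async def lookup_hit(ip_address: str) -> dict:
    return {"status": "completed", "company_name": "Acme"}


profile = asyncio.run(
    enrich("203.0.113.7", [("redis_cache", cache_miss), ("supabase_lookup", lookup_hit)])
)
print(profile)
```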

VisitorProfile Fields¶

Enrichment: status, tier, source, ip_address
Company: company_name, company_description, company_domain, sector, sub_sector
User Context: email, job_to_be_done, feature_list, intent
Confidence: sector_confidence_level

Integration Points¶

System	Purpose	Package
LightRAG	Document retrieval with graph & chunk ranking	`ixrag`
Supabase	Conversation storage, client configs, lead data	`ixdata`
Langfuse	Observability & tracing	`ixllm`
Azure OpenAI	LLM client	`ixllm`
Redis	Session checkpointing	`ixchat.memory`

Key Files¶

File	Description
`ixchat/__init__.py`	Public API: `get_chatbot_service()`
`ixchat/service.py`	`IXChatbotService` singleton manager
`ixchat/chatbot.py`	`IXChatbot` with LangGraph orchestration
`ixchat/graph_structure.py`	Graph structure with node/edge definitions
`ixchat/memory.py`	`IXChatMemoryManager` for session persistence
`ixchat/nodes/`	Node implementations
`ixchat/nodes/intent_classifier.py`	Intent classification using LLM
`ixchat/nodes/action_router.py`	Deterministic action routing
`ixchat/nodes/redirect_handler.py`	Redirect agent for support/offtopic/other
`ixchat/nodes/retrieval_task_starter.py`	Fires retrieval task asynchronously
`ixchat/nodes/retrieval_awaiter.py`	Awaits deferred retrieval task
`ixchat/retrieval_task_store.py`	Manages retrieval tasks for cancellation
`ixchat/pydantic_models/`	State definitions and reducers
`ixchat/pydantic_models/intent_router.py`	Intent/action router state models
`ixchat/utils/agent_config.py`	`AgentConfigResolver` for site-specific config
`ixchat/enrichment/`	Multi-source visitor enrichment

Usage¶

import asyncio

from ixchat import get_chatbot_service


async def main() -> None:
    # Get singleton service
    service = get_chatbot_service()

    # Get chatbot for a site
    chatbot = await service.get_chatbot("example-site")

    # Query with streaming
    async for chunk in chatbot.query_stream(
        input="Tell me about your product",
        site_name="example-site",
        session_id="session-123",
        person_id="posthog-distinct-id",
    ):
        print(chunk, end="")

    # Non-streaming query (for evaluations)
    response, metadata = await chatbot.query(
        input="What are your pricing plans?",
        site_name="example-site",
        session_id="session-123",
    )


asyncio.run(main())

Evaluations¶

The just eval command runs LLM evaluation tests for quality assessment and regression testing of ixchat components using Langfuse datasets.

How It Works¶

Langfuse Dataset ──→ Evaluator ──→ Classifier (LLM) ──→ Results logged to Langfuse
   (labeled examples)              (real API calls)        (runs + scores)

Test data is fetched from Langfuse datasets (labeled input/expected_output pairs)
Evaluator runs the classifier on each dataset item
Results are logged back to Langfuse as runs with scores (correct, confidence, F1, etc.)
Metrics are computed (accuracy, F1, precision, recall) and asserted against thresholds

Usage¶

cd backend

# Run a specific evaluation
just eval intent-classifier    # Intent classification accuracy
just eval skill-selector       # Skill selection accuracy
just eval e2e-api              # End-to-end API evaluation

# Run all evaluations
just eval all

Available Targets¶

Target	Langfuse Dataset	Description
`intent-classifier`	`intent-classifier`	Tests intent classification (LEARN, CONTEXT, SUPPORT, OFFTOPIC, OTHER)
`skill-selector`	`skill-selector`	Tests skill/action routing decisions
`e2e-api`	`main-dataset`	End-to-end API response quality

Langfuse Dataset Structure¶

Each dataset item in Langfuse should have:

Field	Description	Example
`input`	Classifier input (dict or string)	`{"message": "How does pricing work?", "history": [...]}`
`expected_output`	Expected classification result	`{"intent": "LEARN"}`
`metadata`	Optional context	`{"source": "production", "site_name": "example"}`

Adding Traces to Datasets¶

To expand test coverage, add production traces to Langfuse datasets:

Option 1: Langfuse UI

Go to Traces in Langfuse
Find a trace with interesting/edge-case behavior
Click Add to Dataset → select target dataset
Fill in the expected_output (ground truth label)

Option 2: Langfuse API

from langfuse import Langfuse

langfuse = Langfuse()

# Add item to existing dataset
langfuse.create_dataset_item(
    dataset_name="intent-classifier",
    input={"message": "Can you help me debug?", "history": []},
    expected_output={"intent": "SUPPORT"},
    metadata={"source": "manual", "notes": "Edge case for support detection"}
)

Environment Configuration¶

The eval command automatically configures:

IX_LANGFUSE_ENABLED=true - Enables Langfuse for real prompt fetching
LANGFUSE_ENABLED=true - Langfuse integration flag
IX_ENVIRONMENT=test - Uses test environment (overridden to development for credentials)

Test Markers¶

@pytest.mark.evaluation      # Marks as evaluation test
@pytest.mark.llm_integration # Requires real LLM API calls

Running just eval all selects tests with the pytest marker filter -m "evaluation and llm_integration".

Results in Langfuse¶

After running evaluations, results appear in Langfuse:

Score	Description
`correct`	Per-item: 1.0 if prediction matches expected, 0.0 otherwise
`confidence`	Per-item: Model confidence score (if available)
`macro_f1`	Aggregate: Macro-averaged F1 score across all classes
`weighted_f1`	Aggregate: Weighted F1 score
`accuracy`	Aggregate: Overall accuracy
`passed`	Aggregate: 1.0 if F1 >= threshold, 0.0 otherwise

Quality Thresholds¶

Default thresholds (configurable in conftest.py):

Metric	Threshold	Description
Macro F1	0.80	Overall classification quality
Min Class F1	0.60	No single class below this
Skill Recall	0.90	Multi-label skill coverage
Answer Accuracy	0.70	E2E response quality
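
To show how the first two thresholds might be applied, the sketch below computes per-class F1 and macro F1 from expected and predicted labels and checks them against 0.80 and 0.60. The function names are hypothetical and this is not the evaluator in the test suite, which may compute its metrics differently.

```python
def per_class_f1(expected: list[str], predicted: list[str]) -> dict[str, float]:
    """F1 per label, computed from true positives, false positives, false negatives."""
    scores = {}
    for label in set(expected) | set(predicted):
        pairs = list(zip(expected, predicted))
        tp = sum(e == label and p == label for e, p in pairs)
        fp = sum(e != label and p == label for e, p in pairs)
        fn = sum(e == label and p != label for e, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def passes_thresholds(
    expected: list[str],
    predicted: list[str],
    macro_min: float = 0.80,
    class_min: float = 0.60,
) -> bool:
    """Check macro F1 and the weakest class F1 against the default thresholds."""
    scores = per_class_f1(expected, predicted)
    if not scores:
        return False
    macro_f1 = sum(scores.values()) / len(scores)
    return macro_f1 >= macro_min and min(scores.values()) >= class_min
```

The minimum-class check guards against a high macro score hiding one badly classified intent.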