
LLM Routing Architecture Plan

Date: 2026-03-10
Scope: Simplify model/provider selection and fallback handling across the backend
Goal: Define a path toward one centralized, incrementally adoptable model-routing system in which each LLM use case can choose its provider, model, execution behavior, and ordered fallbacks independently, using one coherent policy model across the system. The design must preserve the current Langfuse tracing and prompt-management integration, avoid a big-bang rewrite, and make provider/model switches easy and safe inside one code-owned routing layer rather than scattering them across the codebase.


Executive Summary

The simplification opportunity is not primarily a provider switch. The main win is to unify routing policy so every LLM use case resolves provider, model, and fallback behavior through one internal system: policies defined per-package, assembled and resolved through ixllm.

The chosen direction is:

  1. Make route resolution request-scoped and non-mutating first.
  2. Introduce a unified policy registry keyed by use case, with policies defined per-package and assembled at startup.
  3. Add workload-specific backend adapters (chat, embedding, rerank) plus capability validation.
  4. Move selection and fallback construction into one ixllm resolver.
  5. Keep LangChain/LangGraph as the execution abstraction.
  6. Keep Langfuse as the observability layer.
  7. Migrate call sites incrementally, starting with ixchat chat-model use cases.

This gets you closer to the actual goal: being able to switch models and providers easily without repeatedly rewriting provider-specific wiring.

This routing layer is intended to become the single routing control plane for production backend LLM traffic, not just ixchat. That includes request-time chat/RAG workloads and production batch/API analytics workloads such as ixtagging, which currently instantiate models directly with create_llm_client(...). Evaluation and one-off tooling flows are different and do not need to be folded into that control plane in V1.

For the first implementation, the intended scope is narrower than the final target architecture:

  • implement request-scoped routing first, even if it initially wraps existing factories and fallback wrappers
  • keep routing policy code-owned through typed profiles and use-case policies
  • keep one environment-invariant routing policy for deployed environments so lower environments validate the same routes that production will use
  • retire TOML-backed model/provider/fallback routing and make Python policy/default code the only routing source of truth
  • remove prompt-level, runtime, and pipeline-level model/provider override capability from live traffic entirely, including the public request surface
  • do not add per-use-case TOML/env routing overrides
  • do not add tenant-level routing overrides
  • do not make OpenRouter a V1 architectural dependency or system boundary, even though OpenRouter provider support is already implemented in ixllm and used by the answer-node fallback path

OpenRouter is still relevant, but not as the first architectural move. It is already available as a supported provider in ixllm and already used by the answer node's dedicated fallback path. That changes the implementation baseline, but not the architectural conclusion: OpenRouter can be a useful backend for selected chat workloads and may later reduce a lot of chat-specific provider wiring, but it does not solve the full problem by itself:

  • it does not fix fragmented routing ownership
  • it does not replace Azure embeddings
  • it does not replace rerank-specific paths
  • it does not remove the need for explicit use-case policy
  • it can hide provider selection unless your own resolver normalizes routing metadata

So the right sequence is to build your internal routing layer first, then optionally add OpenRouter behind that layer for selected chat workloads.


V1 Decisions

These decisions are part of the plan, not follow-up questions:

  • Live routing uses only the code-owned Python routing config. There is no prompt-level, runtime, or pipeline-level model/provider override path.
  • The production request pipeline does not expose model/provider override capability. Public request models should not accept live routing overrides, and there is no hidden or admin-only override mode in production traffic.
  • Live routing is environment-invariant across deployed environments. Development, staging, and production validate the same use-case routing policy instead of carrying different model/provider/fallback mixes for the same task.
  • Prompt-side model/provider metadata is not part of live routing in V1. During migration it may remain in local prompt frontmatter or Langfuse prompt config for compatibility with unmigrated paths, but migrated paths ignore it and the metadata is removed after cutover.
  • The target control plane is all production backend LLM use cases, including request-time paths and batch/API analytics workloads such as ixtagging. Evaluation and one-off tooling flows can continue using low-level factories directly in V1.
  • TOML stops being a routing authority. Model/provider/fallback keys move into Python policy and defaults, and TOML keeps only non-routing application settings.
  • Langfuse remains for tracing, prompt management, and observability, but not as a routing surface. If someone wants to try another model for an existing trace, they should copy the resolved prompt into Langfuse Playground and run that experiment there.
  • Route selection is request-scoped. Migrated paths do not mutate chatbot.llm; the active route lives in the resolved use-case handle for that request.
  • Warmup is derived from the resolver's internal policy registry so startup-prewarmed clients stay aligned with the same routing config that serves requests.
  • The resolver owns execution policy end to end: timeout, retry budget, fallback chain, fallback triggers, code fallback, and streaming failover semantics.
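The warmup decision above can be derived mechanically from the assembled policies rather than from a hardcoded matrix. A minimal sketch, assuming simplified stand-ins for the typed policy shapes defined later in this plan (`resolve_prewarm_targets` is a hypothetical name):

```python
from dataclasses import dataclass

# Simplified stand-ins for the typed policies described later in this plan.
@dataclass(frozen=True)
class ChatRoute:
    provider: str
    model: str

@dataclass(frozen=True)
class ChatFallback:
    route: ChatRoute

@dataclass(frozen=True)
class ChatPolicy:
    primary: ChatRoute
    fallbacks: tuple[ChatFallback, ...] = ()

def resolve_prewarm_targets(registry: dict[str, ChatPolicy]) -> set[tuple[str, str]]:
    """Derive the (provider, model) pairs to prewarm from the routing policies."""
    targets: set[tuple[str, str]] = set()
    for policy in registry.values():
        targets.add((policy.primary.provider, policy.primary.model))
        for fallback in policy.fallbacks:
            targets.add((fallback.route.provider, fallback.route.model))
    return targets

registry = {
    "intent_classifier": ChatPolicy(
        primary=ChatRoute("azure", "gpt-4.1-mini"),
        fallbacks=(ChatFallback(ChatRoute("openai", "gpt-4.1-mini")),),
    ),
    "query_rewriter": ChatPolicy(primary=ChatRoute("groq_direct", "llama-3.1-8b-instant")),
}

assert resolve_prewarm_targets(registry) == {
    ("azure", "gpt-4.1-mini"),
    ("openai", "gpt-4.1-mini"),
    ("groq_direct", "llama-3.1-8b-instant"),
}
```

Because the prewarm set is computed from the same registry that serves requests, warmup can never drift from live routing.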

Current Structural Problems

The current implementation is harder to change than it needs to be because every package that calls an LLM has independently evolved its own model selection, and no two packages do it the same way.

ixchat: fragmented selection with hidden statefulness

ixchat has the most routing code and the most routing problems:

  • nodes carry their own default model/provider choices
  • ixchat.utils.model_override.get_llm_for_prompt() applies prompt overrides plus some provider-specific fallback behavior
  • ixchat.service.get_llm() adds a separate Azure-specific fallback path
  • ixllm.fallback_llm contains the generic fallback wrapper, but higher layers still decide when and how to use it
  • Langfuse prompt metadata can override model/provider today, creating a second config surface for routing policy

The result is that changing one chat use case's model can require edits across node defaults, prompt metadata, service-level override logic, and fallback wrappers.

ixtagging: hardcoded models with no fallback

ixtagging has three separate LLM consumers — scorer.py (helpfulness and resolution scoring), translator.py (message translation), and classifier.py (conversation classification) — each calling create_llm_client() directly with hardcoded openai/gpt-4.1-mini. There is:

  • no fallback chain: if OpenAI goes down, all scoring, classification, and translation silently fail
  • no shared execution policy: each class manages its own timeout and retry behavior
  • no single place to change the model: switching from gpt-4.1-mini to another model means editing three files independently

ixrag / LightRAG: parallel client construction stack

ixrag.lightrag.lightrag_llm duplicates the entire client construction layer outside ixllm. It has its own create_openai_client(), create_azure_openai_client(), and create_groq_client() functions, its own retry wrappers (gpt_openai_with_retry, groq_complete_with_retry), and its own embedding client (AzureOpenAIEmbeddings instantiated directly). Model selection is config-driven through a separate TOML path (config.lightrag.*), not through ixllm. Embeddings and rerank are entirely separate from any shared infrastructure.

Background jobs: inherited fragmentation with no observability

Production background jobs (run_classification.py, run_document_loader.py) inherit whatever routing pattern ixtagging or ixrag provides. They have no routing observability — there is no way to know after the fact which model served a scoring or document-processing request, and no fallback if the hardcoded provider goes down.

The common pattern

Every consumer group has ended up with its own model selection, its own fallback behavior (or none), and its own construction path. The specific problems differ — ixchat has statefulness and concurrency bugs, ixtagging has no resilience, ixrag has a duplicated construction stack — but the root cause is the same: there is no shared routing layer that all production LLM consumers resolve through.

There is also hidden statefulness that makes routing harder to reason about:

  • prompt metadata can silently alter routing policy today
  • the request API still exposes a model override input today, creating a third live routing surface
  • some request paths mutate the shared chatbot instance for tracing
  • the chatbot singleton is still used in some paths as the carrier of the "currently active" LLM for the request
  • fallback truth is reconstructed in call sites instead of returned by the routing layer
  • startup warmup hardcodes a model matrix outside the routing policy itself

This plan removes that ambiguity by making the Python routing config the only live routing source and leaving Langfuse in an observability and prompt-management role only.


Chosen Direction

The target architecture is one unified policy registry (assembled from per-package policy definitions), one backend adapter layer, and one resolver API.

The architectural boundary should be:

Call Site
  -> resolve_use_case(use_case)
  -> uses policy + adapter + capability checks
  -> returns workload-specific resolved handle + resolved route metadata
  -> invoke/embed/rerank with Langfuse callback handler
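A hedged sketch of what a migrated call site could look like under this boundary. The resolver API, handle shape, and names such as `resolve_use_case` and `ResolvedChatHandle` are assumptions for illustration, not existing code:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ResolvedRoute:
    provider: str
    model: str

@dataclass(frozen=True)
class ResolvedChatHandle:
    """What resolve_use_case() could return: an invokable plus resolved route metadata."""
    route: ResolvedRoute
    _invoke: Callable[[dict[str, Any]], str]

    def invoke(self, inputs: dict[str, Any]) -> str:
        return self._invoke(inputs)

def resolve_use_case(use_case: str) -> ResolvedChatHandle:
    # Placeholder resolution: a real resolver would consult the policy
    # registry, run capability checks, and compose the fallback chain.
    return ResolvedChatHandle(
        route=ResolvedRoute(provider="azure", model="gpt-4.1-mini"),
        _invoke=lambda inputs: f"classified:{inputs['question']}",
    )

# Call site: one resolution step, then invoke. No provider wiring here,
# and the actually-used route is available for tracing metadata.
handle = resolve_use_case("intent_classifier")
assert handle.route == ResolvedRoute("azure", "gpt-4.1-mini")
assert handle.invoke({"question": "reset password"}) == "classified:reset password"
```

The key property is that the call site receives both an invokable and the resolved route metadata in one handle, so tracing can report the route without reconstructing it.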

This keeps:

  • LangChain/LangGraph as the execution abstraction
  • Langfuse as the observability abstraction
  • provider/model policy resolved through one system (defined per-package, assembled at startup)

It also adds:

  • backend-specific construction hidden behind workload adapters
  • explicit compatibility checks before invocation
  • one place to determine which clients should be prewarmed

Why this direction fits the existing codebase

  • ixllm already owns the reusable client factory and fallback primitives.
  • The main pain is that routing decisions are scattered, not that there is no provider abstraction at all.
  • The codebase already has distinct runtime domains (chat, embeddings, rerank), so a workload-aware resolver fits the current shape better than a chat-only abstraction.
  • The backend also has production batch/API LLM consumers outside ixchat, so a shared ixllm resolver (with per-package policy definitions assembled at startup) is a better fit than letting each package grow its own model-selection layer.
  • This can be migrated one call site at a time without a big-bang rewrite.

What this direction does not require

  • No immediate switch away from Azure/OpenAI/Cerebras/Groq.
  • No architecture-wide OpenRouter migration beyond the already-landed provider support and answer-node fallback usage.
  • No immediate LightRAG redesign.
  • No requirement to force one identical raw config schema into every layer on day one.

Why OpenRouter Is Not The First Step

OpenRouter is already implemented as a supported provider in ixllm.client_factory and is already used in one production chat path as the answer node's fallback route. It could become even more useful later, especially for chat.

It is appealing because it can provide:

  • model fallback via a models list
  • provider routing via provider ordering
  • provider capability filtering
  • an OpenAI-compatible or LangChain-friendly chat interface

That makes it a strong candidate to simplify many chat-generation paths. In a later phase, it could plausibly become the default backend for a large share of standard chat use cases, including:

  • answer generation
  • intent and classification tasks
  • follow-up and answer suggestion tasks
  • skill selection

That would reduce a lot of direct provider-specific chat wiring.

However, it is not the right first architectural move because it does not solve the main source of complexity in this codebase.

If you adopt OpenRouter before centralizing policy, the codebase still keeps the main architectural problems:

  • node-local defaults
  • split override logic
  • ixchat service special cases
  • separate LightRAG construction
  • multiple policy surfaces

It also does not solve everything even after adoption:

  • it does not replace Azure embeddings
  • it does not replace the rerank provider path
  • it does not remove the legacy prompt-level override behavior that should disappear from live routing
  • it does not remove the need to report requested vs actual route consistently

So the correct posture is:

  • do not make OpenRouter the system boundary
  • do not use it as a substitute for a routing architecture
  • treat the already-landed OpenRouter usage as one backend choice that must eventually be represented behind your own resolver like any other provider

That preserves architectural control while still letting you capture most of the operational upside later.


Policy Model

The policy registry is the backbone of the design, but it must do more than just centralize hardcoded tuples.

Core principles

  • The registry should be keyed by use case, not by provider.
  • Use-case policies should be defined by the package that owns the use case (ixchat, ixtagging, ixrag), not centralized in ixllm. ixllm owns the types, profiles, resolver, and adapters. The app layer assembles per-package policies into the resolver at startup.
  • The registry should be the typed default policy layer, not the only policy source.
  • Chat, embedding, and rerank should share one conceptual routing model, but not be forced through one runtime shape.
  • groq_direct should remain a valid primary route for latency-sensitive chat-shaped workloads such as query_rewriter.
  • Live request routing should come only from the code-owned routing config.
  • Non-LLM code fallback should remain representable for cases like query rewriting.

Workload-specific adapters

The resolver should dispatch through workload-specific adapters:

  • ChatBackend
  • EmbeddingBackend
  • RerankBackend

This is the missing piece if the goal is "one interface" without relying on OpenRouter as the system boundary.

Each adapter should declare what it supports:

  • structured output
  • streaming
  • tool calling
  • supported execution knobs (timeout, retry, failover mode)
  • provider-specific request options
  • provider-specific invocation strategy when a capability is supported

This matters because a resolver that only returns Any still leaves the codebase dependent on provider quirks. The adapter layer is what makes provider/model swaps predictable.

That last point is important. Capability validation answers "can this route support structured output at all?" but safe provider switching also depends on "what is the correct invocation recipe for this provider?" In the current codebase, follow_up_suggester and answer_suggester already show why this matters: they do not just check whether structured output is available, they also choose json_schema vs json_mode based on provider and append an explicit JSON-format hint for non-OpenAI-style providers. That behavior belongs in the adapter boundary rather than staying embedded in call sites.
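One way to express that contract is a backend protocol that declares both the capability flags and the invocation recipe. A sketch, assuming hypothetical method names and illustrative capability values for Cerebras:

```python
from typing import Literal, Protocol

StructuredOutputMethod = Literal["json_schema", "json_mode"]

class ChatBackend(Protocol):
    """Adapter contract: declares capabilities and the invocation recipe."""

    def supports_structured_output(self) -> bool: ...
    def supports_streaming(self) -> bool: ...
    def structured_output_method(self) -> StructuredOutputMethod: ...
    def needs_json_prompt_hint(self) -> bool: ...

class CerebrasChatBackend:
    # Capability values here are assumptions for illustration only.
    def supports_structured_output(self) -> bool:
        return True

    def supports_streaming(self) -> bool:
        return True

    def structured_output_method(self) -> StructuredOutputMethod:
        # Non-OpenAI-style provider: use json_mode ...
        return "json_mode"

    def needs_json_prompt_hint(self) -> bool:
        # ... and require an explicit JSON-format instruction in the prompt.
        return True

backend: ChatBackend = CerebrasChatBackend()
assert backend.structured_output_method() == "json_mode"
assert backend.needs_json_prompt_hint()
```

With this shape, the suggester nodes' provider branching collapses into two adapter queries instead of inline provider checks at each call site.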

The ownership split should be strict:

  • the resolver owns business-level execution policy
  • adapters translate that policy into provider-specific client options
  • client_factory.py builds and caches clients from explicit inputs
  • fallback wrappers execute an already-resolved fallback chain rather than deciding policy themselves

Adapter layer: V1 vs later

The adapter layer should be introduced in V1, but it does not need to become the full provider abstraction immediately.

In V1, the adapter layer should mainly:

  • provide an explicit runtime boundary between the resolver and provider construction
  • validate capabilities before invocation
  • own provider-specific invocation strategy for migrated capabilities such as structured output
  • choose the correct low-level provider constructor
  • translate resolver-owned execution policy into explicit provider/client arguments
  • reuse the existing ixllm.client_factory and fallback wrappers where helpful, but only as execution primitives behind the resolver

That means the first implementation can look like:

  • AzureChatBackend.build_runnable(...) calling get_cached_llm_client(..., provider="azure")
  • OpenAIChatBackend.build_runnable(...) calling get_cached_llm_client(..., provider="openai")
  • the resolver composing ordered fallback with the existing generic fallback mechanism from explicit policy data

So in V1, the new adapter layer is mostly a formalization of an architectural boundary, not a requirement to rewrite every provider client path. What matters is that migrated paths no longer inherit timeout, retry, or fallback behavior from implicit defaults in client_factory.py or fallback-specific helper factories.

For example, migrated structured-output chat use cases should stop deciding in the node whether they need json_schema or json_mode, and should stop appending provider-specific JSON hints themselves. The resolved handle returned by the resolver should already embody the correct invocation strategy for the selected backend.

One cache rule should also be explicit in V1: cache identity must be derived from the fully translated backend client parameters for a resolved route, not just from provider + model. If a route-defining option changes the underlying client behavior, it must participate in cache identity too. That includes options such as reasoning settings, timeout/retry variants that produce distinct clients, and provider-specific transport options such as OpenRouter upstream selection. By contrast, invocation-only behavior such as structured-output method selection, prompt shaping, callback handlers, or code fallback should remain outside client cache identity.

After V1, the adapter layer should gradually become the real home of provider-specific runtime behavior.

That later evolution should include:

  • moving more provider branching out of client_factory.py and into explicit backend classes
  • centralizing capability handling such as structured output, streaming, tool calling, and provider-specific request options
  • normalizing provider exceptions into internal fallback reasons such as rate_limit, timeout, and provider_unavailable
  • shrinking or removing helper factories that bundle fixed policy choices such as "Azure with OpenAI rate-limit fallback" once those choices are fully resolver-owned
  • optionally returning richer wrappers instead of raw provider clients so actual route metadata and error normalization are captured consistently

In other words:

  • V1 introduces the adapter boundary
  • post-V1 makes the adapter layer the true abstraction boundary

This distinction is important for scope control. The first implementation should establish the boundary cleanly without forcing a big-bang rewrite of client_factory.py.

Capability validation

Without capability checks, "easy switching" becomes "easy to misconfigure."

For example, if intent_classifier requires structured output and someone switches it to:

primary = {"provider": "groq_direct", "model": "llama-3.1-8b-instant"}
structured_output = True

then a registry-only system may accept the configuration and fail later at runtime when the call site calls with_structured_output(...).

With an explicit adapter plus capability contract, the resolver can fail fast during route resolution with a validation error such as:

intent_classifier requires structured output, but groq_direct does not support it
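A minimal sketch of that fail-fast check. The capability table, `validate_route`, and `RouteValidationError` are assumed names; real capability values would live in the adapters:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackendCapabilities:
    structured_output: bool
    streaming: bool

# Assumed capability table for illustration; in the target design the
# adapters declare these values themselves.
CAPABILITIES = {
    "azure": BackendCapabilities(structured_output=True, streaming=True),
    "groq_direct": BackendCapabilities(structured_output=False, streaming=True),
}

class RouteValidationError(Exception):
    pass

def validate_route(use_case: str, provider: str, structured_output: bool) -> None:
    caps = CAPABILITIES[provider]
    if structured_output and not caps.structured_output:
        raise RouteValidationError(
            f"{use_case} requires structured output, but {provider} does not support it"
        )

try:
    validate_route("intent_classifier", "groq_direct", structured_output=True)
except RouteValidationError as err:
    message = str(err)

assert message == "intent_classifier requires structured output, but groq_direct does not support it"
```

The check runs at route resolution (or at startup for the whole registry), so a misconfigured switch fails before it serves traffic.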

But capability validation alone is not enough for safe provider switching. A provider can support a capability in principle while still requiring a different invocation pattern. The current suggester nodes are a concrete example:

  • follow_up_suggester uses structured output today, but switches between json_schema and json_mode based on provider and appends a JSON hint for non-OpenAI-style providers
  • answer_suggester does the same

So a naive switch from azure/gpt-4.1-nano to cerebras/gpt-oss-120b could pass a boolean capability check for structured output while still behaving incorrectly if the call path kept the OpenAI-style invocation recipe. The adapter contract should therefore cover both:

  • whether the provider supports the capability
  • which invocation strategy should be used when it does

For structured output in V1, that means the chat backend should own at least:

  • the structured output method (json_schema vs json_mode)
  • whether prompt shaping is required for that provider, such as appending an explicit JSON-format instruction
  • any provider-specific arguments needed to make the capability reliable

This is the difference between "the configuration is valid" and "the provider switch is operationally safe."
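The invocation-strategy half of the contract can be sketched as a per-provider recipe that the chat backend owns. The recipe table and `shape_prompt` helper are assumptions for illustration, not the existing implementation:

```python
from typing import Literal

StructuredOutputMethod = Literal["json_schema", "json_mode"]

# Assumed per-provider recipe table: (method, needs_json_prompt_hint).
# In the target design this mapping lives inside each chat backend adapter.
STRUCTURED_OUTPUT_RECIPES: dict[str, tuple[StructuredOutputMethod, bool]] = {
    "azure": ("json_schema", False),    # OpenAI-style: schema mode, no hint
    "openai": ("json_schema", False),
    "cerebras": ("json_mode", True),    # non-OpenAI-style: json_mode + hint
}

def shape_prompt(provider: str, prompt: str) -> tuple[StructuredOutputMethod, str]:
    method, needs_hint = STRUCTURED_OUTPUT_RECIPES[provider]
    if needs_hint:
        prompt = prompt + "\n\nRespond only with a valid JSON object."
    return method, prompt

method, prompt = shape_prompt("cerebras", "Suggest three follow-up questions.")
assert method == "json_mode"
assert prompt.endswith("Respond only with a valid JSON object.")

method, prompt = shape_prompt("azure", "Suggest three follow-up questions.")
assert method == "json_schema"
assert prompt == "Suggest three follow-up questions."
```

Once this lives behind the resolver, a provider switch changes the recipe lookup, not the suggester node code.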

Typed workload policies and normalized fallback reasons

The registry should not use generic provider: str and model: str fields as its final shape. That is too permissive for a system that must support chat, embeddings, and rerank reliably.

Bad shape:

@dataclass
class LLMPolicy:
    kind: str
    provider: str
    model: str

This makes invalid combinations easy to express:

LLMPolicy(kind="embedding", provider="cerebras", model="gpt-oss-120b")
LLMPolicy(kind="chat", provider="cohere", model="rerank-v3.5")
LLMPolicy(kind="rerank", provider="azure", model="gpt-4.1-mini")

Better shape:

from dataclasses import dataclass, field
from typing import Literal

FallbackReason = Literal[
    "rate_limit",
    "provider_unavailable",
    "timeout",
    "connection_error",
]

FallbackMode = Literal["immediate", "after_retries", "never"]
StreamingFailover = Literal["before_first_token_only", "never"]
ReasoningEffort = Literal["none", "low", "medium", "high"]


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    backoff: Literal["none", "exponential"]


@dataclass(frozen=True)
class TimeoutPolicy:
    total_seconds: float


@dataclass(frozen=True)
class OpenAIRouteOptions:
    reasoning_effort: ReasoningEffort | None = None


@dataclass(frozen=True)
class OpenRouterRouteOptions:
    allowed_upstreams: tuple[Literal["azure", "openai"], ...] | None = None
    allow_provider_fallbacks: bool = True
    reasoning_effort: ReasoningEffort | None = None


@dataclass(frozen=True)
class AzureChatRoute:
    provider: Literal["azure"] = "azure"
    model: ChatModel = "gpt-4.1-mini"


@dataclass(frozen=True)
class OpenAIChatRoute:
    provider: Literal["openai"] = "openai"
    model: ChatModel = "gpt-4.1-mini"
    options: OpenAIRouteOptions = field(default_factory=OpenAIRouteOptions)


@dataclass(frozen=True)
class OpenRouterChatRoute:
    provider: Literal["openrouter"] = "openrouter"
    model: ChatModel = "openai/gpt-5.2"
    options: OpenRouterRouteOptions = field(default_factory=OpenRouterRouteOptions)


@dataclass(frozen=True)
class GroqDirectChatRoute:
    provider: Literal["groq_direct"] = "groq_direct"
    model: ChatModel = "llama-3.1-8b-instant"


ChatRoute = AzureChatRoute | OpenAIChatRoute | OpenRouterChatRoute | GroqDirectChatRoute


@dataclass(frozen=True)
class EmbeddingRoute:
    provider: EmbeddingProvider
    model: EmbeddingModel


@dataclass(frozen=True)
class RerankRoute:
    provider: RerankProvider
    model: RerankModel


@dataclass(frozen=True)
class ChatFallback:
    route: ChatRoute
    on: tuple[FallbackReason, ...]


@dataclass(frozen=True)
class CodeFallback:
    kind: Literal["python_pronoun_rewrite"]
    on: tuple[FallbackReason, ...]


@dataclass(frozen=True)
class ChatPolicy:
    # Exactly one of `primary` or `profile` should be set.
    primary: ChatRoute | None = None
    profile: str | None = None
    fallbacks: tuple[ChatFallback, ...] = ()
    code_fallback: CodeFallback | None = None
    structured_output: bool = False
    streaming: bool = False
    execution: "ChatExecutionPolicy | None" = None


@dataclass(frozen=True)
class ChatExecutionPolicy:
    timeout: TimeoutPolicy
    retry: RetryPolicy
    fallback_mode: FallbackMode
    streaming_failover: StreamingFailover


@dataclass(frozen=True)
class EmbeddingPolicy:
    primary: EmbeddingRoute


@dataclass(frozen=True)
class RerankPolicy:
    primary: RerankRoute
    on_failure: Literal["return_original_documents", "raise"] = "raise"

This makes invalid cross-workload combinations much harder to express and keeps fallback policy stable even if provider SDK exception classes change. It also makes operational behavior explicit, so switching models or providers does not silently inherit the wrong timeout, retry, or failover behavior from adapter defaults.

It also solves an important gap in the current codebase: some live routes are not fully defined by provider + model alone. The current answer_writer fallback is not just "OpenRouter with openai/gpt-5.2". It is OpenRouter pinned to Azure as the upstream, with OpenRouter-side provider fallback disabled, while the GPT-5 primary route also carries a provider-specific reasoning setting. Those are route-defining choices, not incidental helper behavior, so they should be representable in the policy model itself rather than hidden in factory code.

For example, the current answer-writer route is closer to this:

ChatPolicy(
    primary=OpenAIChatRoute(
        model="gpt-5.4",
        options=OpenAIRouteOptions(
            reasoning_effort="none",
        ),
    ),
    fallbacks=(
        ChatFallback(
            route=OpenRouterChatRoute(
                model="openai/gpt-5.2",
                options=OpenRouterRouteOptions(
                    allowed_upstreams=("azure",),
                    allow_provider_fallbacks=False,
                    reasoning_effort="none",
                ),
            ),
            on=("rate_limit", "provider_unavailable", "timeout", "connection_error"),
        ),
    ),
    streaming=True,
)

The adapter should then translate those typed route options into provider-specific SDK or transport arguments:

from typing import Any

def build_openrouter_client(route: OpenRouterChatRoute) -> Any:
    extra_body = {}

    if route.options.allowed_upstreams is not None:
        extra_body["provider"] = {
            "only": list(route.options.allowed_upstreams),
            "allow_fallbacks": route.options.allow_provider_fallbacks,
        }

    if route.options.reasoning_effort is not None:
        extra_body["reasoning"] = {"effort": route.options.reasoning_effort}

    return get_cached_llm_client(
        model=route.model,
        provider="openrouter",
        extra_body=extra_body or None,
    )

The important design rule is:

  • typed route options belong in the policy model when they materially define the route's behavior
  • adapters translate those typed options into provider-specific request arguments
  • raw transport payloads such as extra_body should remain an adapter detail, not the public policy shape

CodeFallback is intentionally separate from provider fallbacks. It represents deterministic degraded-mode behavior owned by the resolver, not a second provider route. In V1, it should stay intentionally narrow and solve only the real case that exists today: query rewriter's Python fallback.

The intended contract in V1 is:

  • provider/model fallbacks remain part of the ordered fallbacks chain
  • code_fallback is attempted only after the LLM route fails with an allowed normalized fallback reason
  • the resolver executes the code fallback, not the call site
  • the code fallback runs on the use-case input, not on raw provider-specific prompt strings
  • the code fallback returns the same logical output type as the primary use case
  • the invocation result reports it as ActualExecution(target_type="code", ...)
  • prompt metadata may not introduce, replace, or modify a code_fallback
  • if the code fallback itself fails, the request fails; there is no second fallback layer behind it

This keeps non-LLM fallback logic inside the same routing control plane instead of scattering it across call sites.

The practical scope should be deliberately small:

  • keep CodeFallback resolver-owned
  • support exactly one concrete fallback kind in V1: python_pronoun_rewrite
  • use it only for query_rewriter in V1
  • do not introduce a generic plugin/registry system for arbitrary code fallbacks yet

For example, the current intent_classifier route becomes:

ChatPolicy(
    primary=AzureChatRoute(model="gpt-4.1-mini"),
    fallbacks=(
        ChatFallback(
            route=OpenAIChatRoute(model="gpt-4.1-mini"),
            on=("rate_limit",),
        ),
    ),
    structured_output=True,
    execution=ChatExecutionPolicy(
        timeout=TimeoutPolicy(total_seconds=30),
        retry=RetryPolicy(max_attempts=0, backoff="none"),
        fallback_mode="immediate",
        streaming_failover="never",
    ),
)

And the current query_rewriter route remains explicit about its special-case behavior:

ChatPolicy(
    primary=GroqDirectChatRoute(model="llama-3.1-8b-instant"),
    code_fallback=CodeFallback(
        kind="python_pronoun_rewrite",
        on=("timeout", "connection_error", "provider_unavailable"),
    ),
    structured_output=False,
    streaming=False,
    execution=ChatExecutionPolicy(
        timeout=TimeoutPolicy(total_seconds=1),
        retry=RetryPolicy(max_attempts=0, backoff="none"),
        fallback_mode="never",
        streaming_failover="never",
    ),
)

In this design, query_rewriter remains a chat-shaped use case, but its degraded-mode fallback is still owned by the resolver. The call site continues to call one resolved use case; it does not implement separate Python fallback logic itself.

Concretely, the resolver-owned path should behave like:

  1. try the configured LLM route for query rewriting
  2. if the failure reason matches the allowed CodeFallback.on reasons, run python_pronoun_rewrite
  3. return the rewritten query and report ActualExecution(target_type="code", target="python_pronoun_rewrite")

The current first-turn optimization in query rewriter should remain separate from CodeFallback. It is a fast-path rule ("no history, so don't call the LLM"), not a failure fallback.
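The three resolver-owned steps can be sketched as follows. The names are hypothetical, and the code-fallback lambda is a trivial stand-in for the real pronoun-rewrite logic:

```python
from dataclasses import dataclass
from typing import Callable

FallbackReason = str  # e.g. "timeout", "connection_error", "provider_unavailable"

class ProviderError(Exception):
    def __init__(self, reason: FallbackReason):
        super().__init__(reason)
        self.reason = reason

@dataclass(frozen=True)
class ActualExecution:
    target_type: str  # "llm" or "code"
    target: str

def rewrite_query(
    query: str,
    llm_route: Callable[[str], str],
    code_fallback: Callable[[str], str],
    allowed_reasons: tuple[FallbackReason, ...],
) -> tuple[str, ActualExecution]:
    try:
        # 1. try the configured LLM route
        return llm_route(query), ActualExecution("llm", "groq_direct/llama-3.1-8b-instant")
    except ProviderError as err:
        # 2. only an allowed normalized reason may trigger the code fallback
        if err.reason not in allowed_reasons:
            raise
        # 3. the resolver runs the code fallback and reports it in the result
        return code_fallback(query), ActualExecution("code", "python_pronoun_rewrite")

def failing_route(query: str) -> str:
    raise ProviderError("timeout")

result, execution = rewrite_query(
    "what about it?",
    llm_route=failing_route,
    code_fallback=lambda q: q.replace("it", "the invoice"),
    allowed_reasons=("timeout", "connection_error", "provider_unavailable"),
)
assert result == "what about the invoice?"
assert execution == ActualExecution("code", "python_pronoun_rewrite")
```

Note that the code fallback receives the use-case input (the query), not a provider-specific prompt string, and a disallowed failure reason re-raises instead of degrading.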

Reusable route profiles

A use-case registry alone still encourages repetition when many use cases share the same route. If every use case restates the full provider/model/fallback tuple, large switches remain noisy and error-prone.

The safer pattern is:

  • keep the registry keyed by use case
  • allow a use case to reference a reusable route profile for common routes
  • let the use case specialize only what is specific to that workload

Policy ownership: per-package, not centralized in ixllm

Use-case policies should be defined by the package that owns the use case, not centralized in ixllm. This keeps the dependency direction clean: consumer packages depend on ixllm for types and the resolver, but ixllm never imports or knows about ixchat, ixtagging, or ixrag.

The ownership split is:

  • ixllm owns the framework: policy types (ChatPolicy, ChatRoute, etc.), route profiles (CHAT_ROUTE_PROFILES), the resolver, and backend adapters
  • each consumer package owns its use-case policies: ixchat defines policies for intent_classifier, answer_writer, etc.; ixtagging defines policies for conversation_classification, etc.; ixrag defines policies for lightrag_embedding, etc.
  • the app layer assembles all per-package policies into the resolver at startup

For example, shared route profiles live in ixllm:

# ixllm/routing/profiles.py
CHAT_ROUTE_PROFILES = {
    "chat.answer_primary": ChatRouteProfile(
        primary=OpenAIChatRoute(
            model="gpt-5.4",
            options=OpenAIRouteOptions(
                reasoning_effort="none",
            ),
        ),
        fallbacks=(
            ChatFallback(
                route=OpenRouterChatRoute(
                    model="openai/gpt-5.2",
                    options=OpenRouterRouteOptions(
                        allowed_upstreams=("azure",),
                        allow_provider_fallbacks=False,
                        reasoning_effort="none",
                    ),
                ),
                on=("rate_limit", "provider_unavailable", "timeout", "connection_error"),
            ),
        ),
    ),
    "chat.standard_primary": ChatRouteProfile(
        primary=AzureChatRoute(model="gpt-4.1"),
        fallbacks=(
            ChatFallback(
                route=OpenAIChatRoute(model="gpt-4.1"),
                on=("rate_limit",),
            ),
        ),
    ),
    "chat.fast_small": ChatRouteProfile(
        primary=AzureChatRoute(model="gpt-4.1-nano"),
        fallbacks=(
            ChatFallback(
                route=OpenAIChatRoute(model="gpt-4.1-nano"),
                on=("rate_limit",),
            ),
        ),
    ),
}

Each consumer package defines its own use-case policies:

# ixchat/routing.py
from ixllm.routing import ChatPolicy, AzureChatRoute, GroqDirectChatRoute, ...

IXCHAT_POLICIES = {
    "answer_writer": ChatPolicy(
        profile="chat.answer_primary",
        streaming=True,
    ),
    "dialog_supervisor": ChatPolicy(
        profile="chat.fast_small",
        structured_output=True,
    ),
    "intent_classifier": ChatPolicy(
        primary=AzureChatRoute(model="gpt-4.1-mini"),
        fallbacks=(...),
        structured_output=True,
    ),
    "query_rewriter": ChatPolicy(
        primary=GroqDirectChatRoute(model="llama-3.1-8b-instant"),
        code_fallback=CodeFallback(kind="python_pronoun_rewrite", ...),
    ),
    # ... other ixchat use cases
}

# ixtagging/routing.py
from ixllm.routing import ChatPolicy, OpenAIChatRoute

IXTAGGING_POLICIES = {
    "conversation_classification": ChatPolicy(
        primary=OpenAIChatRoute(model="gpt-4.1-mini"),
        structured_output=True,
    ),
    "message_translation": ChatPolicy(
        primary=OpenAIChatRoute(model="gpt-4.1-mini"),
    ),
    # ... other ixtagging use cases
}

# ixrag/routing.py
from ixllm.routing import EmbeddingPolicy, RerankPolicy, ...

IXRAG_POLICIES = {
    "lightrag_embedding": EmbeddingPolicy(primary=EmbeddingRoute(...)),
    "lightrag_rerank": RerankPolicy(primary=RerankRoute(...)),
    # ... other ixrag use cases
}

The app layer assembles them into the resolver at startup:

# app startup (e.g. service.py or app entrypoint)
from ixllm.routing import LLMResolver, CHAT_ROUTE_PROFILES
from ixchat.routing import IXCHAT_POLICIES
from ixtagging.routing import IXTAGGING_POLICIES
from ixrag.routing import IXRAG_POLICIES

resolver = LLMResolver(
    profiles=CHAT_ROUTE_PROFILES,
    policies={**IXCHAT_POLICIES, **IXTAGGING_POLICIES, **IXRAG_POLICIES},
)

This preserves per-use-case control while making bulk route changes operationally easier. It also keeps the dependency graph clean: ixchat → ixllm (imports types), never the reverse.
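One hazard in the startup assembly shown above is that a plain `{**A, **B}` merge lets one package silently shadow another package's use-case key. A hedged sketch of a stricter assembly helper (the name `assemble_policies` is hypothetical) that fails fast on collisions:

```python
def assemble_policies(*package_policies: dict) -> dict:
    """Merge per-package policy dicts, failing fast on duplicate use-case keys.

    A plain {**a, **b} merge would let one package silently shadow another's
    use case; raising at startup keeps policy ownership unambiguous.
    """
    merged: dict = {}
    for policies in package_policies:
        for use_case, policy in policies.items():
            if use_case in merged:
                raise ValueError(f"duplicate use-case policy: {use_case!r}")
            merged[use_case] = policy
    return merged
```

The app layer could then pass `assemble_policies(IXCHAT_POLICIES, IXTAGGING_POLICIES, IXRAG_POLICIES)` to the resolver instead of a raw dict merge.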

The important point is that the final schema should support both:

  • inline route definitions for one-off use cases
  • profile references for common shared routes

but it should do so through one explicit, internally consistent policy type rather than two parallel policy shapes.

Resolver-owned policy registry

For V1, routing policy should remain code-only. Profiles and use-case policies are enough to centralize ownership and simplify the architecture without adding a second deployment-time control surface or an environment-specific routing layer.

That means the effective V1 policy is built by the resolver from:

  • code-defined route profiles (owned by ixllm)
  • code-defined use-case policies (owned by each consumer package: ixchat, ixtagging, ixrag, etc.)
  • assembled into the resolver by the app layer at startup

There is no second request-time routing merge layer in the plan. Live routing does not accept prompt-level, runtime, or pipeline-level model/provider overrides.

The practical approach is:

  • code owns the full typed policy model: ixllm owns the types and route profiles, each consumer package owns its use-case policies, and the app layer assembles them into the resolver
  • Python code also owns the static routing defaults; TOML is not a parallel routing authority in V1
  • the deployed routing policy is environment-invariant, so development, staging, and production all resolve the same use-case routes
  • profile is code-only in V1
  • model/provider/fallback selection is resolved from code-owned config only
  • the live request API does not accept model/provider overrides as alternate routing inputs
  • Langfuse prompt metadata is ignored for live route selection
  • prompt metadata may not modify code_fallback
  • prompt-side model/provider metadata may continue to exist temporarily in local prompt frontmatter or Langfuse prompt config while unmigrated paths still depend on it, but migrated paths must ignore it
  • once the affected use cases are migrated, that prompt-side routing metadata should be removed rather than kept as dead configuration
  • if someone wants to test another model for a traced prompt, they should copy the resolved prompt from the trace into Langfuse Playground instead of using a runtime override path
  • tests, evaluation, and ad hoc tooling may still mock or select models directly, but that is outside the deployed routing control plane and must not become a second live routing config surface

In other words, the system should share one conceptual policy shape, and the resolver should build one validated internal policy registry from it at startup while keeping all live routing policy ownership in code for the first implementation.

That also means:

  • model/provider/fallback routing should be removed from backend/apps/shared_data/config/*.toml
  • the live routing defaults currently coming from TOML must be copied into per-package Python policy definitions before TOML routing keys are removed
  • the migrated per-package Python policies, assembled into the resolver at startup, must become the single deployed routing policy across development, staging, and production rather than a base layer with environment-specific route overrides
  • prompt-side routing metadata (model / provider) should no longer be treated as authoritative once a use case is migrated to the resolver
  • during migration, prompt-side routing metadata can remain in place only for compatibility with unmigrated paths
  • after migration of the affected use cases, prompt-side routing metadata should be removed from both local prompt frontmatter and Langfuse prompt config
  • this convergence step must be explicit because the current Pydantic defaults do not match current TOML-backed runtime values for several routes, including the default chat route and LightRAG defaults
  • after that convergence, TOML may still carry non-routing settings, but not model/provider/fallback selection

Internal policy registry

The runtime should also have a distinct internal policy registry: the static routing view that the resolver builds from code-owned routing config during initialization.

This is operationally important because:

  • startup warmup should use it
  • boot-time validation should use it
  • debugging tools should be able to print it

Suppose the resolver's internal policy entry for answer_writer is:

primary: openai/gpt-5.4
fallback: openrouter/openai/gpt-5.2

Then warmup should prime clients for both targets, boot-time validation should check both routes, and debug tooling should print exactly this resolved view.

The safer pattern is:

  1. Initialize the resolver from code-defined profiles + use-case policies.
  2. Let the resolver build one validated internal policy registry from them.
  3. Warm clients from that registry.
  4. Resolve requests from that same registry, then report actual execution if a configured fallback was used at runtime.

The startup contract should be explicit:

  • warmup must be driven from the resolver's internal policy registry, not from a hand-maintained model list in the app layer
  • warmup must include configured primaries and configured fallbacks
  • warmup must include any execution-policy variants that produce distinct cached clients, such as retry-mode differences that affect cache identity
  • warmup should remain best-effort operational behavior, but the source of truth for what gets warmed is still the resolver's internal policy registry built from code-owned routing config

This improves on the current implementation, where startup warmup is a hand-maintained matrix in the app layer. That approach drifts as soon as routing changes, which is already visible in the current answer-writer path: the live route changed, but the warmup list did not. In the new design, warmup stays aligned automatically because it is derived from the same resolver-owned policy registry that serves requests.
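Deriving warmup from the registry can be sketched as follows. `RegistryEntry` is a hypothetical simplification of the resolver's internal entries; the point is only that warmup targets are computed from the same data that serves requests, including fallbacks and cache-identity variants like retry mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    primary_target: str                    # e.g. "azure/gpt-4.1-mini"
    fallback_targets: tuple[str, ...] = ()
    retry_mode: str = "default"            # variants that change cache identity


def warmup_targets(registry: dict[str, RegistryEntry]) -> set[tuple[str, str]]:
    """Derive every (target, retry_mode) client to warm from the registry itself,
    so the warmed set can never drift from what resolution will actually use."""
    targets: set[tuple[str, str]] = set()
    for entry in registry.values():
        targets.add((entry.primary_target, entry.retry_mode))
        for fallback in entry.fallback_targets:
            targets.add((fallback, entry.retry_mode))
    return targets
```

A route change in any policy automatically changes this set; there is no second list to update.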

Streaming failover semantics

Streaming should have an explicit failover rule instead of implicitly behaving like non-streaming invocation.

This is not just a design preference. It is a current correctness bug in the existing fallback wrapper: if the primary provider emits some streamed output and then fails, the fallback provider starts a second fresh generation. That can produce stitched output made of two different runs.

For example, the user might receive something like:

Rose helps B2B teams automate... Rose is an AI-first inbound marketing platform...

where the first fragment came from the primary provider and the second fragment came from the fallback provider restarting the answer.

For V1, the recommended rule is:

  • allow fallback only before the first token is emitted
  • after the first token, do not switch providers mid-stream
  • if the primary stream fails after the first token, surface the failure upward rather than splicing in fallback output

This avoids mixing partial output from one provider with continuation from another provider that does not share the same generation state.
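The before-first-token rule can be sketched as an async wrapper over candidate streams. This is an illustrative shape only, with hypothetical names; the real implementation would live behind the resolver and classify failure reasons before retrying.

```python
import asyncio
from typing import AsyncIterator, Callable


async def stream_with_pre_token_failover(
    streams: list[Callable[[], AsyncIterator[str]]],
) -> AsyncIterator[str]:
    """Yield tokens from the first stream that produces any output.

    Fallback is allowed only before the first token: once a provider has
    emitted output, a later failure is surfaced instead of being spliced
    over with a fresh generation from another provider.
    """
    last_error: Exception | None = None
    for make_stream in streams:
        emitted = False
        try:
            async for token in make_stream():
                emitted = True
                yield token
            return
        except Exception as exc:
            if emitted:
                raise  # mid-stream failure: surface it, do not stitch output
            last_error = exc  # pre-token failure: try the next provider
    if last_error is not None:
        raise last_error


def collect(streams: list) -> list[str]:
    # Small synchronous driver for demonstration.
    async def run() -> list[str]:
        return [t async for t in stream_with_pre_token_failover(streams)]
    return asyncio.run(run())
```

With this shape, the stitched-output bug described above cannot occur: the consumer either sees one provider's complete stream or an exception.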


Target Resolver Design

The main new abstraction should be one resolver in ixllm, not a new provider wrapper everywhere.

In simple terms:

  • the policy says the routing rule
  • the resolver applies that rule for one use case and one request
  • the backend adapter and client factory build the actual provider client underneath

So the resolver is the runtime layer that turns "this use case should use Azure with this fallback and this execution behavior" into "here is the exact thing to invoke for this request." That is why this abstraction exists. Without it, every call site has to keep re-implementing model selection, fallback construction, capability handling, and tracing metadata in slightly different ways.

The code in this section is intentionally pseudo code. It is here to illustrate the shape of the resolver, what layer should own it, and where it should be called from. Exact class names, method names, and module boundaries can differ in the implementation as long as the ownership model stays the same.

Minimal API surface

# The app layer assembles per-package policies into the resolver at startup.
resolver = LLMResolver(
    profiles=CHAT_ROUTE_PROFILES,
    policies={**IXCHAT_POLICIES, **IXTAGGING_POLICIES, **IXRAG_POLICIES},
)

resolved = resolver.resolve_use_case(
    use_case="intent_classifier",
)

result = await resolved.invoke(messages)

For non-chat workloads, the same resolver returns a different resolved handle:

resolved = resolver.resolve_use_case("lightrag_embedding")
embedding_result = await resolved.embed(texts)

resolved = resolver.resolve_use_case("lightrag_rerank")
rerank_result = await resolved.rerank(query, documents)

In practice, this resolver remains owned by ixllm, but each consumer package receives it through a different injection pattern. The app layer assembles the resolver at startup; each package uses whatever injection mechanism fits its runtime shape.

ixchat: request context through the graph

ixchat.service passes the resolver into the graph's request context. Nodes access it from there:

# ixchat service.py pseudo code
request_context = ChatRequestContext(
    chatbot=chatbot,
    resolver=llm_resolver,
)

graph.invoke(..., context=request_context)

# ixchat node pseudo code
resolved = request_context.resolver.resolve_use_case(
    use_case="intent_classifier",
)
result = await resolved.invoke(messages)

ixtagging: constructor injection

TaggingService currently lazily creates HelpfulnessScorer(), ConversationClassifier(), and MessageTranslator() with no-arg constructors. After migration, the resolver is passed at construction time:

# ixtagging/service.py pseudo code
class TaggingService:
    def __init__(self, resolver: LLMResolver, ...):
        self._resolver = resolver

    def _get_classifier(self) -> ConversationClassifier:
        if self._classifier is None:
            self._classifier = ConversationClassifier(resolver=self._resolver)
        return self._classifier

# ixtagging/classifier.py pseudo code — replaces create_llm_client() call
resolved = self._resolver.resolve_use_case("conversation_classification")
result = await resolved.invoke(messages)

ixrag / LightRAG: function factory capture

LightRAG uses callback-style functions (create_llm_function(), embedding_func_with_retry()) rather than class-based call sites. The resolver is captured by these function factories at construction time, replacing the direct client construction inside lightrag_llm.py:

# ixrag/lightrag/lightrag_llm.py pseudo code
def create_llm_function(resolver: LLMResolver) -> Callable:
    async def llm_func(prompt: str, ...) -> str:
        resolved = resolver.resolve_use_case("lightrag_retrieval")
        result = await resolved.invoke(messages)
        return result.output
    return llm_func

# ixrag embedding pseudo code
def create_embedding_function(resolver: LLMResolver) -> Callable:
    async def embed(texts: list[str]) -> list[list[float]]:
        resolved = resolver.resolve_use_case("lightrag_embedding")
        result = await resolved.embed(texts)
        return result.vectors
    return embed

Background jobs: app-layer assembly

Jobs like run_classification.py currently call get_tagging_service() which returns a TaggingService with no routing awareness. After migration, the job entrypoint assembles the resolver and passes it through:

# jobs/tagging/run_classification.py pseudo code
from ixllm.routing import LLMResolver, CHAT_ROUTE_PROFILES
from ixtagging.routing import IXTAGGING_POLICIES

resolver = LLMResolver(
    profiles=CHAT_ROUTE_PROFILES,
    policies=IXTAGGING_POLICIES,
)
service = get_tagging_service(resolver=resolver)

The job itself never touches routing. It just passes the assembled resolver to the service layer.

Key placement rules

  • service.py should not choose provider/model/fallback itself
  • IXChatbot should not store the active request route in chatbot.llm
  • ixtagging components should not call create_llm_client() directly
  • lightrag_llm.py should not duplicate client construction from ixllm
  • nodes, services, and function factories should ask ixllm to resolve the route for the specific use case they are about to execute

In practical terms, the new system should replace this pattern:

llm, original_llm, model_name, provider_name = get_llm_for_prompt(...)
chatbot.llm = llm
try:
    result = await some_node_logic(chatbot.llm, ...)
finally:
    chatbot.llm = original_llm

with this pattern:

resolved = request_context.resolver.resolve_use_case(
    use_case="answer_writer",
)
result = await resolved.invoke(messages)

The difference is architectural, not cosmetic:

  • in the old pattern, the chosen route is stored on a shared chatbot object
  • in the new pattern, the chosen route belongs to one request only
  • the chatbot remains a shared site-scoped container, but the active LLM route is request-scoped

Suggested return shapes

from dataclasses import dataclass
from typing import Any, Literal

@dataclass(frozen=True)
class DeclaredRoute:
    kind: Literal["chat", "embedding", "rerank"]
    primary_target: str
    fallback_targets: tuple[str, ...] = ()


@dataclass(frozen=True)
class ActualExecution:
    target_type: Literal["provider", "code", "degraded"]
    target: str
    fallback_triggered: bool


@dataclass
class ResolvedChatUseCase:
    runnable: Any
    declared_route: DeclaredRoute
    execution: ChatExecutionPolicy
    capabilities: dict[str, bool]
    metadata: dict[str, Any]


@dataclass
class ChatInvocationResult:
    output: Any
    declared_route: DeclaredRoute
    actual_execution: ActualExecution
    metadata: dict[str, Any]


@dataclass
class ResolvedEmbeddingUseCase:
    declared_route: DeclaredRoute
    metadata: dict[str, Any]


@dataclass
class EmbeddingInvocationResult:
    vectors: list[list[float]]
    declared_route: DeclaredRoute
    actual_execution: ActualExecution
    metadata: dict[str, Any]


@dataclass
class ResolvedRerankUseCase:
    declared_route: DeclaredRoute
    metadata: dict[str, Any]


@dataclass
class RerankInvocationResult:
    documents: list[Any]
    declared_route: DeclaredRoute
    actual_execution: ActualExecution
    metadata: dict[str, Any]

This two-phase shape is important. Declared route information is available when the use case is resolved. Actual execution information is only knowable after the call completes. The shared routing metadata is consistent across workloads, but chat, embeddings, and rerank keep workload-appropriate method and result shapes.

Concrete example: resolve vs invoke

For intent_classifier, the resolved route can say:

declared_route.kind = chat
declared_route.primary_target = azure/gpt-4.1-mini
declared_route.fallback_targets = (openai/gpt-4.1-mini,)

But only the invocation result can truthfully say:

actual_execution.target_type = provider
actual_execution.target = openai/gpt-4.1-mini
actual_execution.fallback_triggered = true

if Azure rate-limits and the request falls back.

What still calls the provider API

The new policy layer does not remove the need for provider-specific calling code. It changes where that code lives and who chooses when to use it.

The intended runtime layering is:

Use-case policy
  -> resolver
  -> backend adapter
  -> provider SDK client
  -> provider API

That means:

  • the policy says which route should be used
  • the resolver decides which backend adapter to use for this request
  • the backend adapter builds the correct provider client
  • the provider client is what actually calls Azure, OpenAI, Cerebras, Groq, and so on

This is an important distinction. The goal is not to eliminate provider-specific clients. The goal is to stop re-implementing provider selection and fallback wiring in many call sites.

In the current codebase, this low-level provider construction already exists in ixllm.client_factory:

  • create_llm_client(...) dispatches by provider
  • azure maps to create_azure_openai_client(...)
  • openai maps to create_openai_client(...)
  • cerebras maps to create_cerebras_client(...)
  • groq maps to create_groq_client(...)
  • groq_direct maps to create_direct_groq_client(...)

That code remains useful. In V1, the resolver can call into the existing factory rather than replacing it immediately.

The architectural change is:

  • call sites stop deciding which provider constructor to call
  • ixllm owns that decision centrally
  • the factory becomes a low-level implementation detail behind the resolver
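The dispatch that stays behind the resolver can be illustrated with a table of per-provider constructors. The constructors below are toy stand-ins for the real ones in ixllm.client_factory; only the shape of the dispatch is the point.

```python
from typing import Callable


# Stand-ins for the real per-provider constructors in ixllm.client_factory.
def create_azure_openai_client(model: str) -> str:
    return f"azure-client:{model}"


def create_openai_client(model: str) -> str:
    return f"openai-client:{model}"


def create_direct_groq_client(model: str) -> str:
    return f"groq-direct-client:{model}"


PROVIDER_FACTORIES: dict[str, Callable[[str], str]] = {
    "azure": create_azure_openai_client,
    "openai": create_openai_client,
    "groq_direct": create_direct_groq_client,
}


def create_llm_client(provider: str, model: str) -> str:
    """Single dispatch point: call sites never pick a provider constructor."""
    try:
        factory = PROVIDER_FACTORIES[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return factory(model)
```

Call sites pass a use case to the resolver; only this low-level layer ever maps a provider name to a constructor.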

Concrete layering example: intent_classifier

The use-case policy stays declarative, defined in the consumer package that owns the use case:

# ixchat/routing.py
IXCHAT_POLICIES = {
    "intent_classifier": ChatPolicy(
        primary=ChatRoute(provider="azure", model="gpt-4.1-mini"),
        fallbacks=(
            ChatFallback(
                route=ChatRoute(provider="openai", model="gpt-4.1-mini"),
                on=("rate_limit",),
            ),
        ),
        structured_output=True,
    ),
}

The resolver uses that policy to choose the correct backend:

resolved = resolver.resolve_use_case("intent_classifier")
result = await resolved.invoke(messages)

Under the hood, the intended flow is:

resolve_use_case("intent_classifier")
  -> load resolved internal policy entry for intent_classifier
  -> choose AzureChatBackend
  -> AzureChatBackend.build_runnable(...)
  -> get_cached_llm_client(model="gpt-4.1-mini", provider="azure")
  -> create_llm_client(...)
  -> create_azure_openai_client(...)
  -> AzureChatOpenAI(...)
  -> .ainvoke(...)

If fallback is configured, the resolver then wraps the primary and fallback clients with ordered fallback behavior, using the same idea as the existing generic fallback wrapper.

So:

  • yes, there is still one provider-specific client path per provider family
  • no, there should not be one provider-specific path per use case
  • and usually there does not need to be one special client per model, because the model is mainly a parameter passed into the provider client

This is the real meaning of "one interface" in this plan: every call site uses one internal resolver API, while ixllm still hides provider-specific SDK details underneath.
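The ordered fallback wrapping mentioned above (the same idea as the existing generic fallback wrapper, or LangChain's with_fallbacks) can be sketched like this. Client objects, the `name` attribute, and `classify_failure` are hypothetical; the sketch shows only the ordering and reason-matching logic, plus the actual-execution truth the resolver would report.

```python
class OrderedFallbackChain:
    """Invoke the primary, then configured fallbacks in order, but only for
    failure reasons each fallback declares; report which target actually ran."""

    def __init__(self, primary, fallbacks, classify_failure):
        # primary: a client with .name and .invoke(messages)
        # fallbacks: sequence of (client, allowed_reasons) pairs
        self._targets = [(primary, None)] + list(fallbacks)
        self._classify = classify_failure

    def invoke(self, messages):
        last_exc: Exception | None = None
        for index, (client, allowed) in enumerate(self._targets):
            if last_exc is not None and allowed is not None:
                if self._classify(last_exc) not in allowed:
                    break  # this fallback is not configured for that failure
            try:
                # (output, actual target, fallback_triggered)
                return client.invoke(messages), client.name, index > 0
            except Exception as exc:
                last_exc = exc
        raise last_exc
```

Note that the call site gets the fallback truth back from the invocation instead of inspecting shared wrapper state, which is exactly the ActualExecution contract described earlier.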

Resolver responsibilities

  • Build one validated internal policy registry from code-owned profiles plus use-case policies during initialization.
  • Look up the already-resolved policy entry for the requested use case in that registry.
  • Validate that the chosen backend can support the requested use case.
  • Translate the resolved execution policy into concrete backend/client parameters.
  • Construct the primary client through the correct backend adapter.
  • Construct and wrap the ordered fallback chain if configured.
  • Own the use case's execution policy end to end, including timeout, retry budget, fallback mode, fallback triggers, and streaming failover semantics.
  • Return stable declared route metadata at resolution time.
  • Return stable actual execution metadata at invocation time so call sites stop inferring fallback truth themselves.

In this design, execution policy is owned by the resolver, not by incidental defaults in lower layers. The adapter layer translates explicit policy into provider-specific client parameters, while low-level factories and wrappers remain implementation details for client construction, connection reuse, and invocation.

What should move out of call sites

  • hardcoded default_model
  • hardcoded default_provider
  • provider-specific fallback construction
  • per-node knowledge of whether a fallback wrapper is required
  • manual fallback truth reconstruction for observability
  • startup-specific knowledge of which clients should be warmed
  • hidden execution defaults such as timeouts, retry budgets, and streaming failover behavior

Design constraints

  • Resolution should be request-scoped and non-mutating.
  • Resolving a route should not override chatbot.llm on a shared instance.
  • IXChatbot should become a site-scoped container, not the carrier of per-request LLM state.
  • Live routing should come only from the code-owned routing config; prompt metadata and other request-level override inputs must not alter route selection.
  • The resolver should separate route resolution from invocation because effective provider/model is only knowable after the call completes.

Why request scope is phase 1

This is not a nice-to-have. It should be the first implementation step.

Today the codebase caches one chatbot instance per site, and that chatbot can hold a shared fallback wrapper for its default LLM. If two requests for the same site run concurrently:

  1. Request A starts with declared route azure/gpt-4.1-mini
  2. Azure rate-limits, so the wrapper falls back to openai/gpt-4.1-mini
  3. The wrapper updates its internal "last used" state
  4. Before Request A logs tracing metadata, Request B uses the same wrapper
  5. Request B succeeds on Azure and overwrites the wrapper state

Now Request A can incorrectly report that Azure was used when OpenAI actually served the response.

This is why the new system should resolve a fresh route object per request, even if it reuses cached underlying provider clients internally.

There is a second version of the same problem that is even easier to picture:

  1. A shared chatbot instance for one site starts with chatbot.llm = azure/gpt-4.1
  2. Request A arrives and its prompt metadata selects openai/gpt-5.4
  3. The old override pattern temporarily assigns chatbot.llm = openai/gpt-5.4
  4. Before Request A finishes, Request B for the same site starts
  5. Request B reads chatbot.llm while it still contains Request A's temporary override
  6. Request B can now run with the wrong model, or restore the wrong value when it finishes

This is the core reason the new system should not try to "safely mutate" chatbot.llm. It should stop using chatbot.llm as the carrier of request routing state.

The intended ownership model is:

  • IXChatbot owns site-scoped shared resources such as graph, retriever, memory, and service access
  • the resolver owns route selection
  • the resolved use-case handle owns the active LLM route for exactly one request
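That ownership split can be shown with a minimal sketch: the resolver is shared and caches provider clients, but every resolution returns a fresh handle, so one request's execution state can never leak into another's. All names here are illustrative.

```python
class CachedClient:
    """Stands in for an expensive provider client that is safe to share."""
    def __init__(self, target: str):
        self.target = target


class ResolvedHandle:
    """Request-scoped: owns the active route and execution truth for one request."""
    def __init__(self, client: CachedClient):
        self.client = client
        self.fallback_triggered = False


class Resolver:
    """Shared across requests; hands out a fresh handle per resolution."""
    def __init__(self):
        self._clients: dict[str, CachedClient] = {}

    def _client(self, target: str) -> CachedClient:
        if target not in self._clients:
            self._clients[target] = CachedClient(target)
        return self._clients[target]

    def resolve_use_case(self, use_case: str) -> ResolvedHandle:
        # New handle every call: route state never lives on shared objects.
        return ResolvedHandle(self._client("azure/gpt-4.1-mini"))
```

The underlying connection pool is reused, but the mutable per-request state is not, which removes both race scenarios described above.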

Importantly, this remains necessary even if the LEGACY answer path is deprecated soon. Removing the LEGACY path removes one explicit chatbot.llm override/restore cycle, but it does not eliminate:

  • service-level request overrides that still mutate chatbot.llm before graph execution
  • shared fallback wrapper state on cached chatbot instances

So legacy deprecation reduces how much migration effort should be spent on the old path, but it does not remove the need for request-scoped route ownership in the NEW system.

The first migration step should therefore be:

  1. resolve a fresh route object per request
  2. reuse cached provider clients internally where appropriate
  3. return actual execution information from invocation results instead of shared wrapper state

LangChain And Langfuse In This Design

This direction keeps the current Langfuse and LangChain/LangGraph integration model intact.

LangChain / LangGraph

LangChain and LangGraph remain the execution abstraction. The new resolver decides what route to use; it does not replace the runtime composition model.

init_chat_model may still be useful later as an implementation detail to reduce branching in client_factory.py, but it is not the system boundary and not the architectural answer by itself.

Langfuse

Langfuse remains the observability layer:

  • call sites still invoke models with Langfuse callback handlers
  • the resolver returns stable declared route metadata at resolution time
  • the invocation result returns stable actual execution metadata after the call

The main improvement is that call sites no longer need to reconstruct routing truth themselves.

This matters even more if OpenRouter is added later, because OpenRouter may handle some provider fallback internally. In that case, the resolver should normalize response metadata into the same declared-route / actual-execution shape used elsewhere.

Warmup and tracing should follow the same rule:

  • startup warmup should consume the resolver's internal policy registry
  • request tracing should consume the request-scoped resolved route plus invocation result

Langfuse-driven model or provider overrides exist today, but they are not part of the target architecture. The rule for live traffic should be simple:

  • the configured route for a use case comes from the code-owned routing config
  • Langfuse prompt metadata does not change model, provider, or fallback selection
  • prompt metadata does not modify code_fallback
  • there is no separate runtime override path in the request pipeline
  • the request API itself does not expose a model/provider override escape hatch

During migration, this should be handled as a compatibility transition rather than a second routing mode:

  • local prompt frontmatter and Langfuse prompt config may still contain model / provider metadata for unmigrated paths
  • migrated paths ignore those fields completely for route selection
  • once those paths are migrated, the obsolete prompt-side routing metadata should be removed instead of left behind as dead config

Replay and evaluation workflows are a separate concern. If someone wants to try another model for an existing trace, they should copy the resolved prompt from that trace into Langfuse Playground and run the experiment there. That should not introduce a second routing control surface into the production pipeline.


Mapping Current Use Cases Into The Target Registry

This maps the current usage audit into the proposed shape. It is intentionally direct, so the first migration can preserve current behavior.

Prompt-driven model/provider overrides may exist in current code, but they are not part of the target architecture for any use case below.

| Use Case | Current Primary | Current Fallback | Kind | Notes |
| --- | --- | --- | --- | --- |
| answer_writer | openai/gpt-5.4 | openrouter/openai/gpt-5.2 on provider outage / timeout / rate limit | chat | Current default route is hardcoded in answer.py, not inherited from chatbot config |
| redirect_handler | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Same behavior as today |
| booking_handler | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Same behavior as today |
| intent_classifier | azure/gpt-4.1-mini | openai/gpt-4.1-mini on rate limit | chat | Structured output |
| interest_signals_detector | cerebras/gpt-oss-120b | azure/gpt-4.1-mini on 503/429 | chat | Structured output, provider-specific today |
| skill_selector | cerebras/gpt-oss-120b | azure/gpt-4.1-mini on configured errors | chat | Already closest to config-driven routing |
| visitor_profiler | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Structured output |
| profile_extractor | azure/gpt-4.1-mini | openai/gpt-4.1-mini on rate limit | chat | Structured output |
| follow_up_suggester | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Structured output wrapper |
| answer_suggester | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Structured output wrapper |
| dialog_supervisor | azure/gpt-4.1-nano | openai/gpt-4.1-nano on rate limit | chat | Structured output |
| query_rewriter | groq_direct/llama-3.1-8b-instant | Python pronoun rewrite | chat | V1's only planned CodeFallback; keep the existing first-turn fast path separate from failure fallback |
| conversation_classification | openai/gpt-4.1-mini | None today | chat | ixtagging; structured output against the site taxonomy |
| message_translation | openai/gpt-4.1-mini | None today | chat | ixtagging; batch translation of non-English conversations before classification |
| conversation_helpfulness_scoring | openai/gpt-4.1-mini | None today | chat | ixtagging; structured output scoring task |
| conversation_resolution_scoring | openai/gpt-4.1-mini | None today | chat | ixtagging; structured output scoring task |
| lightrag_retrieval | azure/gpt-4.1-nano | None today | chat | Good candidate for later alignment with standard resolver |
| lightrag_processing | azure/gpt-4.1-mini | None today | chat | Same as above |
| lightrag_embedding | azure/text-embedding-3-small | None today | embedding | Separate resolver branch, not BaseChatModel |
| lightrag_rerank | cohere/rerank-v3.5 | Return original docs | rerank | Non-chat domain, keep separate type |

Migration Sequence

The migration should be incremental and should follow the architecture, not the other way around.

Phase 0: source-of-truth convergence

  1. Encode the current live routing behavior in per-package Python policy definitions (e.g. ixchat/routing.py, ixtagging/routing.py, ixrag/routing.py) and Python config defaults before introducing the resolver cutover.
  2. This convergence must include the live answer_writer route (openai/gpt-5.4 with openrouter/openai/gpt-5.2 fallback), the standard Azure/OpenAI chat routes, the current LightRAG routes, the skills route, the query rewriter route, and the current ixtagging routes for classification, translation, and scoring.
  3. Converge on one deployed routing policy for all deployed environments. Lower environments should validate the same use-case routes that production will use rather than keep alternate model/provider/fallback mixes.
  4. Remove model/provider override fields from live request models and stop accepting request-time routing overrides through the API layer.
  5. Remove model/provider/fallback routing keys from backend/apps/shared_data/config/*.toml once Python reflects the live behavior.
  6. Stop treating prompt-side model/provider metadata as routing authority for migrated use cases. During the transition, keep that metadata only where unmigrated paths still depend on it.
  7. After the affected use cases are migrated, remove obsolete prompt-side routing metadata from local prompt frontmatter and Langfuse prompt config.
  8. Keep TOML only for non-routing application settings after this phase.

This phase exists to avoid an accidental behavior change during the architecture migration. Today the Python defaults and the TOML-backed runtime values are not identical, so simply deleting TOML routing keys would change live behavior before the new resolver is even in place. The intended end state is not "base policy plus per-environment route tweaks"; it is one deployed routing policy, validated in lower environments and then promoted unchanged.
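As an illustration of step 1, the per-package policy modules could encode routes as plain frozen dataclasses. This is a hedged sketch, not existing code: `Route`, `UseCasePolicy`, and `IXCHAT_POLICIES` are hypothetical names, and the example routes are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One provider/model pair, e.g. azure/gpt-4.1-nano."""
    provider: str
    model: str


@dataclass(frozen=True)
class UseCasePolicy:
    """Code-owned routing policy for one LLM use case."""
    name: str
    primary: Route
    fallbacks: tuple[Route, ...] = ()
    workload: str = "chat"  # "chat" | "embedding" | "rerank"


# Hypothetical ixchat/routing.py contents: encode the live routes verbatim,
# so the resolver cutover changes ownership, not behavior.
IXCHAT_POLICIES = (
    UseCasePolicy(
        name="follow_up_suggester",
        primary=Route("azure", "gpt-4.1-nano"),
        fallbacks=(Route("openai", "gpt-4.1-nano"),),
    ),
    UseCasePolicy(
        name="query_rewriter",
        primary=Route("groq_direct", "llama-3.1-8b-instant"),
    ),
)
```

At startup, ixllm would assemble these per-package tuples into one registry keyed by use-case name, rejecting duplicate names so no package can silently shadow another's route.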

Phase 1: request-scoped resolver foundation

  1. Introduce request-scoped route resolution and invocation results.
  2. Stop relying on shared mutable chatbot state for routing truth. In migrated paths, nodes should not assign to chatbot.llm; they should invoke a request-scoped resolved use case directly.
  3. Start returning actual execution information from invocation results.
  4. Keep existing provider clients and fallback wrappers where helpful instead of fully redesigning every backend immediately, but use them only behind resolver-owned policy.
  5. Make execution policy explicit for migrated routes, including timeout, retry, fallback mode, fallback triggers, and streaming failover behavior.
  6. Treat the current mid-stream fallback behavior in FallbackLLM.astream as a real bug to fix during this phase. The wrapper must not switch to a fallback provider after any response chunk has already been emitted.
  7. Stop introducing new chatbot.llm mutation in the NEW system. The active route for a request should live in the resolved use-case handle, not on the shared chatbot instance.
  8. Remove request-time model override handling from the production query pipeline rather than carrying it forward behind the resolver.
  9. Do not carry prompt-level or runtime model/provider override behavior into the new resolver. Live route selection should come only from the code-owned routing config.

This phase is about correctness under concurrency.

Phase 2: migrate the standard ixchat chat paths

The best initial wave is the ixchat chat-model use cases that already rely on get_llm_for_prompt() or service.get_llm():

  • intent_classifier
  • profile_extractor
  • follow_up_suggester
  • answer_suggester
  • dialog_supervisor
  • visitor_profiler
  • redirect_handler
  • booking_handler
  • answer_writer

These can move first because they already depend on the main ixllm client path and mostly differ only by:

  • use-case name
  • structured vs non-structured output
  • whether the route uses a shared standard profile or a use-case-specific primary/fallback pair (as answer_writer does)

Within this wave, follow_up_suggester and answer_suggester are especially good first proofs for the adapter boundary because they already require provider-specific structured-output handling. Migrating them through the resolver should demonstrate that adapters own the invocation strategy itself, not just a yes/no capability check.

For these migrated paths, nodes should resolve and invoke a request-scoped ResolvedChatUseCase directly instead of reading from or writing to chatbot.llm.

This is the point where IXChatbot starts becoming a site-scoped container (retriever, memory, graph, service access, site metadata) rather than the holder of the request's active LLM route.
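What a migrated call site might look like, sketched with hypothetical stand-ins: `ResolvedChatUseCase` and `InvocationResult` mirror the handles described above, and the node body is illustrative rather than the real intent classifier.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class InvocationResult:
    """What actually ran, reported separately from the declared route."""
    output: Any
    provider: str
    model: str
    used_fallback: bool


@dataclass(frozen=True)
class ResolvedChatUseCase:
    """Request-scoped handle: nodes invoke this instead of chatbot.llm."""
    use_case: str
    _invoke: Callable[[str], InvocationResult]

    def invoke(self, prompt: str) -> InvocationResult:
        return self._invoke(prompt)


def intent_classifier_node(state: dict, resolved: ResolvedChatUseCase) -> dict:
    # No chatbot.llm read or write: the active route travels with the request.
    result = resolved.invoke(state["question"])
    return {
        **state,
        "intent": result.output,
        "route_used": f"{result.provider}/{result.model}",
    }
```

Because the handle is request-scoped, two concurrent requests can carry different resolved routes (or different fallback outcomes) without interfering through shared chatbot state.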

Phase 3: migrate the specialized chat paths

  • interest_signals_detector
  • skill_selector

These are slightly more specialized because they rely on non-Azure primaries plus explicit custom fallback rules.

If the LEGACY system is still present during this phase, treat it as a compatibility path rather than the target shape:

  • avoid further architectural investment in legacy-only chatbot.llm mutation patterns
  • keep any remaining legacy mutation localized until the path is removed
  • do not let legacy compatibility drive the ownership model of the new resolver

Phase 4: migrate the special-case nonstandard paths

  • query_rewriter
  • LightRAG text generation

These are special because they either use groq_direct or a parallel LightRAG-specific stack.

For query_rewriter, Phase 4 should preserve the current behavior shape but move ownership into the resolver:

  • keep the groq_direct primary route
  • keep the first-turn Python fast path as a call-site or use-case optimization
  • move the current Python-on-LLM-error behavior into resolver-owned CodeFallback
  • do not generalize CodeFallback beyond python_pronoun_rewrite in this phase unless a second real use case appears
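A sketch of the resolver-owned CodeFallback shape under these constraints; the names (`CodeFallback`, `rewrite_query`) are hypothetical, and the pronoun-rewrite body is a placeholder for the existing Python logic:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CodeFallback:
    """Resolver-owned fallback that runs code instead of another model."""
    name: str
    run: Callable[[str], str]


def python_pronoun_rewrite(query: str) -> str:
    # Placeholder for the existing deterministic rewrite logic.
    return query


def rewrite_query(query: str, llm_rewrite: Callable[[str], str],
                  code_fallback: CodeFallback) -> str:
    """Primary LLM route (groq_direct today) with a resolver-owned
    code fallback on LLM error, matching the current behavior shape."""
    try:
        return llm_rewrite(query)
    except Exception:
        return code_fallback.run(query)
```

The first-turn fast path stays a call-site optimization: the caller simply skips `rewrite_query` entirely, so the fallback policy never has to model it.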

Phase 5: migrate production batch/API LLM workloads

  • ixtagging conversation classification
  • ixtagging message translation
  • ixtagging helpfulness scoring
  • ixtagging resolution scoring

These should join the same routing control plane even though they are not part of the request-time chat graph. They are production backend behavior, they persist derived analytics data, and today they still bypass shared routing by calling create_llm_client(...) directly.

Once the ixchat resolver surface is stable, these are a good next wave because they do not need request-scoped graph plumbing or streaming semantics, but they should still resolve model/provider policy through ixllm so production backend behavior does not keep a second hardcoded routing system.
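One hedged way to picture the change for these workloads: instead of each ixtagging component calling create_llm_client(...) with hardcoded arguments, it asks the shared registry for its route. The registry contents and `resolve_batch_route` below are illustrative stand-ins for the real ixllm resolver surface, with routes mirroring the current behavior listed above.

```python
# Hypothetical stand-in for the shared code-owned policy registry.
POLICY_REGISTRY = {
    "conversation_classification": ("openai", "gpt-4.1-mini"),
    "message_translation": ("openai", "gpt-4.1-mini"),
    "conversation_helpfulness_scoring": ("openai", "gpt-4.1-mini"),
    "conversation_resolution_scoring": ("openai", "gpt-4.1-mini"),
}


def resolve_batch_route(use_case: str) -> tuple[str, str]:
    """Look up (provider, model) for a batch use case from shared policy.

    Unknown use cases fail loudly instead of silently defaulting, so a
    missing registry entry surfaces at call time rather than being masked.
    """
    try:
        return POLICY_REGISTRY[use_case]
    except KeyError:
        raise KeyError(f"no routing policy registered for {use_case!r}")


# A batch component would then build its client from the resolved route,
# e.g. provider, model = resolve_batch_route("conversation_classification")
```

After this change, retargeting all four scoring and classification routes becomes a one-line policy edit instead of four scattered client-construction edits.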

For V1, it should be explicit that evaluation and ad hoc tooling are not part of this migration wave:

  • ixevaluation
  • CLI regenerate flows
  • other internal tooling where direct model selection is intentional

Those flows can continue to choose models directly or use mocks/stubs for tests, but they should remain clearly outside the deployed routing control plane. They are not a reason to preserve environment-specific live routing for the same production use case.

Phase 6: migrate non-chat workloads

  • embeddings
  • rerank

These should share the same policy registry, but not necessarily the same runnable-construction code path as chat models.
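The split between shared policy and workload-specific construction could look like a simple dispatch on workload type; the builder functions here are placeholders returning labels, not real client constructors:

```python
from typing import Callable

# Hypothetical builders: chat keeps the BaseChatModel path, while
# embeddings and rerank get their own construction branches fed by
# the same (provider, model) policy inputs.
def build_chat_backend(provider: str, model: str) -> str:
    return f"chat:{provider}/{model}"

def build_embedding_backend(provider: str, model: str) -> str:
    return f"embedding:{provider}/{model}"

def build_rerank_backend(provider: str, model: str) -> str:
    return f"rerank:{provider}/{model}"

WORKLOAD_BUILDERS: dict[str, Callable[[str, str], str]] = {
    "chat": build_chat_backend,
    "embedding": build_embedding_backend,
    "rerank": build_rerank_backend,
}

def build_backend(workload: str, provider: str, model: str) -> str:
    """One policy registry feeds all workloads; construction branches here."""
    try:
        builder = WORKLOAD_BUILDERS[workload]
    except KeyError:
        raise ValueError(f"unsupported workload: {workload!r}")
    return builder(provider, model)
```

This keeps lightrag_embedding and lightrag_rerank in the same control plane as chat routes without pretending they are chat models.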

Phase 7: optional backend simplification

After the resolver and policy model are stable:

  1. Use init_chat_model if it helps simplify internal factory branching.
  2. Formalize the already-implemented openrouter provider as an explicit ChatBackend behind the resolver instead of leaving it as answer-node-specific wiring.
  3. Migrate selected chat use cases beyond the already-landed answer-node fallback path to OpenRouter where the operational tradeoff makes sense.

At that point, OpenRouter can become the backbone for many chat use cases if it proves operationally useful, but it should still remain one backend option behind your own routing layer, not the solution to every workload.

This phase is explicitly out of scope for the first implementation unless separately requested, aside from preserving and eventually absorbing the already-landed answer-node/OpenRouter route into the same resolver-owned architecture.


Final Recommendation

Build the internal routing layer first, then simplify backends behind it.

That means:

  1. Fix correctness first with request-scoped, non-mutating route resolution.
  2. Unify routing policy through ixllm with typed use-case policies (defined per-package, assembled at startup) and explicit execution policy.
  3. Use workload-specific adapters plus capability validation so route changes are safe.
  4. Drive warmup, tracing, and observability from the same resolver outputs.
  5. Migrate call sites incrementally.
  6. Add OpenRouter later where it meaningfully simplifies chat, not as a replacement for the architecture itself.

For clarity, the minimum acceptable first delivery is:

  1. request-scoped route resolution and invocation results
  2. unified policy ownership for the migrated ixchat chat paths (defined in ixchat, resolved through ixllm)
  3. no prompt-level or runtime routing overrides in the live request path
  4. explicit execution policy for migrated routes, including streaming failover behavior and a fix for the current mid-stream fallback bug
  5. no new chatbot.llm mutation in migrated NEW-system paths
  6. warmup and observability aligned with resolver-owned routing data

Expected Benefits

Once this architecture is in place, the main benefits should be easy to see in two areas: engineering maintainability and runtime operations.

Engineering / maintainability

  • Model changes become policy changes instead of repeated code edits across many call sites.
  • Provider switches become safer because capability checks happen before runtime.
  • Provider switches become safer because timeout, retry, and fallback behavior are part of the use-case contract instead of hidden adapter defaults.
  • Provider-specific logic is isolated in ixllm instead of leaking into business logic nodes.
  • Fallback behavior is defined once instead of being split across ixchat, wrappers, and special cases.
  • Shared route profiles make bulk route changes easier and less error-prone.
  • New use cases should mostly mean adding a policy entry rather than writing new provider-wiring logic.
  • New providers can be added behind the same resolver boundary instead of forcing call-site rewrites.
  • The system can migrate incrementally instead of requiring a big-bang rewrite.

Operations / runtime behavior

  • Request-scoped routing removes the current shared-state risk around fallback tracking and tracing.
  • Declared route and actual execution are reported separately, so observability becomes more accurate even for code fallbacks and graceful degradation.
  • Warmup can use the same resolver-owned internal policy registry as runtime routing, reducing drift.
  • Langfuse prompt metadata stops being a hidden production routing surface.
  • Fallback behavior becomes more predictable because one resolver owns the order and trigger rules.
  • Streaming behavior becomes more predictable because failover is explicit instead of implicit mid-stream.
  • Cached provider clients can still be reused underneath, so cleaner architecture does not require sacrificing connection reuse.
  • If OpenRouter usage expands later, it can simplify many chat workloads without changing the system boundary again because it is already just another backend behind the resolver.
  • Chat, embeddings, and rerank can share one routing control plane without being forced into the same runtime shape.

This gives you the biggest simplification with the least operational risk and aligns with the actual goal: make model and provider switches easy, explicit, and maintainable across the whole backend, not just for one subset of chat calls.


Source References

Repository references

  • backend/packages/ixchat/ixchat/service.py
  • backend/packages/ixchat/ixchat/utils/model_override.py
  • backend/packages/ixllm/ixllm/client_factory.py
  • backend/packages/ixllm/ixllm/fallback_llm.py
  • backend/packages/ixllm/ixllm/prompts/langfuse.py
  • backend/packages/ixrag/ixrag/lightrag/lightrag_llm.py
  • backend/packages/ixtagging/ixtagging/service.py
  • backend/packages/ixtagging/ixtagging/classifier.py
  • backend/packages/ixtagging/ixtagging/translator.py
  • backend/packages/ixtagging/ixtagging/scorer.py
  • backend/packages/ixinfra/ixinfra/config/settings.py

External docs reviewed

  • LangChain init_chat_model
  • https://reference.langchain.com/python/langchain/models/#langchain.chat_models.init_chat_model
  • LangChain configurable models
  • https://docs.langchain.com/oss/python/langchain/models#configurable-models
  • LangChain configurable alternatives
  • https://reference.langchain.com/python/langchain_core/language_models/#configurable-alternatives
  • LangChain base URL / OpenAI-compatible providers
  • https://docs.langchain.com/oss/python/langchain/models#base-url-or-proxy
  • LangChain ChatOpenAI custom provider parameters
  • https://reference.langchain.com/python/integrations/langchain_openai/ChatOpenAI/#custom-provider-parameters
  • LangChain ChatOpenRouter
  • https://reference.langchain.com/python/langchain-openrouter/chat_models/ChatOpenRouter
  • OpenRouter model routing
  • https://openrouter.ai/docs/features/model-routing#model-routing
  • OpenRouter provider ordering and fallbacks
  • https://openrouter.ai/docs/guides/routing/provider-selection.mdx#ordering-specific-providers
  • OpenRouter parameter compatibility filtering
  • https://openrouter.ai/docs/features/provider-routing?codeTab=Python+Example+with+Fallbacks+Enabled#requiring-providers-to-support-all-parameters
  • Langfuse LangChain / LangGraph integration
  • https://github.com/langfuse/langfuse-docs/blob/main/pages/integrations/frameworks/langchain.mdx?plain=1#L13