LLM Routing Architecture Plan¶
Date: 2026-03-10
Scope: Simplify model/provider selection and fallback handling across the backend
Goal: Define a path toward one centralized, incrementally adoptable model-routing system in which each LLM use case can choose its provider, model, execution behavior, and ordered fallbacks independently, using one coherent policy model across the system. The routing layer must preserve the current Langfuse tracing and prompt-management integration, avoid a big-bang rewrite, and make provider/model switches easy and safe inside one code-owned routing layer rather than scattering them across the codebase.
Executive Summary¶
The simplification opportunity is not primarily a provider switch. The main win is to unify routing policy so every LLM use case resolves provider, model, and fallback behavior through one internal system: policies defined per-package, assembled and resolved through ixllm.
The chosen direction is:
- Make route resolution request-scoped and non-mutating first.
- Introduce a unified policy registry keyed by use case, with policies defined per-package and assembled at startup.
- Add workload-specific backend adapters (`chat`, `embedding`, `rerank`) plus capability validation.
- Move selection and fallback construction into one `ixllm` resolver.
- Keep LangChain/LangGraph as the execution abstraction.
- Keep Langfuse as the observability layer.
- Migrate call sites incrementally, starting with `ixchat` chat-model use cases.
This gets you closer to the actual goal: being able to switch models and providers easily without repeatedly rewriting provider-specific wiring.
This routing layer is intended to become the single routing control plane for production backend LLM traffic, not just ixchat. That includes request-time chat/RAG workloads and production batch/API analytics workloads such as ixtagging, which currently instantiate models directly with create_llm_client(...). Evaluation and one-off tooling flows are different and do not need to be folded into that control plane in V1.
For the first implementation, the intended scope is narrower than the final target architecture:
- implement request-scoped routing first, even if it initially wraps existing factories and fallback wrappers
- keep routing policy code-owned through typed profiles and use-case policies
- keep one environment-invariant routing policy for deployed environments so lower environments validate the same routes that production will use
- retire TOML-backed model/provider/fallback routing and make Python policy/default code the only routing source of truth
- remove prompt-level, runtime, and pipeline-level model/provider override capability from live traffic entirely, including the public request surface
- do not add per-use-case TOML/env routing overrides
- do not add tenant-level routing overrides
- do not make OpenRouter a V1 architectural dependency or system boundary, even though OpenRouter provider support is already implemented in `ixllm` and used by the answer-node fallback path
OpenRouter is still relevant, but not as the first architectural move. It is already available as a supported provider in `ixllm` and already used by the answer node's dedicated fallback path. That changes the implementation baseline, but not the architectural conclusion: OpenRouter can be a useful backend for selected chat workloads and may later reduce a lot of chat-specific provider wiring, but it does not solve the full problem by itself:
- it does not fix fragmented routing ownership
- it does not replace Azure embeddings
- it does not replace rerank-specific paths
- it does not remove the need for explicit use-case policy
- it can hide provider selection unless your own resolver normalizes routing metadata
So the right sequence is to build your internal routing layer first, then optionally add OpenRouter behind that layer for selected chat workloads.
V1 Decisions¶
These decisions are part of the plan, not follow-up questions:
- Live routing uses only the code-owned Python routing config. There is no prompt-level, runtime, or pipeline-level model/provider override path.
- The production request pipeline does not expose model/provider override capability. Public request models should not accept live routing overrides, and there is no hidden or admin-only override mode in production traffic.
- Live routing is environment-invariant across deployed environments. Development, staging, and production validate the same use-case routing policy instead of carrying different model/provider/fallback mixes for the same task.
- Prompt-side `model`/`provider` metadata is not part of live routing in V1. During migration it may remain in local prompt frontmatter or Langfuse prompt config for compatibility with unmigrated paths, but migrated paths ignore it and the metadata is removed after cutover.
- The target control plane is all production backend LLM use cases, including request-time paths and batch/API analytics workloads such as `ixtagging`. Evaluation and one-off tooling flows can continue using low-level factories directly in V1.
- TOML stops being a routing authority. Model/provider/fallback keys move into Python policy and defaults, and TOML keeps only non-routing application settings.
- Langfuse remains for tracing, prompt management, and observability, but not as a routing surface. If someone wants to try another model for an existing trace, they should copy the resolved prompt into Langfuse Playground and run that experiment there.
- Route selection is request-scoped. Migrated paths do not mutate `chatbot.llm`; the active route lives in the resolved use-case handle for that request.
- Warmup is derived from the resolver's internal policy registry so startup-prewarmed clients stay aligned with the same routing config that serves requests.
- The resolver owns execution policy end to end: timeout, retry budget, fallback chain, fallback triggers, code fallback, and streaming failover semantics.
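The warmup decision above can be sketched as a pure derivation from the registry. This is a hedged illustration with minimal stand-in types (`Route`, `Policy`, `warmup_targets` are hypothetical names, not the real `ixllm` API):

```python
from dataclasses import dataclass

# Stand-in shapes for illustration; real types live in the routing policy model.
@dataclass(frozen=True)
class Route:
    provider: str
    model: str

@dataclass(frozen=True)
class Policy:
    primary: Route
    fallbacks: tuple[Route, ...] = ()

def warmup_targets(policies: dict[str, Policy]) -> set[tuple[str, str]]:
    """Derive the prewarm set from the same registry that serves requests."""
    targets: set[tuple[str, str]] = set()
    for policy in policies.values():
        for route in (policy.primary, *policy.fallbacks):
            targets.add((route.provider, route.model))
    return targets

policies = {
    "intent_classifier": Policy(
        primary=Route("azure", "gpt-4.1-mini"),
        fallbacks=(Route("openai", "gpt-4.1-mini"),),
    ),
    "query_rewriter": Policy(primary=Route("groq_direct", "llama-3.1-8b-instant")),
}
# Every route in the registry (primary and fallback) is prewarmed; nothing else is.
```

Because the prewarm set is computed rather than hand-maintained, the hardcoded warmup matrix cannot drift from live routing.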
Current Structural Problems¶
The current implementation is harder to change than it needs to be because every package that calls an LLM has independently evolved its own model selection, and no two packages do it the same way.
ixchat: fragmented selection with hidden statefulness¶
ixchat has the most routing code and the most routing problems:
- nodes carry their own default model/provider choices
- `ixchat.utils.model_override.get_llm_for_prompt()` applies prompt overrides plus some provider-specific fallback behavior
- `ixchat.service.get_llm()` adds a separate Azure-specific fallback path
- `ixllm.fallback_llm` contains the generic fallback wrapper, but higher layers still decide when and how to use it
- Langfuse prompt metadata can override model/provider today, creating a second config surface for routing policy
The result is that changing one chat use case's model can require edits across node defaults, prompt metadata, service-level override logic, and fallback wrappers.
ixtagging: hardcoded models with no fallback¶
ixtagging has three separate LLM consumers — scorer.py (helpfulness and resolution scoring), translator.py (message translation), and classifier.py (conversation classification) — each calling create_llm_client() directly with hardcoded openai/gpt-4.1-mini. There is:
- no fallback chain: if OpenAI goes down, all scoring, classification, and translation silently fail
- no shared execution policy: each class manages its own timeout and retry behavior
- no single place to change the model: switching from `gpt-4.1-mini` to another model means editing three files independently
ixrag / LightRAG: parallel client construction stack¶
ixrag.lightrag.lightrag_llm duplicates the entire client construction layer outside ixllm. It has its own create_openai_client(), create_azure_openai_client(), and create_groq_client() functions, its own retry wrappers (gpt_openai_with_retry, groq_complete_with_retry), and its own embedding client (AzureOpenAIEmbeddings instantiated directly). Model selection is config-driven through a separate TOML path (config.lightrag.*), not through ixllm. Embeddings and rerank are entirely separate from any shared infrastructure.
Background jobs: inherited fragmentation with no observability¶
Production background jobs (run_classification.py, run_document_loader.py) inherit whatever routing pattern ixtagging or ixrag provides. They have no routing observability — there is no way to know after the fact which model served a scoring or document-processing request, and no fallback if the hardcoded provider goes down.
The common pattern¶
Every consumer group has ended up with its own model selection, its own fallback behavior (or none), and its own construction path. The specific problems differ — ixchat has statefulness and concurrency bugs, ixtagging has no resilience, ixrag has a duplicated construction stack — but the root cause is the same: there is no shared routing layer that all production LLM consumers resolve through.
There is also hidden statefulness that makes routing harder to reason about:
- prompt metadata can silently alter routing policy today
- the request API still exposes a `model` override input today, creating a third live routing surface
- some request paths mutate the shared chatbot instance for tracing
- the chatbot singleton is still used in some paths as the carrier of the "currently active" LLM for the request
- fallback truth is reconstructed in call sites instead of returned by the routing layer
- startup warmup hardcodes a model matrix outside the routing policy itself
This plan removes that ambiguity by making the Python routing config the only live routing source and leaving Langfuse in an observability and prompt-management role only.
Chosen Direction¶
The target architecture is one unified policy registry (assembled from per-package policy definitions), one backend adapter layer, and one resolver API.
The architectural boundary should be:
Call Site
-> resolve_use_case(use_case)
-> uses policy + adapter + capability checks
-> returns workload-specific resolved handle + resolved route metadata
-> invoke/embed/rerank with Langfuse callback handler
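The boundary above can be sketched as a migrated call site. This is a hedged illustration only: `resolve_use_case`, `ResolvedChat`, and the stub routing logic are hypothetical names standing in for the real resolver described in this plan:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ResolvedChat:
    # Already wired with policy, capability checks, and fallback chain.
    invoke: Callable[[dict], Any]
    # Resolved (provider, model) metadata, e.g. for Langfuse trace attributes.
    route: tuple[str, str]

def resolve_use_case(use_case: str) -> ResolvedChat:
    # The real resolver consults the policy registry and backend adapters;
    # this stub just echoes its input to show the call-site shape.
    return ResolvedChat(
        invoke=lambda payload: {"echo": payload},
        route=("azure", "gpt-4.1-mini"),
    )

handle = resolve_use_case("intent_classifier")
result = handle.invoke({"text": "where is my order?"})
```

The call site never names a provider or model; it names a use case and receives a handle plus resolved route metadata.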
This keeps:
- LangChain/LangGraph as the execution abstraction
- Langfuse as the observability abstraction
- provider/model policy resolved through one system (defined per-package, assembled at startup)
It also adds:
- backend-specific construction hidden behind workload adapters
- explicit compatibility checks before invocation
- one place to determine which clients should be prewarmed
Why this direction fits the existing codebase¶
- `ixllm` already owns the reusable client factory and fallback primitives.
- The main pain is that routing decisions are scattered, not that there is no provider abstraction at all.
- The codebase already has distinct runtime domains (chat, embeddings, rerank), so a workload-aware resolver fits the current shape better than a chat-only abstraction.
- The backend also has production batch/API LLM consumers outside `ixchat`, so a shared `ixllm` resolver (with per-package policy definitions assembled at startup) is a better fit than letting each package grow its own model-selection layer.
- This can be migrated one call site at a time without a big-bang rewrite.
What this direction does not require¶
- No immediate switch away from Azure/OpenAI/Cerebras/Groq.
- No architecture-wide OpenRouter migration beyond the already-landed provider support and answer-node fallback usage.
- No immediate LightRAG redesign.
- No requirement to force one identical raw config schema into every layer on day one.
Why OpenRouter Is Not The First Step¶
OpenRouter is already implemented as a supported provider in `ixllm.client_factory` and is already used in one production chat path as the answer node's fallback route. It could become even more useful later, especially for chat.
It is appealing because it can provide:
- model fallback via a `models` list
- provider routing via provider ordering
- provider capability filtering
- an OpenAI-compatible or LangChain-friendly chat interface
That makes it a strong candidate to simplify many chat-generation paths. In a later phase, it could plausibly become the default backend for a large share of standard chat use cases, including:
- answer generation
- intent and classification tasks
- follow-up and answer suggestion tasks
- skill selection
That would reduce a lot of direct provider-specific chat wiring.
However, it is not the right first architectural move because it does not solve the main source of complexity in this codebase.
If you adopt OpenRouter before centralizing policy, the codebase still keeps the main architectural problems:
- node-local defaults
- split override logic
- `ixchat` service special cases
- separate LightRAG construction
- multiple policy surfaces
It also does not solve everything even after adoption:
- it does not replace Azure embeddings
- it does not replace the rerank provider path
- it does not remove the legacy prompt-level override behavior that should disappear from live routing
- it does not remove the need to report requested vs actual route consistently
So the correct posture is:
- do not make OpenRouter the system boundary
- do not use it as a substitute for a routing architecture
- treat the already-landed OpenRouter usage as one backend choice that must eventually be represented behind your own resolver like any other provider
That preserves architectural control while still letting you capture most of the operational upside later.
Policy Model¶
The policy registry is the backbone of the design, but it must do more than just centralize hardcoded tuples.
Core principles¶
- The registry should be keyed by use case, not by provider.
- Use-case policies should be defined by the package that owns the use case (`ixchat`, `ixtagging`, `ixrag`), not centralized in `ixllm`. `ixllm` owns the types, profiles, resolver, and adapters. The app layer assembles per-package policies into the resolver at startup.
- The registry should be the typed default policy layer, not the only policy source.
- Chat, embedding, and rerank should share one conceptual routing model, but not be forced through one runtime shape.
- `groq_direct` should remain a valid primary route for latency-sensitive chat-shaped workloads such as `query_rewriter`.
- Live request routing should come only from the code-owned routing config.
- Non-LLM code fallback should remain representable for cases like query rewriting.
Workload-specific adapters¶
The resolver should dispatch through workload-specific adapters:
- `ChatBackend`
- `EmbeddingBackend`
- `RerankBackend`
This is the missing piece if the goal is "one interface" without relying on OpenRouter as the system boundary.
Each adapter should declare what it supports:
- structured output
- streaming
- tool calling
- supported execution knobs (timeout, retry, failover mode)
- provider-specific request options
- provider-specific invocation strategy when a capability is supported
This matters because a resolver that only returns Any still leaves the codebase dependent on provider quirks. The adapter layer is what makes provider/model swaps predictable.
That last point is important. Capability validation answers "can this route support structured output at all?" but safe provider switching also depends on "what is the correct invocation recipe for this provider?" In the current codebase, follow_up_suggester and answer_suggester already show why this matters: they do not just check whether structured output is available, they also choose json_schema vs json_mode based on provider and append an explicit JSON-format hint for non-OpenAI-style providers. That behavior belongs in the adapter boundary rather than staying embedded in call sites.
The ownership split should be strict:
- the resolver owns business-level execution policy
- adapters translate that policy into provider-specific client options
- `client_factory.py` builds and caches clients from explicit inputs
- fallback wrappers execute an already-resolved fallback chain rather than deciding policy themselves
Adapter layer: V1 vs later¶
The adapter layer should be introduced in V1, but it does not need to become the full provider abstraction immediately.
In V1, the adapter layer should mainly:
- provide an explicit runtime boundary between the resolver and provider construction
- validate capabilities before invocation
- own provider-specific invocation strategy for migrated capabilities such as structured output
- choose the correct low-level provider constructor
- translate resolver-owned execution policy into explicit provider/client arguments
- reuse the existing `ixllm.client_factory` and fallback wrappers where helpful, but only as execution primitives behind the resolver
That means the first implementation can look like:
- `AzureChatBackend.build_runnable(...)` calling `get_cached_llm_client(..., provider="azure")`
- `OpenAIChatBackend.build_runnable(...)` calling `get_cached_llm_client(..., provider="openai")`
- the resolver composing ordered fallback with the existing generic fallback mechanism from explicit policy data
So in V1, the new adapter layer is mostly a formalization of an architectural boundary, not a requirement to rewrite every provider client path. What matters is that migrated paths no longer inherit timeout, retry, or fallback behavior from implicit defaults in client_factory.py or fallback-specific helper factories.
For example, migrated structured-output chat use cases should stop deciding in the node whether they need json_schema or json_mode, and should stop appending provider-specific JSON hints themselves. The resolved handle returned by the resolver should already embody the correct invocation strategy for the selected backend.
One cache rule should also be explicit in V1: cache identity must be derived from the fully translated backend client parameters for a resolved route, not just from provider + model. If a route-defining option changes the underlying client behavior, it must participate in cache identity too. That includes options such as reasoning settings, timeout/retry variants that produce distinct clients, and provider-specific transport options such as OpenRouter upstream selection. By contrast, invocation-only behavior such as structured-output method selection, prompt shaping, callback handlers, or code fallback should remain outside client cache identity.
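The cache rule can be made concrete with a small sketch. The key derivation function is hypothetical; it only illustrates that fully translated client parameters, not just provider and model, participate in cache identity:

```python
import json
from typing import Any

def client_cache_key(provider: str, model: str, client_params: dict[str, Any]) -> str:
    """Derive cache identity from the fully translated backend client params.

    Route-defining options (reasoning settings, timeout/retry variants that
    produce distinct clients, OpenRouter upstream pinning) arrive inside
    client_params and therefore participate in the key. Invocation-only
    behavior (structured-output method, prompt shaping, callbacks) never
    enters client_params, so it stays outside cache identity.
    """
    canonical = json.dumps(client_params, sort_keys=True)
    return f"{provider}:{model}:{canonical}"

# Two routes on the same provider/model but different upstream pinning must
# not share a cached client:
key_pinned = client_cache_key(
    "openrouter", "openai/gpt-5.2", {"provider": {"only": ["azure"]}}
)
key_default = client_cache_key("openrouter", "openai/gpt-5.2", {})
```

Without this rule, the pinned-upstream fallback route and an unpinned route on the same model would silently share one client.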
After V1, the adapter layer should gradually become the real home of provider-specific runtime behavior.
That later evolution should include:
- moving more provider branching out of `client_factory.py` and into explicit backend classes
- centralizing capability handling such as structured output, streaming, tool calling, and provider-specific request options
- normalizing provider exceptions into internal fallback reasons such as `rate_limit`, `timeout`, and `provider_unavailable`
- shrinking or removing helper factories that bundle fixed policy choices such as "Azure with OpenAI rate-limit fallback" once those choices are fully resolver-owned
- optionally returning richer wrappers instead of raw provider clients so actual route metadata and error normalization are captured consistently
In other words:
- V1 introduces the adapter boundary
- post-V1 makes the adapter layer the true abstraction boundary
This distinction is important for scope control. The first implementation should establish the boundary cleanly without forcing a big-bang rewrite of client_factory.py.
Capability validation¶
Without capability checks, "easy switching" becomes "easy to misconfigure."
For example, if intent_classifier requires structured output and someone switches it to:
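A hedged illustration of such a switch, using minimal stand-in types (the real shapes are the `ChatPolicy` dataclasses defined later in this plan, and the target route here is a hypothetical choice):

```python
from dataclasses import dataclass

# Stand-in types for illustration only.
@dataclass(frozen=True)
class GroqDirectChatRoute:
    model: str
    provider: str = "groq_direct"

@dataclass(frozen=True)
class ChatPolicy:
    primary: GroqDirectChatRoute
    structured_output: bool = False

# intent_classifier still declares structured_output=True, but the new
# primary route points at a backend whose structured-output support and
# invocation recipe differ from the original route:
policy = ChatPolicy(
    primary=GroqDirectChatRoute(model="llama-3.1-8b-instant"),
    structured_output=True,
)
```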
then a registry-only system may accept the configuration and fail later at runtime when the call site reaches with_structured_output(...).
With an explicit adapter plus capability contract, the resolver can fail fast during route resolution with a validation error such as:
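A sketch of that fail-fast behavior; `RouteValidationError` and the message wording are illustrative assumptions, not the real `ixllm` API:

```python
class RouteValidationError(Exception):
    """Raised at route resolution time, before any provider call is made."""

def check_capability(use_case: str, capability: str, supported: bool) -> None:
    if not supported:
        raise RouteValidationError(
            f"use case '{use_case}' requires {capability}, "
            f"but the selected route does not support it"
        )

# Resolution fails immediately instead of surfacing later inside
# with_structured_output(...) at request time.
```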
But capability validation alone is not enough for safe provider switching. A provider can support a capability in principle while still requiring a different invocation pattern. The current suggester nodes are a concrete example:
follow_up_suggesteruses structured output today, but switches betweenjson_schemaandjson_modebased on provider and appends a JSON hint for non-OpenAI-style providersanswer_suggesterdoes the same
So a naive switch from azure/gpt-4.1-nano to cerebras/gpt-oss-120b could pass a boolean capability check for structured output while still behaving incorrectly if the call path kept the OpenAI-style invocation recipe. The adapter contract should therefore cover both:
- whether the provider supports the capability
- which invocation strategy should be used when it does
For structured output in V1, that means the chat backend should own at least:
- the structured output method (
json_schemavsjson_mode) - whether prompt shaping is required for that provider, such as appending an explicit JSON-format instruction
- any provider-specific arguments needed to make the capability reliable
This is the difference between "the configuration is valid" and "the provider switch is operationally safe."
Typed workload policies and normalized fallback reasons¶
The registry should not use generic `provider: str` and `model: str` fields as its final shape. That is too permissive for a system that must support chat, embeddings, and rerank reliably.
Bad shape:
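A minimal reconstruction of that shape, matching the generic fields described above:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class LLMPolicy:
    kind: Literal["chat", "embedding", "rerank"]
    provider: str  # any string accepted
    model: str     # any string accepted

# Nothing ties provider/model validity to the workload kind, so the
# nonsensical combinations below type-check and construct without error.
bad = LLMPolicy(kind="embedding", provider="cerebras", model="gpt-oss-120b")
```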
This makes invalid combinations easy to express:
LLMPolicy(kind="embedding", provider="cerebras", model="gpt-oss-120b")
LLMPolicy(kind="chat", provider="cohere", model="rerank-v3.5")
LLMPolicy(kind="rerank", provider="azure", model="gpt-4.1-mini")
Better shape:
from dataclasses import dataclass, field
from typing import Literal
FallbackReason = Literal[
"rate_limit",
"provider_unavailable",
"timeout",
"connection_error",
]
FallbackMode = Literal["immediate", "after_retries", "never"]
StreamingFailover = Literal["before_first_token_only", "never"]
ReasoningEffort = Literal["none", "low", "medium", "high"]
@dataclass(frozen=True)
class RetryPolicy:
max_attempts: int
backoff: Literal["none", "exponential"]
@dataclass(frozen=True)
class TimeoutPolicy:
total_seconds: float
@dataclass(frozen=True)
class OpenAIRouteOptions:
reasoning_effort: ReasoningEffort | None = None
@dataclass(frozen=True)
class OpenRouterRouteOptions:
allowed_upstreams: tuple[Literal["azure", "openai"], ...] | None = None
allow_provider_fallbacks: bool = True
reasoning_effort: ReasoningEffort | None = None
@dataclass(frozen=True)
class AzureChatRoute:
provider: Literal["azure"] = "azure"
model: ChatModel = "gpt-4.1-mini"
@dataclass(frozen=True)
class OpenAIChatRoute:
provider: Literal["openai"] = "openai"
model: ChatModel = "gpt-4.1-mini"
options: OpenAIRouteOptions = field(default_factory=OpenAIRouteOptions)
@dataclass(frozen=True)
class OpenRouterChatRoute:
provider: Literal["openrouter"] = "openrouter"
model: ChatModel = "openai/gpt-5.2"
options: OpenRouterRouteOptions = field(default_factory=OpenRouterRouteOptions)
@dataclass(frozen=True)
class GroqDirectChatRoute:
provider: Literal["groq_direct"] = "groq_direct"
model: ChatModel = "llama-3.1-8b-instant"
ChatRoute = AzureChatRoute | OpenAIChatRoute | OpenRouterChatRoute | GroqDirectChatRoute
@dataclass(frozen=True)
class EmbeddingRoute:
provider: EmbeddingProvider
model: EmbeddingModel
@dataclass(frozen=True)
class RerankRoute:
provider: RerankProvider
model: RerankModel
@dataclass(frozen=True)
class ChatFallback:
route: ChatRoute
on: tuple[FallbackReason, ...]
@dataclass(frozen=True)
class CodeFallback:
kind: Literal["python_pronoun_rewrite"]
on: tuple[FallbackReason, ...]
@dataclass(frozen=True)
class ChatPolicy:
# Exactly one of `primary` or `profile` should be set.
primary: ChatRoute | None = None
profile: str | None = None
fallbacks: tuple[ChatFallback, ...] = ()
code_fallback: CodeFallback | None = None
structured_output: bool = False
streaming: bool = False
execution: "ChatExecutionPolicy | None" = None
@dataclass(frozen=True)
class ChatExecutionPolicy:
timeout: TimeoutPolicy
retry: RetryPolicy
fallback_mode: FallbackMode
streaming_failover: StreamingFailover
@dataclass(frozen=True)
class EmbeddingPolicy:
primary: EmbeddingRoute
@dataclass(frozen=True)
class RerankPolicy:
primary: RerankRoute
on_failure: Literal["return_original_documents", "raise"] = "raise"
This makes invalid cross-workload combinations much harder to express and keeps fallback policy stable even if provider SDK exception classes change. It also makes operational behavior explicit, so switching models or providers does not silently inherit the wrong timeout, retry, or failover behavior from adapter defaults.
It also solves an important gap in the current codebase: some live routes are not fully defined by provider + model alone. The current answer_writer fallback is not just "OpenRouter with openai/gpt-5.2". It is OpenRouter pinned to Azure as the upstream, with OpenRouter-side provider fallback disabled, while the GPT-5 primary route also carries a provider-specific reasoning setting. Those are route-defining choices, not incidental helper behavior, so they should be representable in the policy model itself rather than hidden in factory code.
For example, the current answer-writer route is closer to this:
ChatPolicy(
primary=OpenAIChatRoute(
model="gpt-5.4",
options=OpenAIRouteOptions(
reasoning_effort="none",
),
),
fallbacks=(
ChatFallback(
route=OpenRouterChatRoute(
model="openai/gpt-5.2",
options=OpenRouterRouteOptions(
allowed_upstreams=("azure",),
allow_provider_fallbacks=False,
reasoning_effort="none",
),
),
on=("rate_limit", "provider_unavailable", "timeout", "connection_error"),
),
),
streaming=True,
)
The adapter should then translate those typed route options into provider-specific SDK or transport arguments:
def build_openrouter_client(route: OpenRouterChatRoute) -> Any:
extra_body = {}
if route.options.allowed_upstreams is not None:
extra_body["provider"] = {
"only": list(route.options.allowed_upstreams),
"allow_fallbacks": route.options.allow_provider_fallbacks,
}
if route.options.reasoning_effort is not None:
extra_body["reasoning"] = {"effort": route.options.reasoning_effort}
return get_cached_llm_client(
model=route.model,
provider="openrouter",
extra_body=extra_body or None,
)
The important design rule is:
- typed route options belong in the policy model when they materially define the route's behavior
- adapters translate those typed options into provider-specific request arguments
- raw transport payloads such as `extra_body` should remain an adapter detail, not the public policy shape
CodeFallback is intentionally separate from provider fallbacks. It represents deterministic degraded-mode behavior owned by the resolver, not a second provider route. In V1, it should stay intentionally narrow and solve only the real case that exists today: query rewriter's Python fallback.
The intended contract in V1 is:
- provider/model fallbacks remain part of the ordered `fallbacks` chain
- `code_fallback` is attempted only after the LLM route fails with an allowed normalized fallback reason
- the resolver executes the code fallback, not the call site
- the code fallback runs on the use-case input, not on raw provider-specific prompt strings
- the code fallback returns the same logical output type as the primary use case
- the invocation result reports it as `ActualExecution(target_type="code", ...)`
- prompt metadata may not introduce, replace, or modify a `code_fallback`
- if the code fallback itself fails, the request fails; there is no second fallback layer behind it
This keeps non-LLM fallback logic inside the same routing control plane instead of scattering it across call sites.
The practical scope should be deliberately small:
- keep `CodeFallback` resolver-owned
- support exactly one concrete fallback kind in V1: `python_pronoun_rewrite`
- use it only for `query_rewriter` in V1
- do not introduce a generic plugin/registry system for arbitrary code fallbacks yet
For example, the current intent_classifier route becomes:
ChatPolicy(
primary=AzureChatRoute(model="gpt-4.1-mini"),
fallbacks=(
ChatFallback(
route=OpenAIChatRoute(model="gpt-4.1-mini"),
on=("rate_limit",),
),
),
structured_output=True,
execution=ChatExecutionPolicy(
timeout=TimeoutPolicy(total_seconds=30),
retry=RetryPolicy(max_attempts=0, backoff="none"),
fallback_mode="immediate",
streaming_failover="never",
),
)
And the current query_rewriter route remains explicit about its special-case behavior:
ChatPolicy(
primary=GroqDirectChatRoute(model="llama-3.1-8b-instant"),
code_fallback=CodeFallback(
kind="python_pronoun_rewrite",
on=("timeout", "connection_error", "provider_unavailable"),
),
structured_output=False,
streaming=False,
execution=ChatExecutionPolicy(
timeout=TimeoutPolicy(total_seconds=1),
retry=RetryPolicy(max_attempts=0, backoff="none"),
fallback_mode="never",
streaming_failover="never",
),
)
In this design, query_rewriter remains a chat-shaped use case, but its degraded-mode fallback is still owned by the resolver. The call site continues to call one resolved use case; it does not implement separate Python fallback logic itself.
Concretely, the resolver-owned path should behave like:
- try the configured LLM route for query rewriting
- if the failure reason matches the allowed `CodeFallback.on` reasons, run `python_pronoun_rewrite`
- return the rewritten query and report `ActualExecution(target_type="code", target="python_pronoun_rewrite")`
The current first-turn optimization in query rewriter should remain separate from CodeFallback. It is a fast-path rule ("no history, so don't call the LLM"), not a failure fallback.
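The resolver-owned degraded path described above can be sketched as follows. The function names and the simulated timeout are illustrative; the real pronoun-rewrite logic lives in `ixchat` today:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActualExecution:
    target_type: str  # "llm" or "code"
    target: str

ALLOWED_REASONS = ("timeout", "connection_error", "provider_unavailable")

def python_pronoun_rewrite(query: str, history: list[str]) -> str:
    # Deterministic degraded-mode rewrite; a no-op stand-in here.
    return query

def rewrite_query(query: str, history: list[str]) -> tuple[str, ActualExecution]:
    """Resolver-owned path: LLM route first, code fallback on allowed reasons."""
    try:
        # Stand-in for the configured groq_direct LLM call failing:
        raise TimeoutError("llm route timed out")
    except TimeoutError:
        reason = "timeout"  # normalized fallback reason for this exception
        if reason in ALLOWED_REASONS:
            rewritten = python_pronoun_rewrite(query, history)
            return rewritten, ActualExecution("code", "python_pronoun_rewrite")
        raise
```

The call site receives the same logical output type either way and can log the `ActualExecution` metadata without knowing which path ran.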
Reusable route profiles¶
A use-case registry alone still encourages repetition when many use cases share the same route. If every use case restates the full provider/model/fallback tuple, large switches remain noisy and error-prone.
The safer pattern is:
- keep the registry keyed by use case
- allow a use case to reference a reusable route profile for common routes
- let the use case specialize only what is specific to that workload
Policy ownership: per-package, not centralized in ixllm¶
Use-case policies should be defined by the package that owns the use case, not centralized in ixllm. This keeps the dependency direction clean: consumer packages depend on ixllm for types and the resolver, but ixllm never imports or knows about ixchat, ixtagging, or ixrag.
The ownership split is:
- `ixllm` owns the framework: policy types (`ChatPolicy`, `ChatRoute`, etc.), route profiles (`CHAT_ROUTE_PROFILES`), the resolver, and backend adapters
- each consumer package owns its use-case policies: `ixchat` defines policies for `intent_classifier`, `answer_writer`, etc.; `ixtagging` defines policies for `conversation_classification`, etc.; `ixrag` defines policies for `lightrag_embedding`, etc.
- the app layer assembles all per-package policies into the resolver at startup
For example, shared route profiles live in ixllm:
# ixllm/routing/profiles.py
CHAT_ROUTE_PROFILES = {
"chat.answer_primary": ChatRouteProfile(
primary=OpenAIChatRoute(
model="gpt-5.4",
options=OpenAIRouteOptions(
reasoning_effort="none",
),
),
fallbacks=(
ChatFallback(
route=OpenRouterChatRoute(
model="openai/gpt-5.2",
options=OpenRouterRouteOptions(
allowed_upstreams=("azure",),
allow_provider_fallbacks=False,
reasoning_effort="none",
),
),
on=("rate_limit", "provider_unavailable", "timeout", "connection_error"),
),
),
),
"chat.standard_primary": ChatRouteProfile(
primary=AzureChatRoute(model="gpt-4.1"),
fallbacks=(
ChatFallback(
route=OpenAIChatRoute(model="gpt-4.1"),
on=("rate_limit",),
),
),
),
"chat.fast_small": ChatRouteProfile(
primary=AzureChatRoute(model="gpt-4.1-nano"),
fallbacks=(
ChatFallback(
route=OpenAIChatRoute(model="gpt-4.1-nano"),
on=("rate_limit",),
),
),
),
}
Each consumer package defines its own use-case policies:
# ixchat/routing.py
from ixllm.routing import ChatPolicy, AzureChatRoute, GroqDirectChatRoute, ...
IXCHAT_POLICIES = {
"answer_writer": ChatPolicy(
profile="chat.answer_primary",
streaming=True,
),
"dialog_supervisor": ChatPolicy(
profile="chat.fast_small",
structured_output=True,
),
"intent_classifier": ChatPolicy(
primary=AzureChatRoute(model="gpt-4.1-mini"),
fallbacks=(...),
structured_output=True,
),
"query_rewriter": ChatPolicy(
primary=GroqDirectChatRoute(model="llama-3.1-8b-instant"),
code_fallback=CodeFallback(kind="python_pronoun_rewrite", ...),
),
# ... other ixchat use cases
}
# ixtagging/routing.py
from ixllm.routing import ChatPolicy, OpenAIChatRoute
IXTAGGING_POLICIES = {
"conversation_classification": ChatPolicy(
primary=OpenAIChatRoute(model="gpt-4.1-mini"),
structured_output=True,
),
"message_translation": ChatPolicy(
primary=OpenAIChatRoute(model="gpt-4.1-mini"),
),
# ... other ixtagging use cases
}
# ixrag/routing.py
from ixllm.routing import EmbeddingPolicy, RerankPolicy, ...
IXRAG_POLICIES = {
"lightrag_embedding": EmbeddingPolicy(primary=EmbeddingRoute(...)),
"lightrag_rerank": RerankPolicy(primary=RerankRoute(...)),
# ... other ixrag use cases
}
The app layer assembles them into the resolver at startup:
# app startup (e.g. service.py or app entrypoint)
from ixllm.routing import LLMResolver, CHAT_ROUTE_PROFILES
from ixchat.routing import IXCHAT_POLICIES
from ixtagging.routing import IXTAGGING_POLICIES
from ixrag.routing import IXRAG_POLICIES
resolver = LLMResolver(
profiles=CHAT_ROUTE_PROFILES,
policies={**IXCHAT_POLICIES, **IXTAGGING_POLICIES, **IXRAG_POLICIES},
)
This preserves per-use-case control while making bulk route changes operationally easier. It also keeps the dependency graph clean: ixchat → ixllm (imports types), never the reverse.
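One detail worth guarding at assembly time: a plain `{**a, **b}` merge silently lets a later package shadow an earlier package's use case. A hypothetical `assemble_policies` helper, sketched here as an assumption rather than existing code, could turn collisions into a startup error instead:

```python
from typing import Any


def assemble_policies(*package_policies: dict[str, Any]) -> dict[str, Any]:
    """Merge per-package policy dicts, failing fast on duplicate use-case keys."""
    merged: dict[str, Any] = {}
    for policies in package_policies:
        for use_case, policy in policies.items():
            if use_case in merged:
                raise ValueError(f"duplicate use-case policy: {use_case!r}")
            merged[use_case] = policy
    return merged
```

The app layer would then call, for example, `LLMResolver(profiles=..., policies=assemble_policies(IXCHAT_POLICIES, IXTAGGING_POLICIES, IXRAG_POLICIES))` so a naming collision between packages fails the boot rather than silently rerouting a use case.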
The important point is that the final schema should support both:
- inline route definitions for one-off use cases
- profile references for common shared routes
but it should do so through one explicit, internally consistent policy type rather than two parallel policy shapes.
Resolver-owned policy registry¶
For V1, routing policy should remain code-only. Profiles and use-case policies are enough to centralize ownership and simplify the architecture without adding a second deployment-time control surface or an environment-specific routing layer.
That means the effective V1 policy is built by the resolver from:
- code-defined route profiles (owned by `ixllm`)
- code-defined use-case policies (owned by each consumer package: `ixchat`, `ixtagging`, `ixrag`, etc.)
- assembled into the resolver by the app layer at startup
There is no second request-time routing merge layer in the plan. Live routing does not accept prompt-level, runtime, or pipeline-level model/provider overrides.
The practical approach is:
- code owns the full typed policy model: `ixllm` owns the types and route profiles, each consumer package owns its use-case policies, and the app layer assembles them into the resolver
- Python code also owns the static routing defaults; TOML is not a parallel routing authority in V1
- the deployed routing policy is environment-invariant, so development, staging, and production all resolve the same use-case routes
- `profile` is code-only in V1
- model/provider/fallback selection is resolved from code-owned config only
- the live request API does not accept model/provider overrides as alternate routing inputs
- Langfuse prompt metadata is ignored for live route selection
- prompt metadata may not modify `code_fallback`
- prompt-side `model`/`provider` metadata may continue to exist temporarily in local prompt frontmatter or Langfuse prompt config while unmigrated paths still depend on it, but migrated paths must ignore it
- once the affected use cases are migrated, that prompt-side routing metadata should be removed rather than kept as dead configuration
- if someone wants to test another model for a traced prompt, they should copy the resolved prompt from the trace into Langfuse Playground instead of using a runtime override path
- tests, evaluation, and ad hoc tooling may still mock or select models directly, but that is outside the deployed routing control plane and must not become a second live routing config surface
In other words, the system should share one conceptual policy shape, and the resolver should build one validated internal policy registry from it at startup while keeping all live routing policy ownership in code for the first implementation.
That also means:
- model/provider/fallback routing should be removed from `backend/apps/shared_data/config/*.toml`
- the live routing defaults currently coming from TOML must be copied into per-package Python policy definitions before TOML routing keys are removed
- the migrated per-package Python policies, assembled into the resolver at startup, must become the single deployed routing policy across development, staging, and production rather than a base layer with environment-specific route overrides
- prompt-side routing metadata (`model`/`provider`) should no longer be treated as authoritative once a use case is migrated to the resolver
- during migration, prompt-side routing metadata can remain in place only for compatibility with unmigrated paths
- after migration of the affected use cases, prompt-side routing metadata should be removed from both local prompt frontmatter and Langfuse prompt config
- this convergence step must be explicit because the current Pydantic defaults do not match current TOML-backed runtime values for several routes, including the default chat route and LightRAG defaults
- after that convergence, TOML may still carry non-routing settings, but not model/provider/fallback selection
Internal policy registry¶
The runtime should also have a distinct internal policy registry: the static routing view that the resolver builds from code-owned routing config during initialization.
This is operationally important because:
- startup warmup should use it
- boot-time validation should use it
- debugging tools should be able to print it
The safer pattern is:
- Initialize the resolver from code-defined profiles + use-case policies.
- Let the resolver build one validated internal policy registry from them.
- Warm clients from that registry.
- Resolve requests from that same registry, then report actual execution if a configured fallback was used at runtime.
The startup contract should be explicit:
- warmup must be driven from the resolver's internal policy registry, not from a hand-maintained model list in the app layer
- warmup must include configured primaries and configured fallbacks
- warmup must include any execution-policy variants that produce distinct cached clients, such as retry-mode differences that affect cache identity
- warmup should remain best-effort operational behavior, but the source of truth for what gets warmed is still the resolver's internal policy registry built from code-owned routing config
This is better than today's implementation because today's startup warmup is a manual matrix in the app layer. That approach drifts as soon as routing changes, which is already visible in the current answer-writer path: the live route changed, but the warmup list did not. In the new design, warmup stays aligned automatically because it is derived from the same resolver-owned policy registry that serves requests.
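As a sketch of that derivation, assuming minimal route and policy-entry shapes (`Route` and `PolicyEntry` are illustrative names, not existing types), warmup targets can be collected by walking every configured primary and fallback:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


@dataclass(frozen=True)
class PolicyEntry:
    primary: Route
    fallbacks: tuple[Route, ...] = ()


def warmup_targets(registry: dict[str, PolicyEntry]) -> set[tuple[str, str]]:
    """Collect every (provider, model) pair the resolver could route to."""
    targets: set[tuple[str, str]] = set()
    for entry in registry.values():
        for route in (entry.primary, *entry.fallbacks):
            targets.add((route.provider, route.model))
    return targets


# Illustrative registry mirroring two of the routes described in this plan.
registry = {
    "intent_classifier": PolicyEntry(
        primary=Route("azure", "gpt-4.1-mini"),
        fallbacks=(Route("openai", "gpt-4.1-mini"),),
    ),
    "dialog_supervisor": PolicyEntry(primary=Route("azure", "gpt-4.1-nano")),
}
```

Because the set is derived from the same registry that serves requests, a route change automatically changes what gets warmed; there is no second hand-maintained list to drift.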
Streaming failover semantics¶
Streaming should have an explicit failover rule instead of implicitly behaving like non-streaming invocation.
This is not just a design preference. It is a current correctness bug in the existing fallback wrapper: if the primary provider emits some streamed output and then fails, the fallback provider starts a second fresh generation. That can produce stitched output made of two different runs.
For example, the user might receive stitched output whose first fragment came from the primary provider and whose second fragment came from the fallback provider restarting the answer.
For V1, the recommended rule is:
- allow fallback only before the first token is emitted
- after the first token, do not switch providers mid-stream
- if the primary stream fails after the first token, surface the failure upward rather than splicing in fallback output
This avoids mixing partial output from one provider with continuation from another provider that does not share the same generation state.
Target Resolver Design¶
The main new abstraction should be one resolver in ixllm, not a new provider wrapper everywhere.
In simple terms:
- the policy states the routing rule
- the resolver applies that rule for one use case and one request
- the backend adapter and client factory build the actual provider client underneath
So the resolver is the runtime layer that turns "this use case should use Azure with this fallback and this execution behavior" into "here is the exact thing to invoke for this request." That is why this abstraction exists. Without it, every call site has to keep re-implementing model selection, fallback construction, capability handling, and tracing metadata in slightly different ways.
The code in this section is intentionally pseudo code. It is here to illustrate the shape of the resolver, what layer should own it, and where it should be called from. Exact class names, method names, and module boundaries can differ in the implementation as long as the ownership model stays the same.
Minimal API surface¶
# The app layer assembles per-package policies into the resolver at startup.
resolver = LLMResolver(
profiles=CHAT_ROUTE_PROFILES,
policies={**IXCHAT_POLICIES, **IXTAGGING_POLICIES, **IXRAG_POLICIES},
)
resolved = resolver.resolve_use_case(
use_case="intent_classifier",
)
result = await resolved.invoke(messages)
For non-chat workloads, the same resolver returns a different resolved handle:
resolved = resolver.resolve_use_case("lightrag_embedding")
embedding_result = await resolved.embed(texts)
resolved = resolver.resolve_use_case("lightrag_rerank")
rerank_result = await resolved.rerank(query, documents)
In practice, this resolver remains owned by ixllm, but each consumer package receives it through a different injection pattern. The app layer assembles the resolver at startup; each package uses whatever injection mechanism fits its runtime shape.
ixchat: request context through the graph¶
ixchat.service passes the resolver into the graph's request context. Nodes access it from there:
# ixchat service.py pseudo code
request_context = ChatRequestContext(
chatbot=chatbot,
resolver=llm_resolver,
)
graph.invoke(..., context=request_context)
# ixchat node pseudo code
resolved = request_context.resolver.resolve_use_case(
use_case="intent_classifier",
)
result = await resolved.invoke(messages)
ixtagging: constructor injection¶
TaggingService currently lazily creates HelpfulnessScorer(), ConversationClassifier(), and MessageTranslator() with no-arg constructors. After migration, the resolver is passed at construction time:
# ixtagging/service.py pseudo code
class TaggingService:
    def __init__(self, resolver: LLMResolver, ...):
        self._resolver = resolver
        self._classifier: ConversationClassifier | None = None
    def _get_classifier(self) -> ConversationClassifier:
        if self._classifier is None:
            self._classifier = ConversationClassifier(resolver=self._resolver)
        return self._classifier
# ixtagging/classifier.py pseudo code — replaces create_llm_client() call
resolved = self._resolver.resolve_use_case("conversation_classification")
result = await resolved.invoke(messages)
ixrag / LightRAG: function factory capture¶
LightRAG uses callback-style functions (create_llm_function(), embedding_func_with_retry()) rather than class-based call sites. The resolver is captured by these function factories at construction time, replacing the direct client construction inside lightrag_llm.py:
# ixrag/lightrag/lightrag_llm.py pseudo code
def create_llm_function(resolver: LLMResolver) -> Callable:
async def llm_func(prompt: str, ...) -> str:
resolved = resolver.resolve_use_case("lightrag_retrieval")
result = await resolved.invoke(messages)
return result.output
return llm_func
# ixrag embedding pseudo code
def create_embedding_function(resolver: LLMResolver) -> Callable:
async def embed(texts: list[str]) -> list[list[float]]:
resolved = resolver.resolve_use_case("lightrag_embedding")
result = await resolved.embed(texts)
return result.vectors
return embed
Background jobs: app-layer assembly¶
Jobs like `run_classification.py` currently call `get_tagging_service()`, which returns a `TaggingService` with no routing awareness. After migration, the job entrypoint assembles the resolver and passes it through:
# jobs/tagging/run_classification.py pseudo code
from ixllm.routing import LLMResolver, CHAT_ROUTE_PROFILES
from ixtagging.routing import IXTAGGING_POLICIES
resolver = LLMResolver(
profiles=CHAT_ROUTE_PROFILES,
policies=IXTAGGING_POLICIES,
)
service = get_tagging_service(resolver=resolver)
The job itself never touches routing. It just passes the assembled resolver to the service layer.
Key placement rules¶
- `service.py` should not choose provider/model/fallback itself
- `IXChatbot` should not store the active request route in `chatbot.llm`
- `ixtagging` components should not call `create_llm_client()` directly
- `lightrag_llm.py` should not duplicate client construction from `ixllm`
- nodes, services, and function factories should ask `ixllm` to resolve the route for the specific use case they are about to execute
In practical terms, the new system should replace this pattern:
llm, original_llm, model_name, provider_name = get_llm_for_prompt(...)
chatbot.llm = llm
try:
result = await some_node_logic(chatbot.llm, ...)
finally:
chatbot.llm = original_llm
with this pattern:
resolved = request_context.resolver.resolve_use_case(
use_case="answer_writer",
)
result = await resolved.invoke(messages)
The difference is architectural, not cosmetic:
- in the old pattern, the chosen route is stored on a shared chatbot object
- in the new pattern, the chosen route belongs to one request only
- the chatbot remains a shared site-scoped container, but the active LLM route is request-scoped
Suggested return shapes¶
@dataclass(frozen=True)
class DeclaredRoute:
kind: Literal["chat", "embedding", "rerank"]
primary_target: str
fallback_targets: tuple[str, ...] = ()
@dataclass(frozen=True)
class ActualExecution:
target_type: Literal["provider", "code", "degraded"]
target: str
fallback_triggered: bool
@dataclass
class ResolvedChatUseCase:
runnable: Any
declared_route: DeclaredRoute
execution: ChatExecutionPolicy
capabilities: dict[str, bool]
metadata: dict[str, Any]
@dataclass
class ChatInvocationResult:
output: Any
declared_route: DeclaredRoute
actual_execution: ActualExecution
metadata: dict[str, Any]
@dataclass
class ResolvedEmbeddingUseCase:
declared_route: DeclaredRoute
metadata: dict[str, Any]
@dataclass
class EmbeddingInvocationResult:
vectors: list[list[float]]
declared_route: DeclaredRoute
actual_execution: ActualExecution
metadata: dict[str, Any]
@dataclass
class ResolvedRerankUseCase:
declared_route: DeclaredRoute
metadata: dict[str, Any]
@dataclass
class RerankInvocationResult:
documents: list[Any]
declared_route: DeclaredRoute
actual_execution: ActualExecution
metadata: dict[str, Any]
This two-phase shape is important. Declared route information is available when the use case is resolved. Actual execution information is only knowable after the call completes. The shared routing metadata is consistent across workloads, but chat, embeddings, and rerank keep workload-appropriate method and result shapes.
Concrete example: resolve vs invoke¶
For intent_classifier, the resolved route can say:
declared_route.kind = chat
declared_route.primary_target = azure/gpt-4.1-mini
declared_route.fallback_targets = (openai/gpt-4.1-mini,)
But only the invocation result can truthfully say:
actual_execution.target_type = provider
actual_execution.target = openai/gpt-4.1-mini
actual_execution.fallback_triggered = true
if Azure rate-limits and the request falls back.
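Using the return shapes suggested above, that resolve-versus-invoke split can be written as a small runnable check (the values are illustrative):

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class DeclaredRoute:
    kind: Literal["chat", "embedding", "rerank"]
    primary_target: str
    fallback_targets: tuple[str, ...] = ()


@dataclass(frozen=True)
class ActualExecution:
    target_type: Literal["provider", "code", "degraded"]
    target: str
    fallback_triggered: bool


# Known at resolution time, before any provider call happens.
declared = DeclaredRoute(
    kind="chat",
    primary_target="azure/gpt-4.1-mini",
    fallback_targets=("openai/gpt-4.1-mini",),
)

# Known only after the call: Azure rate-limited, so the fallback served it.
served = "openai/gpt-4.1-mini"
actual = ActualExecution(
    target_type="provider",
    target=served,
    fallback_triggered=(served != declared.primary_target),
)
```

The call site never infers fallback truth from shared wrapper state; it reads `actual_execution` off the invocation result for exactly this request.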
What still calls the provider API¶
The new policy layer does not remove the need for provider-specific calling code. It changes where that code lives and who chooses when to use it.
The intended runtime layering runs from policy, through the resolver and a backend adapter, down to the provider client. That means:
- the policy says which route should be used
- the resolver decides which backend adapter to use for this request
- the backend adapter builds the correct provider client
- the provider client is what actually calls Azure, OpenAI, Cerebras, Groq, and so on
This is an important distinction. The goal is not to eliminate provider-specific clients. The goal is to stop re-implementing provider selection and fallback wiring in many call sites.
In the current codebase, this low-level provider construction already exists in ixllm.client_factory:
- `create_llm_client(...)` dispatches by provider
- `azure` maps to `create_azure_openai_client(...)`
- `openai` maps to `create_openai_client(...)`
- `cerebras` maps to `create_cerebras_client(...)`
- `groq` maps to `create_groq_client(...)`
- `groq_direct` maps to `create_direct_groq_client(...)`
That code remains useful. In V1, the resolver can call into the existing factory rather than replacing it immediately.
The architectural change is:
- call sites stop deciding which provider constructor to call
- `ixllm` owns that decision centrally
- the factory becomes a low-level implementation detail behind the resolver
Concrete layering example: intent_classifier¶
The use-case policy stays declarative, defined in the consumer package that owns the use case:
# ixchat/routing.py
IXCHAT_POLICIES = {
"intent_classifier": ChatPolicy(
primary=ChatRoute(provider="azure", model="gpt-4.1-mini"),
fallbacks=(
ChatFallback(
route=ChatRoute(provider="openai", model="gpt-4.1-mini"),
on=("rate_limit",),
),
),
structured_output=True,
),
}
The resolver uses that policy to choose the correct backend:
resolved = resolver.resolve_use_case("intent_classifier")
result = await resolved.invoke(messages)
Under the hood, the intended flow is:
resolve_use_case("intent_classifier")
-> load resolved internal policy entry for intent_classifier
-> choose AzureChatBackend
-> AzureChatBackend.build_runnable(...)
-> get_cached_llm_client(model="gpt-4.1-mini", provider="azure")
-> create_llm_client(...)
-> create_azure_openai_client(...)
-> AzureChatOpenAI(...)
-> .ainvoke(...)
If fallback is configured, the resolver then wraps the primary and fallback clients with ordered fallback behavior, using the same idea as the existing generic fallback wrapper.
So:
- yes, there is still one provider-specific client path per provider family
- no, there should not be one provider-specific path per use case
- and usually there does not need to be one special client per model, because the model is mainly a parameter passed into the provider client
This is the real meaning of "one interface" in this plan: every call site uses one internal resolver API, while ixllm still hides provider-specific SDK details underneath.
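The dispatch idea can be sketched as a provider-keyed factory table. The constructors below are stand-ins for the real ones in `ixllm.client_factory`, and the returned dicts are placeholders for real client objects:

```python
from typing import Any, Callable


# Stand-in constructors; the real ones build provider SDK clients.
def create_azure_openai_client(model: str, **kwargs: Any) -> dict:
    return {"family": "azure", "model": model}


def create_openai_client(model: str, **kwargs: Any) -> dict:
    return {"family": "openai", "model": model}


_PROVIDER_FACTORIES: dict[str, Callable[..., dict]] = {
    "azure": create_azure_openai_client,
    "openai": create_openai_client,
}


def build_client(provider: str, model: str, **kwargs: Any) -> dict:
    """One dispatch point per provider family; the model is just a parameter."""
    try:
        factory = _PROVIDER_FACTORIES[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return factory(model=model, **kwargs)
```

This is why there is one client path per provider family but not per model: adding `gpt-4.1-nano` costs nothing, while adding a new provider family means one new factory entry behind the resolver.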
Resolver responsibilities¶
- Build one validated internal policy registry from code-owned profiles plus use-case policies during initialization.
- Look up the already-resolved policy entry for the requested use case in that registry.
- Validate that the chosen backend can support the requested use case.
- Translate the resolved execution policy into concrete backend/client parameters.
- Construct the primary client through the correct backend adapter.
- Construct and wrap the ordered fallback chain if configured.
- Own the use case's execution policy end to end, including timeout, retry budget, fallback mode, fallback triggers, and streaming failover semantics.
- Return stable declared route metadata at resolution time.
- Return stable actual execution metadata at invocation time so call sites stop inferring fallback truth themselves.
In this design, execution policy is owned by the resolver, not by incidental defaults in lower layers. The adapter layer translates explicit policy into provider-specific client parameters, while low-level factories and wrappers remain implementation details for client construction, connection reuse, and invocation.
What should move out of call sites¶
- hardcoded `default_model`
- hardcoded `default_provider`
- provider-specific fallback construction
- per-node knowledge of whether a fallback wrapper is required
- manual fallback truth reconstruction for observability
- startup-specific knowledge of which clients should be warmed
- hidden execution defaults such as timeouts, retry budgets, and streaming failover behavior
Design constraints¶
- Resolution should be request-scoped and non-mutating.
- Resolving a route should not override `chatbot.llm` on a shared instance.
- `IXChatbot` should become a site-scoped container, not the carrier of per-request LLM state.
- Live routing should come only from the code-owned routing config; prompt metadata and other request-level override inputs must not alter route selection.
- The resolver should separate route resolution from invocation because effective provider/model is only knowable after the call completes.
Why request scope is phase 1¶
This is not a nice-to-have. It should be the first implementation step.
Today the codebase caches one chatbot instance per site, and that chatbot can hold a shared fallback wrapper for its default LLM. If two requests for the same site run concurrently:
- Request A starts with declared route `azure/gpt-4.1-mini`
- Azure rate-limits, so the wrapper falls back to `openai/gpt-4.1-mini`
- The wrapper updates its internal "last used" state
- Before Request A logs tracing metadata, Request B uses the same wrapper
- Request B succeeds on Azure and overwrites the wrapper state
Now Request A can incorrectly report that Azure was used when OpenAI actually served the response.
This is why the new system should resolve a fresh route object per request, even if it reuses cached underlying provider clients internally.
There is a second version of the same problem that is even easier to picture:
- A shared chatbot instance for one site starts with `chatbot.llm = azure/gpt-4.1`
- Request A arrives and its prompt metadata selects `openai/gpt-5.4`
- The old override pattern temporarily assigns `chatbot.llm = openai/gpt-5.4`
- Before Request A finishes, Request B for the same site starts
- Request B reads `chatbot.llm` while it still contains Request A's temporary override
- Request B can now run with the wrong model, or restore the wrong value when it finishes
This is the core reason the new system should not try to "safely mutate" chatbot.llm. It should stop using chatbot.llm as the carrier of request routing state.
The intended ownership model is:
- `IXChatbot` owns site-scoped shared resources such as graph, retriever, memory, and service access
- the resolver owns route selection
- the resolved use-case handle owns the active LLM route for exactly one request
Importantly, this remains necessary even if the LEGACY answer path is deprecated soon. Removing the LEGACY path removes one explicit chatbot.llm override/restore cycle, but it does not eliminate:
- service-level request overrides that still mutate `chatbot.llm` before graph execution
- shared fallback wrapper state on cached chatbot instances
So legacy deprecation reduces how much migration effort should be spent on the old path, but it does not remove the need for request-scoped route ownership in the NEW system.
The first migration step should therefore be:
- resolve a fresh route object per request
- reuse cached provider clients internally where appropriate
- return actual execution information from invocation results instead of shared wrapper state
LangChain And Langfuse In This Design¶
This direction keeps the current Langfuse and LangChain/LangGraph integration model intact.
LangChain / LangGraph¶
LangChain and LangGraph remain the execution abstraction. The new resolver decides what route to use; it does not replace the runtime composition model.
init_chat_model may still be useful later as an implementation detail to reduce branching in client_factory.py, but it is not the system boundary and not the architectural answer by itself.
Langfuse¶
Langfuse remains the observability layer:
- call sites still invoke models with Langfuse callback handlers
- the resolver returns stable declared route metadata at resolution time
- the invocation result returns stable actual execution metadata after the call
The main improvement is that call sites no longer need to reconstruct routing truth themselves.
This matters even more if OpenRouter is added later, because OpenRouter may handle some provider fallback internally. In that case, the resolver should normalize response metadata into the same declared-route / actual-execution shape used elsewhere.
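A sketch of that normalization, assuming OpenRouter reports the upstream model it actually served in its response metadata (the `model` key and the shapes here are assumptions, not confirmed API fields):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActualExecution:
    target_type: str
    target: str
    fallback_triggered: bool


def normalize_openrouter_execution(
    response_meta: dict, declared_primary: str
) -> ActualExecution:
    """Fold provider-side fallback info into the shared actual-execution shape."""
    # Assumed metadata key; the real field name depends on the OpenRouter response.
    served_model = response_meta["model"]
    target = f"openrouter/{served_model}"
    return ActualExecution(
        target_type="provider",
        target=target,
        fallback_triggered=(target != declared_primary),
    )


# e.g. declared primary was direct OpenAI, but OpenRouter served the request.
execution = normalize_openrouter_execution(
    {"model": "openai/gpt-5.2"}, declared_primary="openai/gpt-5.4"
)
```

Whatever the real metadata fields turn out to be, the point is that tracing consumers only ever see the declared-route / actual-execution shape, regardless of which layer performed the fallback.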
Warmup and tracing should follow the same rule:
- startup warmup should consume the resolver's internal policy registry
- request tracing should consume the request-scoped resolved route plus invocation result
Langfuse-driven model or provider overrides exist today, but they are not part of the target architecture. The rule for live traffic should be simple:
- the configured route for a use case comes from the code-owned routing config
- Langfuse prompt metadata does not change model, provider, or fallback selection
- prompt metadata does not modify `code_fallback`
- there is no separate runtime override path in the request pipeline
- the request API itself does not expose a model/provider override escape hatch
During migration, this should be handled as a compatibility transition rather than a second routing mode:
- local prompt frontmatter and Langfuse prompt config may still contain `model`/`provider` metadata for unmigrated paths
- migrated paths ignore those fields completely for route selection
- once those paths are migrated, the obsolete prompt-side routing metadata should be removed instead of left behind as dead config
Replay and evaluation workflows are a separate concern. If someone wants to try another model for an existing trace, they should copy the resolved prompt from that trace into Langfuse Playground and run the experiment there. That should not introduce a second routing control surface into the production pipeline.
Mapping Current Use Cases Into The Target Registry¶
This maps the current usage audit into the proposed shape. It is intentionally direct, so the first migration can preserve current behavior.
Prompt-driven model/provider overrides may exist in current code, but they are not part of the target architecture for any use case below.
| Use Case | Current Primary | Current Fallback | Kind | Notes |
|---|---|---|---|---|
| `answer_writer` | `openai/gpt-5.4` | `openrouter/openai/gpt-5.2` on provider outage / timeout / rate limit | chat | Current default route is hardcoded in `answer.py`, not inherited from chatbot config |
| `redirect_handler` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Same behavior as today |
| `booking_handler` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Same behavior as today |
| `intent_classifier` | `azure/gpt-4.1-mini` | `openai/gpt-4.1-mini` on rate limit | chat | Structured output |
| `interest_signals_detector` | `cerebras/gpt-oss-120b` | `azure/gpt-4.1-mini` on 503/429 | chat | Structured output, provider-specific today |
| `skill_selector` | `cerebras/gpt-oss-120b` | `azure/gpt-4.1-mini` on configured errors | chat | Already closest to config-driven routing |
| `visitor_profiler` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Structured output |
| `profile_extractor` | `azure/gpt-4.1-mini` | `openai/gpt-4.1-mini` on rate limit | chat | Structured output |
| `follow_up_suggester` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Structured output wrapper |
| `answer_suggester` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Structured output wrapper |
| `dialog_supervisor` | `azure/gpt-4.1-nano` | `openai/gpt-4.1-nano` on rate limit | chat | Structured output |
| `query_rewriter` | `groq_direct/llama-3.1-8b-instant` | Python pronoun rewrite | chat | V1's only planned `CodeFallback`; keep the existing first-turn fast path separate from failure fallback |
| `conversation_classification` | `openai/gpt-4.1-mini` | None today | chat | `ixtagging`; structured output against the site taxonomy |
| `message_translation` | `openai/gpt-4.1-mini` | None today | chat | `ixtagging`; batch translation of non-English conversations before classification |
| `conversation_helpfulness_scoring` | `openai/gpt-4.1-mini` | None today | chat | `ixtagging`; structured output scoring task |
| `conversation_resolution_scoring` | `openai/gpt-4.1-mini` | None today | chat | `ixtagging`; structured output scoring task |
| `lightrag_retrieval` | `azure/gpt-4.1-nano` | None today | chat | Good candidate for later alignment with standard resolver |
| `lightrag_processing` | `azure/gpt-4.1-mini` | None today | chat | Same as above |
| `lightrag_embedding` | `azure/text-embedding-3-small` | None today | embedding | Separate resolver branch, not `BaseChatModel` |
| `lightrag_rerank` | `cohere/rerank-v3.5` | Return original docs | rerank | Non-chat domain, keep separate type |
Migration Sequence¶
The migration should be incremental and should follow the architecture, not the other way around.
Phase 0: source-of-truth convergence¶
- Encode the current live routing behavior in per-package Python policy definitions (e.g. `ixchat/routing.py`, `ixtagging/routing.py`, `ixrag/routing.py`) and Python config defaults before introducing the resolver cutover.
- This convergence must include the live `answer_writer` route (`openai/gpt-5.4` with `openrouter/openai/gpt-5.2` fallback), the standard Azure/OpenAI chat routes, the current LightRAG routes, the skills route, the query rewriter route, and the current `ixtagging` routes for classification, translation, and scoring.
- Remove model/provider override fields from live request models and stop accepting request-time routing overrides through the API layer.
- Remove model/provider/fallback routing keys from `backend/apps/shared_data/config/*.toml` once Python reflects the live behavior.
- Stop treating prompt-side `model`/`provider` metadata as routing authority for migrated use cases. During the transition, keep that metadata only where unmigrated paths still depend on it.
- Keep TOML only for non-routing application settings after this phase.
This phase exists to avoid an accidental behavior change during the architecture migration. Today the Python defaults and the TOML-backed runtime values are not identical, so simply deleting TOML routing keys would change live behavior before the new resolver is even in place. The intended end state is not "base policy plus per-environment route tweaks"; it is one deployed routing policy, validated in lower environments and then promoted unchanged.
Phase 1: request-scoped resolver foundation¶
- Introduce request-scoped route resolution and invocation results.
- Stop relying on shared mutable chatbot state for routing truth. In migrated paths, nodes should not assign to `chatbot.llm`; they should invoke a request-scoped resolved use case directly.
- Keep existing provider clients and fallback wrappers where helpful instead of fully redesigning every backend immediately, but use them only behind resolver-owned policy.
- Make execution policy explicit for migrated routes, including timeout, retry, fallback mode, fallback triggers, and streaming failover behavior.
- Treat the current mid-stream fallback behavior in `FallbackLLM.astream` as a real bug to fix during this phase. The wrapper must not switch to a fallback provider after any response chunk has already been emitted.
- Stop introducing new `chatbot.llm` mutation in the NEW system. The active route for a request should live in the resolved use-case handle, not on the shared chatbot instance.
- Remove request-time model override handling from the production query pipeline rather than carrying it forward behind the resolver.
- Do not carry prompt-level or runtime model/provider override behavior into the new resolver. Live route selection should come only from the code-owned routing config.
This phase is about correctness under concurrency.
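The phase's two key mechanics can be sketched together: an explicit execution policy resolved per request, and a streaming wrapper that refuses to fail over once any chunk has been emitted. Every name here (ExecutionPolicy, InvocationResult, stream_with_failover) is a hypothetical shape, not the real ixllm API:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable

# Hypothetical Phase 1 sketch: explicit execution policy and a streaming
# failover rule that never switches routes mid-stream.

@dataclass(frozen=True)
class ExecutionPolicy:
    timeout_s: float
    max_retries: int
    fallback_order: tuple[str, ...]   # ordered backend route names
    stream_failover: bool             # allowed only before the first chunk

@dataclass(frozen=True)
class InvocationResult:
    declared_route: str               # what the policy asked for
    executed_route: str               # what actually ran (may be a fallback)
    output: str

async def stream_with_failover(
    routes: dict[str, Callable[[], AsyncIterator[str]]],
    policy: ExecutionPolicy,
) -> AsyncIterator[str]:
    """Try routes in order, but never switch after a chunk was emitted."""
    emitted = False
    for name in policy.fallback_order:
        try:
            async for chunk in routes[name]():
                emitted = True
                yield chunk
            return
        except Exception:
            if emitted or not policy.stream_failover:
                raise  # mid-stream provider switching is the bug to remove
    raise RuntimeError("all routes failed before emitting output")
```

A primary that fails before its first chunk falls through to the fallback; a primary that fails after emitting propagates the error instead of splicing a second provider's output into the stream.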
Phase 2: migrate the standard ixchat chat paths¶
The best initial wave is the ixchat chat-model use cases that already rely on get_llm_for_prompt() or service.get_llm():
- intent_classifier
- profile_extractor
- follow_up_suggester
- answer_suggester
- dialog_supervisor
- visitor_profiler
- redirect_handler
- booking_handler
- answer_writer
These can move first because they already depend on the main ixllm client path and mostly differ only by:
- use-case name
- structured vs non-structured output
- whether the route uses a shared standard profile or a use-case-specific primary/fallback pair such as answer_writer
Within this wave, follow_up_suggester and answer_suggester are especially good first proofs for the adapter boundary because they already require provider-specific structured-output handling. Migrating them through the resolver should demonstrate that adapters own the invocation strategy itself, not just a yes/no capability check.
For these migrated paths, nodes should resolve and invoke a request-scoped ResolvedChatUseCase directly instead of reading from or writing to chatbot.llm.
This is the point where IXChatbot starts becoming a site-scoped container (retriever, memory, graph, service access, site metadata) rather than the holder of the request's active LLM route.
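The migrated node shape can be sketched as follows, with Resolver and ResolvedChatUseCase as hypothetical stand-ins for the resolver surface this plan proposes:

```python
from dataclasses import dataclass

# Hypothetical Phase 2 sketch: a node resolves and invokes a request-scoped
# handle instead of reading or writing chatbot.llm. Names are illustrative.

@dataclass(frozen=True)
class ResolvedChatUseCase:
    use_case: str
    route: str  # e.g. "azure_openai/gpt-4o"

    def invoke(self, prompt: str) -> str:
        # The real handle would call a backend adapter; stubbed here so the
        # ownership shape is visible without provider wiring.
        return f"[{self.route}] {prompt}"

class Resolver:
    """Owns routing policy; hands out request-scoped handles, mutates nothing."""
    def __init__(self, policies: dict[str, str]):
        self._policies = policies

    def resolve(self, use_case: str) -> ResolvedChatUseCase:
        return ResolvedChatUseCase(use_case, self._policies[use_case])

def intent_classifier_node(resolver: Resolver, user_message: str) -> str:
    # Before: chatbot.llm = make_llm(...); chatbot.llm.invoke(...)
    # After: resolve per request and invoke the handle directly.
    handle = resolver.resolve("intent_classifier")
    return handle.invoke(user_message)
```

Because the handle is created per request, two concurrent requests can never observe each other's fallback state, which is the concurrency property Phase 1 establishes.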
Phase 3: migrate the specialized chat paths¶
- interest_signals_detector
- skill_selector
These are slightly more specialized because they rely on non-Azure primaries plus explicit custom fallback rules.
If the LEGACY system is still present during this phase, treat it as a compatibility path rather than the target shape:
- avoid further architectural investment in legacy-only chatbot.llm mutation patterns
- keep any remaining legacy mutation localized until the path is removed
- do not let legacy compatibility drive the ownership model of the new resolver
Phase 4: migrate the special-case nonstandard paths¶
- query_rewriter
- LightRAG text generation
These are special because they either use groq_direct or a parallel LightRAG-specific stack.
For query_rewriter, Phase 4 should preserve the current behavior shape but move ownership into the resolver:
- keep the groq_direct primary route
- keep the first-turn Python fast path as a call-site or use-case optimization
- move the current Python-on-LLM-error behavior into resolver-owned CodeFallback
- do not generalize CodeFallback beyond python_pronoun_rewrite in this phase unless a second real use case appears
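A minimal sketch of resolver-owned CodeFallback, keeping the plan's terminology (CodeFallback, python_pronoun_rewrite, groq_direct) but with the rewrite logic itself as a stand-in:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: on LLM error, the resolver runs a registered Python
# function instead of another model, and reports the executed route honestly.

@dataclass(frozen=True)
class CodeFallback:
    name: str
    fn: Callable[[str], str]

def python_pronoun_rewrite(query: str) -> str:
    # Stand-in for the real Python rewrite heuristic.
    return query

def rewrite_query(query: str, llm_call: Callable[[str], str],
                  code_fallback: CodeFallback) -> tuple[str, str]:
    """Returns (rewritten_query, executed_route) so observability can report
    declared route and actual execution separately."""
    try:
        return llm_call(query), "groq_direct"
    except Exception:
        return code_fallback.fn(query), f"code:{code_fallback.name}"
```

Returning the executed route alongside the output is what lets tracing distinguish a real groq_direct call from graceful degradation into Python code.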
Phase 5: migrate production batch/API LLM workloads¶
- ixtagging conversation classification
- ixtagging message translation
- ixtagging helpfulness scoring
- ixtagging resolution scoring
These should join the same routing control plane even though they are not part of the request-time chat graph. They are production backend behavior, they persist derived analytics data, and today they still bypass shared routing by calling create_llm_client(...) directly.
Once the ixchat resolver surface is stable, these are a good next wave because they do not need request-scoped graph plumbing or streaming semantics, but they should still resolve model/provider policy through ixllm so production backend behavior does not keep a second hardcoded routing system.
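The batch shape is deliberately simpler than the chat shape: same registry, no request-scoped graph plumbing. A sketch, with the registry shape and all names besides create_llm_client being illustrative:

```python
from dataclasses import dataclass

# Hypothetical Phase 5 sketch: an ixtagging batch job resolves its route from
# the shared registry instead of calling create_llm_client(...) directly.

@dataclass(frozen=True)
class Route:
    provider: str
    model: str

REGISTRY: dict[str, Route] = {
    # chat and batch use cases share one routing control plane
    "answer_writer": Route("azure_openai", "gpt-4o"),
    "conversation_classification": Route("azure_openai", "gpt-4o-mini"),
    "helpfulness_scoring": Route("azure_openai", "gpt-4o-mini"),
}

def classify_batch(conversations: list[str]) -> list[tuple[str, str]]:
    route = REGISTRY["conversation_classification"]
    # Before: client = create_llm_client(...) hardcoded inside the job.
    # After: provider/model come from the shared registry, so a routing
    # change is a policy edit; the f-string stands in for the real model call.
    return [(c, f"label-from-{route.model}") for c in conversations]
```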
For V1, it should be explicit that evaluation and ad hoc tooling are not part of this migration wave:
- ixevaluation
- CLI regenerate flows
- other internal tooling where direct model selection is intentional
Those flows can continue to choose models directly or use mocks/stubs for tests, but they should remain clearly outside the deployed routing control plane. They are not a reason to preserve environment-specific live routing for the same production use case.
Phase 6: migrate non-chat workloads¶
- embeddings
- rerank
These should share the same policy registry, but not necessarily the same runnable-construction code path as chat models.
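One way to sketch "shared registry, separate construction" is a workload tag on each policy plus a capability check at resolution time, so a chat call site cannot accidentally resolve an embedding route. All names here are hypothetical:

```python
from typing import Protocol

# Hypothetical Phase 6 sketch: chat, embeddings, and rerank share one policy
# registry but bind to workload-specific adapters.

class ChatAdapter(Protocol):
    def complete(self, prompt: str) -> str: ...

class EmbeddingAdapter(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

POLICIES = {
    "answer_writer": {"workload": "chat", "model": "gpt-4o"},
    "doc_embedding": {"workload": "embedding", "model": "text-embedding-3-small"},
    "passage_rerank": {"workload": "rerank", "model": "rerank-v3"},
}

def validate_capability(use_case: str, expected_workload: str) -> dict:
    policy = POLICIES[use_case]
    # Capability validation at resolution time, before any client is built.
    if policy["workload"] != expected_workload:
        raise ValueError(
            f"{use_case} is a {policy['workload']} route, not {expected_workload}"
        )
    return policy
```

The Protocol classes make the split explicit: each workload gets its own adapter interface and runnable-construction path, while the registry keys and validation logic are shared.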
Phase 7: optional backend simplification¶
After the resolver and policy model are stable:
- Use init_chat_model if it helps simplify internal factory branching.
- Formalize the already-implemented openrouter provider as an explicit ChatBackend behind the resolver instead of leaving it as answer-node-specific wiring.
- Migrate selected chat use cases beyond the already-landed answer-node fallback path to OpenRouter where the operational tradeoff makes sense.
At that point, OpenRouter can become the backbone for many chat use cases if it proves operationally useful, but it should still remain one backend option behind your own routing layer, not the solution to every workload.
This phase is explicitly out of scope for the first implementation unless separately requested, aside from preserving and eventually absorbing the already-landed answer-node/OpenRouter route into the same resolver-owned architecture.
Final Recommendation¶
Build the internal routing layer first, then simplify backends behind it.
That means:
- Fix correctness first with request-scoped, non-mutating route resolution.
- Unify routing policy through ixllm with typed use-case policies (defined per-package, assembled at startup) and explicit execution policy.
- Use workload-specific adapters plus capability validation so route changes are safe.
- Drive warmup, tracing, and observability from the same resolver outputs.
- Migrate call sites incrementally.
- Add OpenRouter later where it meaningfully simplifies chat, not as a replacement for the architecture itself.
For clarity, the minimum acceptable first delivery is:
- request-scoped route resolution and invocation results
- unified policy ownership for the migrated ixchat chat paths (defined in ixchat, resolved through ixllm)
- no prompt-level or runtime routing overrides in the live request path
- explicit execution policy for migrated routes, including streaming failover behavior and a fix for the current mid-stream fallback bug
- no new chatbot.llm mutation in migrated NEW-system paths
- warmup and observability aligned with resolver-owned routing data
Expected Benefits¶
Once this architecture is in place, the main benefits should be easy to see in two areas: engineering maintainability and runtime operations.
Engineering / maintainability¶
- Model changes become policy changes instead of repeated code edits across many call sites.
- Provider switches become safer because capability checks happen before runtime.
- Provider switches become safer because timeout, retry, and fallback behavior are part of the use-case contract instead of hidden adapter defaults.
- Provider-specific logic is isolated in ixllm instead of leaking into business logic nodes.
- Fallback behavior is defined once instead of being split across ixchat, wrappers, and special cases.
- Shared route profiles make bulk route changes easier and less error-prone.
- New use cases should mostly mean adding a policy entry rather than writing new provider-wiring logic.
- New providers can be added behind the same resolver boundary instead of forcing call-site rewrites.
- The system can migrate incrementally instead of requiring a big-bang rewrite.
Operations / runtime behavior¶
- Request-scoped routing removes the current shared-state risk around fallback tracking and tracing.
- Declared route and actual execution are reported separately, so observability becomes more accurate even for code fallbacks and graceful degradation.
- Warmup can use the same resolver-owned internal policy registry as runtime routing, reducing drift.
- Langfuse prompt metadata stops being a hidden production routing surface.
- Fallback behavior becomes more predictable because one resolver owns the order and trigger rules.
- Streaming behavior becomes more predictable because failover is explicit instead of implicit mid-stream.
- Cached provider clients can still be reused underneath, so cleaner architecture does not require sacrificing connection reuse.
- If OpenRouter usage expands later, it can simplify many chat workloads without changing the system boundary again because it is already just another backend behind the resolver.
- Chat, embeddings, and rerank can share one routing control plane without being forced into the same runtime shape.
This gives you the biggest simplification with the least operational risk and aligns with the actual goal: make model and provider switches easy, explicit, and maintainable across the whole backend, not just for one subset of chat calls.
Source References¶
Repository references¶
- backend/packages/ixchat/ixchat/service.py
- backend/packages/ixchat/ixchat/utils/model_override.py
- backend/packages/ixllm/ixllm/client_factory.py
- backend/packages/ixllm/ixllm/fallback_llm.py
- backend/packages/ixllm/ixllm/prompts/langfuse.py
- backend/packages/ixrag/ixrag/lightrag/lightrag_llm.py
- backend/packages/ixtagging/ixtagging/service.py
- backend/packages/ixtagging/ixtagging/classifier.py
- backend/packages/ixtagging/ixtagging/translator.py
- backend/packages/ixtagging/ixtagging/scorer.py
- backend/packages/ixinfra/ixinfra/config/settings.py
External docs reviewed¶
- LangChain init_chat_model
- https://reference.langchain.com/python/langchain/models/#langchain.chat_models.init_chat_model
- LangChain configurable models
- https://docs.langchain.com/oss/python/langchain/models#configurable-models
- LangChain configurable alternatives
- https://reference.langchain.com/python/langchain_core/language_models/#configurable-alternatives
- LangChain base URL / OpenAI-compatible providers
- https://docs.langchain.com/oss/python/langchain/models#base-url-or-proxy
- LangChain ChatOpenAI custom provider parameters
- https://reference.langchain.com/python/integrations/langchain_openai/ChatOpenAI/#custom-provider-parameters
- LangChain ChatOpenRouter
- https://reference.langchain.com/python/langchain-openrouter/chat_models/ChatOpenRouter
- OpenRouter model routing
- https://openrouter.ai/docs/features/model-routing#model-routing
- OpenRouter provider ordering and fallbacks
- https://openrouter.ai/docs/guides/routing/provider-selection.mdx#ordering-specific-providers
- OpenRouter parameter compatibility filtering
- https://openrouter.ai/docs/features/provider-routing?codeTab=Python+Example+with+Fallbacks+Enabled#requiring-providers-to-support-all-parameters
- Langfuse LangChain / LangGraph integration
- https://github.com/langfuse/langfuse-docs/blob/main/pages/integrations/frameworks/langchain.mdx?plain=1#L13