Lookup Tables¶

Overview¶

Lookup tables let the Website Agent answer structured-coverage questions against tenant-supplied data: supported industries, covered countries, available integrations, plan tiers, French CCNs, and so on. The agent matches what the visitor said against entries in the table, surfaces a verdict (covered, uncovered, …), and feeds qualification semantics into demo CTA gating.

Two production surfaces consume the same signal:

Voice — the answer node grounds its reply on the canonical entry name + verdict + response hint.
Qualification — qualifies=true|false writes into cta_blocking_fields / interest signals so the demo CTA respects deliberate exclusions (e.g. "we don't sell to Defense").

End-to-end flow¶

flowchart TD A["Visitor message"] --> EX["early_profile_extractor (LLM pulls upstream fields, e.g. industry)"] EX --> C["lookup_resolver_node (reads collected profiling values)"] C --> Q{"per table: query_field set?"} Q -- "yes" --> QV{"collected_values[query_field] non-empty?"} QV -- no --> SK["Skip table this turn (wait for next extraction)"] QV -- yes --> D["Run lookup.match_entry against extracted value"] Q -- "no" --> TP{"Third-person / research framing?"} TP -- yes --> SK TP -- no --> D D --> E{"Any non-unknown match?"} SK --> E E -- no --> Z["No new signal — prior latched signal preserved"] E -- yes --> F["Pick highest-confidence signal across tables"] F --> G["Write LookupSignal to state"] G --> H["skill_applier_rules injects lookup_data skill (pure data render)"] H --> I{"Clear winner / multi-entity?"} I -- "clear winner" --> J["Write canonical_name to ProfilingState[populates_field]"] I -- "multi-entity" --> K["Render every matched entry in lookup_data block"] I -- "ambiguous" --> L["Render candidates list in lookup_data block"] J --> M["Per-client response_handling skill decides how to ground / ask"] K --> M L --> M Z --> M

The resolver runs after early_profile_extractor (serial edge in the graph). The extractor is the upstream stage that produces industry, country, etc.; the resolver consumes those values as match queries via query_field. This decouples the lookup signal from the visitor's conversational framing ("can you support my paye?") and grounds it on a qualified upstream value ("éditeur de logiciel").

Match-tier hierarchy¶

The Postgres function lookup.match_entry walks four tiers top-to-bottom and returns the first non-empty result. All surviving matches have confidence >= 0.85.

flowchart TD Q["Visitor query"] --> N["normalize (lower + unaccent)"] N --> T1{"Tier 1 — id_exact"} T1 -- match --> O1["conf 1.00 — LIMIT 1"] T1 -- miss --> T2{"Tier 2 — name_exact / alias_exact"} T2 -- match --> O2["conf 0.95 — LIMIT 1"] T2 -- miss --> T3{"Tier 3 — name_substring / alias_substring"} T3 -- match --> O3["conf 0.85–0.95 — LIMIT 3 multi-entity capable"] T3 -- miss --> T4{"Tier 4 — semantic (pgvector cosine)"} T4 -- match --> O4["conf 0.85–0.95 — LIMIT 3 cutoff 0.30 — LLM verifies"] T4 -- miss --> X["unknown — no signal"]

Tier	Type	Confidence	Limit	`matched_via` value
1	`id_exact`	1.00	1	`id_exact`
2	`name_exact`, `alias_exact`	0.95	1	`name_exact` / `alias_exact`
3	`name_substring`, `alias_substring`	0.85 – 0.95	3	`name_substring` / `alias_substring`
4	`semantic` (pgvector)	0.85 – 0.95	3	`semantic`

Tiers 1–3 are deterministic — the matched name or alias is literally in the visitor message. Tier 4 is loose-recall semantic similarity. Both surface through the same lookup_data block; per-client response_handling skills decide whether to ground (typically first-person identification) or treat the match as background (typically third-person / curiosity framing). matched_via is exposed as a placeholder so client skills can branch on tier if needed.

Signal post-processing¶

After the matcher returns rows, the resolver builds one LookupSignal and classifies it into one of three render modes:

flowchart LR R["Matcher rows"] --> M{"is_multi_entity?"} M -- "yes (several deterministic, distinct entry_ids)" --> ME["Multi-entity block list every match"] M -- no --> W{"is_clear_winner?"} W -- "yes (top conf ≥ 0.85, margin ≥ 0.01)" --> CW["Single grounding + profiling write"] W -- "no (ambiguous semantic)" --> AMB{"disambiguation_enabled?"} AMB -- yes --> DIS["lookup_disambiguation skill (agent asks visitor to pick)"] AMB -- no --> CW

Clear winner → one entry dominates. Singular fields drive the prompt; populates_field writes canonical_name into ProfilingState.collected_values so downstream qualification can use it.
Multi-entity → the visitor mentioned several distinct entries on the substring tier (e.g. "we run hospitals AND schools" → Healthcare + Education). Every match is rendered inside the lookup_data block; per-client response_handling decides whether to ground on each.
Ambiguous → multiple semantic candidates, no clear top. The candidates list is rendered inside the lookup_data block under "alternates (no clear winner):"; per-client response_handling decides whether to ask the visitor to pick (typically when disambiguation_enabled=true on the table).

Voice surface — what the LLM sees¶

Lookup is a pure data layer. One global skill lookup_data renders the matched entry, verdict, candidates, and metadata as structured facts — no decision prose. Per-client response_handling skills (e.g. clients/payfit.com/response_handling/industry_coverage/SKILL.md) own the decisions: when to ground vs ask, opening-word mapping, CTA gating, tone, disambiguation framing.

Layer	Where	What it does
Data	`system_context/lookup_data/SKILL.md`	Renders `{{canonical_name}}`, `{{verdict}}`, `{{confidence}}`, `{{candidates_list}}`, `{{multi_entity_block}}`, … as facts. One data-integrity hint: "use canonical_name and key_label verbatim when grounding".
Decision	`clients/<domain>/response_handling/industry_coverage/SKILL.md` (or equivalent)	Verdict↔opening-word mapping, first-person vs third-person framing, demo CTA gating, disambiguation prompt shape, language.

_inject_lookup_data in skill_applier_rules.py fires the data skill whenever state.lookup_signal is present; per-tenant clients/<domain>/system_context/lookup_data/SKILL.md can override the data render itself, though clients normally only need a response_handling decision skill.

CTA gating happens through the qualification subsystem (qualifies / cta_blocking_fields), never through prompt wording in the data block.

Example rendered block:

=== LOOKUP TABLE DATA ===
Tool: lookup_supported_industries
Visitor message: "we're a small hospital, would your product work for us?"

  matched entry:  Healthcare (industry_code HEALTH_002)
  verdict:        covered
  confidence:     0.910
  matched_via:    alias_substring

response_hint: Confirm coverage. Mention the canonical industry name and
industry_code. Then propose the demo CTA.

When grounding the reply on this data, use ``canonical_name`` and
``industry_code`` verbatim — paraphrasing makes the verdict ambiguous.
=== END LOOKUP TABLE DATA ===

When the resolver returns multiple candidates with no clear winner, the block adds an "alternates (no clear winner):" footer listing them; the per-client decision skill decides whether to ask the visitor to pick.

Qualification interplay¶

Lookup feeds two decisions on every turn it fires: how the agent talks about the entry (voice) AND whether the demo CTA is allowed (qualification). Three concepts compose to produce the qualification side.

1. `qualifies` flag¶

Every verdict declares whether matching it makes the visitor demo-eligible:

verdicts:
  - value: covered
    qualifies: true          # Healthcare / SaaS / Retail → demo OK
  - value: uncovered
    qualifies: false         # Gambling / Defense / Crypto → demo blocked
    disqualification_reason: industry_not_supported

Value	Effect
`true`	Positive signal — lookup confirms the visitor is in scope.
`false`	Blocker — `disqualification_reason` joins the visitor's blocking-fields set. Demo CTA suppressed even on explicit `BOOKING` intent later in the conversation.
`null`	Neutral — lookup hit but no opinion on demo gating.

Concretely: visitor says "we run a casino" → uncovered → qualifies=false, disqualification_reason=industry_not_supported. The demo_offer skill sees the blocker and refuses to fire 👇.

2. Signal latching¶

The resolver writes LookupSignal into RoseChatState.lookup_signal. On the next turn it runs again — but if the new visitor message doesn't match anything, it leaves the prior signal in place.

Turn	Visitor says	Resolver	`state.lookup_signal`
1	"we're a sports betting platform"	Match → Gambling, `qualifies=false`	Gambling
2	"what integrations do you support?"	No match	Gambling (latched)
3	"ok book a demo"	No match	Gambling (latched) — demo refused

Without latching, turn 2's "no match" would clear the signal and turn 3 could push the demo CTA — defeating the deliberate exclusion. Latching means a disqualification you collected ONCE survives until the visitor explicitly says something that changes it.

3. `query_field` and `populates_field` — lookup as a profiling stage¶

Qualification has 👉 questions that collect specific fields (team_size, use_case, industry, convention_collective, …). When cta_blocking_fields lists those, the demo CTA waits until they're collected. Fields normally get filled by:

👉 question → visitor answers → profile_extractor (LLM) pulls the answer into ProfilingState.collected_values[field].

A lookup table can plug into that pipeline as either an input consumer (query_field) or an output producer (populates_field), or both. The two-stage pattern is:

table_id: ccn_coverage
query_field: industry                # input — what the resolver matches on
populates_field: convention_collective  # output — what a clear-winner match writes

early_profile_extractor runs first (LLM). It extracts industry from the conversation.
lookup_resolver runs next (serial edge). It reads collected_values["industry"] and runs the CCN matcher on that value, not on the raw visitor message.
On a clear-winner match, the resolver writes collected_values["convention_collective"] with the canonical CCN name.

Two query-source modes coexist per table:

Profiling-driven (query_field set) — the resolver matches on the upstream extracted value. Use when the visitor's framing is noisy (industry mentioned inside a question about something else) and the extractor can pull out the concept reliably.
Raw-input (query_field unset) — the resolver matches on the visitor's verbatim message and applies the third-person guard. Use when the visitor will literally name the entry (country, integration, plan tier) and the deterministic tiers can lift it straight from the text.

When query_field is set but the value is empty, the table is skipped for the turn — the resolver re-runs next turn once the extractor fills the upstream field.

Implicit lookup ownership¶

Any field that appears as populates_field on a bound lookup table is implicitly lookup-owned. early_profile_extractor queries lookup_owned_fields_for_domain(domain) and strips those fields from its unfilled-fields list before prompting the LLM. The lookup resolver becomes the single writer, so the canonical name stored in collected_values always matches a real entry in lookup.entries — never a paraphrase the LLM invented.

The clear-winner gate matters: ambiguous matches (e.g. industry → both Strategy Consulting and IT Consulting) DO NOT write profiling. The resolver can't tell which one to commit to, so the field stays empty and either lookup_disambiguation or a regular 👉 collects it.

Composition¶

On one visitor turn:

Resolver runs → produces signal (or keeps the latched one).
Signal carries qualifies (CTA gate input) + optional populates_field write (qualification input).
demo_offer skill checks: any blocking field unsatisfied? any qualifies=false signal active? → if either, no 👇.
Voice path: lookup_result_authoritative (deterministic tier) or lookup_result_candidate (semantic tier) skill renders canonical name + verdict in the answer prompt.

Lookup is therefore both a voice grounding source AND a qualification participant, in a single pass.

Multi-entity render¶

When the substring tier returns several distinct entries, the block prepends a multi-entity sub-block:

MULTIPLE ENTRIES MATCHED (visitor mentioned several):
  - Healthcare (industry_code HEALTH_002) — verdict covered
  - Education (industry_code EDU_004) — verdict covered
Ground on EACH entry above by name and industry_code. Do NOT drop any.
If a visitor question covers all of them, answer for every one —
concatenating the per-entry verdict.

The clear-winner profiling write is suppressed in this case (the resolver only writes when there's exactly one dominant entry), so qualification stays unfilled and a downstream 👉 can disambiguate if needed.

Configuration¶

Tables live in their own Postgres schema (lookup.tables + lookup.entries), not in config.client_configs. Manage them through the rose-config lookup CLI:

# Inline yaml fixture (hand-curated, small tables)
rose-config lookup import \
    --site <domain> \
    --mapping backend/apps/shared_data/lookup_mappings/<table>.yaml

# Tenant xlsx + sheet/column mapping yaml
rose-config lookup import \
    --site <domain> \
    --mapping <mapping.yaml> \
    --source <data.xlsx>

# List, debug-match
rose-config lookup list  --site <domain>
rose-config lookup match --site <domain> --table-id <table_id> --query "<text>"

Each table carries:

Field	Purpose
`table_id`	Stable handle, snake_case (e.g. `supported_industries`).
`entry_term`	Data-model label (`industry`, `country`, `integration`).
`visitor_topic`	Visitor-facing label rendered in the prompt (`their industry or sector`).
`key_label`	What `entry_id` means to the visitor (`industry_code`, `IDCC`, `ISO`).
`verdicts[]`	Allowed outcomes — value, label, response_hint, qualifies, disqualification_reason.
`query_field`	Profiling field id whose extracted value drives the match query (optional). When set, the resolver matches on `collected_values[query_field]` instead of the raw visitor message. Leave unset when the visitor's verbatim mention IS the match query (e.g. they type the country name directly).
`populates_field`	Qualification field to write on clear-winner match (optional). The named field becomes implicitly lookup-owned: `early_profile_extractor` strips it from its unfilled-fields list so the LLM extractor never produces it, leaving the lookup as the single source of truth.
`disambiguation_enabled`	Fire the `lookup_disambiguation` skill on ambiguous semantic matches.

See backend/apps/shared_data/lookup_mappings/ for full examples (CCN xlsx import, testfeatures yaml inline).

Entry metadata — what NOT to put there¶

Per-entry metadata: {key: value} is freeform but reserved for entry-level descriptors (region tags, internal labels, locale codes). It does not carry landing-page URLs:

Sector / topic landing pages flow through content_recommendation (curated_content_groups + curated_content_items in Supabase). That subsystem is the single source of truth for "if visitor talks about X, surface this page".
Storing URLs in lookup metadata creates concept overlap with content_recommendation and primes the answer LLM to fabricate plausible paths when the field is unset. Don't.
If a tenant wants a sector landing page surfaced on lookup match, they configure it via curated content keyed to the same topic vocabulary. Lookup stays focused on verdict + canonical_name.

When to add a lookup table¶

Three conditions to meet:

The client provides a structured list of "supported X" vs "not supported X" rows.
Visitors will mention the topic (industry, country, integration, plan) and expect a verdict.
The list is long (>10 entries) or expected to change over time.

If all three hold, a lookup table is strictly better than encoding the same information in a hand-written skill body: the matcher catches mentions the LLM would miss, the verdict is authoritative, and the data is editable without redeploying prompts.

Taxonomy¶

Vocabulary used throughout this page and across the lookup codebase.

Data shapes¶

Term	What it is
Lookup table	One tenant-scoped row in `lookup.tables` — metadata for a coverage concept (`supported_industries`, `country_coverage`, `ccn_coverage`). Identified by `table_id`.
Entry	One row in `lookup.entries` — a specific value the table can return (e.g. `Healthcare`, `Gambling`). Identified by `entry_id` (e.g. `HEALTH_002`).
Alias	Alternate phrasing a visitor might use for an entry (`hospital`, `clinic` → `Healthcare`). Stored both raw and normalized.
Verdict	Declared outcome the matcher can return for an entry (`covered`, `uncovered`, …). Each verdict carries a `response_hint` (voice) and qualification semantics (`qualifies`, `disqualification_reason`).

Naming surface¶

Term	Used for	Example
`entry_term`	Data-model label for one row, used in tool descriptions.	`industry`, `country`, `convention collective`
`visitor_topic`	Visitor-facing label rendered in the prompt, framing what the visitor mentioned.	`their industry or business sector`
`key_label`	What `entry_id` represents to the visitor; surfaced verbatim.	`industry_code`, `IDCC`, `ISO code`
`canonical_name`	The entry's official name as stored. The agent must use it verbatim.	`Healthcare`, `Strategy Consulting`

Match tiers¶

Term	Meaning
`id_exact`	Query contains the literal `entry_id` (e.g. visitor pastes `HEALTH_002`). Confidence 1.00.
`name_exact` / `alias_exact`	Normalized query equals the canonical name or one of its aliases. Confidence 0.95.
`name_substring` / `alias_substring`	Canonical name or alias appears as a substring of the query. Confidence 0.85 – 0.95. Multi-entity capable.
`semantic`	pgvector cosine similarity against the entry's Q/A embedding. Confidence 0.85 – 0.95 after rescaling above the cutoff.
Deterministic tier	Tiers 1–3. The match is literally in the query, so the verdict is authoritative.
Semantic tier	Tier 4. The match is a paraphrase / self-ID hint; the answer LLM verifies before grounding.

Resolver output¶

Term	Meaning
`LookupSignal`	Resolver-produced summary written to `RoseChatState.lookup_signal`. Carries the top-1 fields PLUS the full `candidates` list, qualification semantics, and `matched_query`. One signal per turn — the runtime surface the answer node sees.
`LookupCandidate`	One row inside `LookupSignal.candidates`. Used to render alternates and to detect multi-entity / ambiguity.
`LookupMatchResult`	Per-tool RPC return value, shaped for an LLM tool-binding flow that is not currently wired into the answer node. Kept for a possible future where lookup tools are bound directly on the answer LLM; today's runtime path goes resolver → state → skill, not tools.

Render modes¶

Term	When it fires	Behavior
AUTHORITATIVE block	Match came from a deterministic tier (1–3). Rendered from `lookup_result_authoritative/SKILL.md`.	LLM must ground on the verdict; canonical name + key_label used verbatim.
CANDIDATE block	Match came from the semantic tier (4). Rendered from `lookup_result_candidate/SKILL.md`.	LLM verifies the match is the visitor's real topic before grounding. Default-reject on third-person / research framing.
Clear winner	`top.confidence ≥ 0.85` AND `top - runner_up ≥ 0.01`.	Singular grounding + optional profiling write.
Multi-entity	2+ deterministic candidates with distinct `entry_id`s (e.g. "hospitals AND schools").	Every match rendered in the prompt; LLM grounds on each. No profiling write.
Ambiguous	Several semantic candidates, no clear winner.	When `disambiguation_enabled=true` on the table, `lookup_disambiguation` skill fires — agent lists candidates and asks 👉.

Qualification semantics¶

Term	Meaning
`qualifies`	Per-verdict CTA gate signal. `true` = demo allowed, `false` = demo blocked, `null` = neutral.
`disqualification_reason`	Token stored when `qualifies=false`, fed into `cta_blocking_fields` / interest signals.
Signal latching	A `qualifies=false` signal persists on state across follow-up turns until something explicitly overwrites it. Prevents a generic follow-up from accidentally re-enabling the demo CTA.
`query_field`	Table-level setting. When set, the resolver builds its match query from `ProfilingState.collected_values[query_field]` (extracted upstream by `early_profile_extractor`) instead of the raw visitor message. When the value is empty, the table is skipped this turn.
`populates_field`	Table-level setting. When a clear-winner match resolves, `canonical_name` is written into `ProfilingState.collected_values[populates_field]`, satisfying that field for qualification without asking 👉.
Lookup-owned field	Any profiling field appearing as `populates_field` on a bound lookup table. `early_profile_extractor` strips lookup-owned fields from its prompt so the LLM never writes them — the resolver is the single source of truth for the canonical value.

Behavioral guards¶

Term	Meaning
Third-person guard	Pre-resolver heuristic (`looks_like_third_person`) that short-circuits the matcher when the visitor uses research / curiosity / topical framing ("how do casinos use chatbots", "for a blog post I'm writing"). Prevents incidental mentions from producing a coverage verdict.
CANDIDATE verification	Defensive prompt directive on semantic-tier hits. Default-reject unless the visitor is speaking AS / FROM the topic in first-person.

docs/src/backend/skills.md — generic skill-system primer (frontmatter, categories, selection).
Data skill: backend/apps/shared_data/prompts/website-agent/skills/system_context/lookup_data/SKILL.md (pure placeholder render).
Per-client decision skills (examples): clients/payfit.com/response_handling/industry_coverage/SKILL.md, clients/testfeatures.com/response_handling/industry_coverage/SKILL.md.
docs/src/agents/website-agent/qualification-flow.md — how qualifies / disqualification_reason flow into CTA gating.
docs/src/unified-config-system.md — config vs lookup boundary.
Source: backend/packages/ixchat/ixchat/tools/lookup/ (factory, models, normalize, renderer).
E2E eval: feature-lookup dataset, rose-eval features run lookup.