Skip to content

Lookup Tables

Overview

Lookup tables let the Website Agent answer structured-coverage questions against tenant-supplied data: supported industries, covered countries, available integrations, plan tiers, French CCNs, and so on. The agent matches what the visitor said against entries in the table, surfaces a verdict (covered, uncovered, …), and feeds qualification semantics into demo CTA gating.

Two production surfaces consume the same signal:

  • Voice — the answer node grounds its reply on the canonical entry name + verdict + response hint.
  • Qualificationqualifies=true|false writes into cta_blocking_fields / interest signals so the demo CTA respects deliberate exclusions (e.g. "we don't sell to Defense").

End-to-end flow

flowchart TD A["Visitor message"] --> EX["early_profile_extractor<br/>(LLM pulls upstream fields,<br/>e.g. industry)"] EX --> C["lookup_resolver_node<br/>(reads collected profiling values)"] C --> Q{"per table:<br/>query_field set?"} Q -- "yes" --> QV{"collected_values[query_field]<br/>non-empty?"} QV -- no --> SK["Skip table this turn<br/>(wait for next extraction)"] QV -- yes --> D["Run lookup.match_entry<br/>against extracted value"] Q -- "no" --> TP{"Third-person /<br/>research framing?"} TP -- yes --> SK TP -- no --> D D --> E{"Any non-unknown match?"} SK --> E E -- no --> Z["No new signal — prior latched signal preserved"] E -- yes --> F["Pick highest-confidence signal across tables"] F --> G["Write LookupSignal to state"] G --> H["skill_applier_rules injects<br/>lookup_data skill (pure data render)"] H --> I{"Clear winner / multi-entity?"} I -- "clear winner" --> J["Write canonical_name to<br/>ProfilingState[populates_field]"] I -- "multi-entity" --> K["Render every matched entry<br/>in lookup_data block"] I -- "ambiguous" --> L["Render candidates list<br/>in lookup_data block"] J --> M["Per-client response_handling skill<br/>decides how to ground / ask"] K --> M L --> M Z --> M

The resolver runs after early_profile_extractor (serial edge in the graph). The extractor is the upstream stage that produces industry, country, etc.; the resolver consumes those values as match queries via query_field. This decouples the lookup signal from the visitor's conversational framing ("can you support my paye?") and grounds it on a qualified upstream value ("éditeur de logiciel").

Match-tier hierarchy

The Postgres function lookup.match_entry walks four tiers top-to-bottom and returns the first non-empty result. All surviving matches have confidence >= 0.85.

flowchart TD Q["Visitor query"] --> N["normalize<br/>(lower + unaccent)"] N --> T1{"Tier 1 — id_exact"} T1 -- match --> O1["conf 1.00 — LIMIT 1"] T1 -- miss --> T2{"Tier 2 — name_exact / alias_exact"} T2 -- match --> O2["conf 0.95 — LIMIT 1"] T2 -- miss --> T3{"Tier 3 — name_substring / alias_substring"} T3 -- match --> O3["conf 0.85–0.95 — LIMIT 3<br/>multi-entity capable"] T3 -- miss --> T4{"Tier 4 — semantic (pgvector cosine)"} T4 -- match --> O4["conf 0.85–0.95 — LIMIT 3<br/>cutoff 0.30 — LLM verifies"] T4 -- miss --> X["unknown — no signal"]
Tier Type Confidence Limit matched_via value
1 id_exact 1.00 1 id_exact
2 name_exact, alias_exact 0.95 1 name_exact / alias_exact
3 name_substring, alias_substring 0.85 – 0.95 3 name_substring / alias_substring
4 semantic (pgvector) 0.85 – 0.95 3 semantic

Tiers 1–3 are deterministic — the matched name or alias is literally in the visitor message. Tier 4 is loose-recall semantic similarity. Both surface through the same lookup_data block; per-client response_handling skills decide whether to ground (typically first-person identification) or treat the match as background (typically third-person / curiosity framing). matched_via is exposed as a placeholder so client skills can branch on tier if needed.

Signal post-processing

After the matcher returns rows, the resolver builds one LookupSignal and classifies it into one of three render modes:

flowchart LR R["Matcher rows"] --> M{"is_multi_entity?"} M -- "yes (several deterministic, distinct entry_ids)" --> ME["Multi-entity block<br/>list every match"] M -- no --> W{"is_clear_winner?"} W -- "yes (top conf ≥ 0.85, margin ≥ 0.01)" --> CW["Single grounding +<br/>profiling write"] W -- "no (ambiguous semantic)" --> AMB{"disambiguation_enabled?"} AMB -- yes --> DIS["lookup_disambiguation skill<br/>(agent asks visitor to pick)"] AMB -- no --> CW
  • Clear winner → one entry dominates. Singular fields drive the prompt; populates_field writes canonical_name into ProfilingState.collected_values so downstream qualification can use it.
  • Multi-entity → the visitor mentioned several distinct entries on the substring tier (e.g. "we run hospitals AND schools" → Healthcare + Education). Every match is rendered inside the lookup_data block; per-client response_handling decides whether to ground on each.
  • Ambiguous → multiple semantic candidates, no clear top. The candidates list is rendered inside the lookup_data block under "alternates (no clear winner):"; per-client response_handling decides whether to ask the visitor to pick (typically when disambiguation_enabled=true on the table).

Voice surface — what the LLM sees

Lookup is a pure data layer. One global skill lookup_data renders the matched entry, verdict, candidates, and metadata as structured facts — no decision prose. Per-client response_handling skills (e.g. clients/payfit.com/response_handling/industry_coverage/SKILL.md) own the decisions: when to ground vs ask, opening-word mapping, CTA gating, tone, disambiguation framing.

Layer Where What it does
Data system_context/lookup_data/SKILL.md Renders {{canonical_name}}, {{verdict}}, {{confidence}}, {{candidates_list}}, {{multi_entity_block}}, … as facts. One data-integrity hint: "use canonical_name and key_label verbatim when grounding".
Decision clients/<domain>/response_handling/industry_coverage/SKILL.md (or equivalent) Verdict↔opening-word mapping, first-person vs third-person framing, demo CTA gating, disambiguation prompt shape, language.

_inject_lookup_data in skill_applier_rules.py fires the data skill whenever state.lookup_signal is present; per-tenant clients/<domain>/system_context/lookup_data/SKILL.md can override the data render itself, though clients normally only need a response_handling decision skill.

CTA gating happens through the qualification subsystem (qualifies / cta_blocking_fields), never through prompt wording in the data block.

Example rendered block:

=== LOOKUP TABLE DATA ===
Tool: lookup_supported_industries
Visitor message: "we're a small hospital, would your product work for us?"

  matched entry:  Healthcare (industry_code HEALTH_002)
  verdict:        covered
  confidence:     0.910
  matched_via:    alias_substring

response_hint: Confirm coverage. Mention the canonical industry name and
industry_code. Then propose the demo CTA.

When grounding the reply on this data, use ``canonical_name`` and
``industry_code`` verbatim — paraphrasing makes the verdict ambiguous.
=== END LOOKUP TABLE DATA ===

When the resolver returns multiple candidates with no clear winner, the block adds an "alternates (no clear winner):" footer listing them; the per-client decision skill decides whether to ask the visitor to pick.

Qualification interplay

Lookup feeds two decisions on every turn it fires: how the agent talks about the entry (voice) AND whether the demo CTA is allowed (qualification). Three concepts compose to produce the qualification side.

1. qualifies flag

Every verdict declares whether matching it makes the visitor demo-eligible:

verdicts:
  - value: covered
    qualifies: true          # Healthcare / SaaS / Retail → demo OK
  - value: uncovered
    qualifies: false         # Gambling / Defense / Crypto → demo blocked
    disqualification_reason: industry_not_supported
Value Effect
true Positive signal — lookup confirms the visitor is in scope.
false Blocker — disqualification_reason joins the visitor's blocking-fields set. Demo CTA suppressed even on explicit BOOKING intent later in the conversation.
null Neutral — lookup hit but no opinion on demo gating.

Concretely: visitor says "we run a casino" → uncovered → qualifies=false, disqualification_reason=industry_not_supported. The demo_offer skill sees the blocker and refuses to fire 👇.

2. Signal latching

The resolver writes LookupSignal into RoseChatState.lookup_signal. On the next turn it runs again — but if the new visitor message doesn't match anything, it leaves the prior signal in place.

Turn Visitor says Resolver state.lookup_signal
1 "we're a sports betting platform" Match → Gambling, qualifies=false Gambling
2 "what integrations do you support?" No match Gambling (latched)
3 "ok book a demo" No match Gambling (latched) — demo refused

Without latching, turn 2's "no match" would clear the signal and turn 3 could push the demo CTA — defeating the deliberate exclusion. Latching means a disqualification you collected ONCE survives until the visitor explicitly says something that changes it.

3. query_field and populates_field — lookup as a profiling stage

Qualification has 👉 questions that collect specific fields (team_size, use_case, industry, convention_collective, …). When cta_blocking_fields lists those, the demo CTA waits until they're collected. Fields normally get filled by:

  • 👉 question → visitor answers → profile_extractor (LLM) pulls the answer into ProfilingState.collected_values[field].

A lookup table can plug into that pipeline as either an input consumer (query_field) or an output producer (populates_field), or both. The two-stage pattern is:

table_id: ccn_coverage
query_field: industry                # input — what the resolver matches on
populates_field: convention_collective  # output — what a clear-winner match writes
  1. early_profile_extractor runs first (LLM). It extracts industry from the conversation.
  2. lookup_resolver runs next (serial edge). It reads collected_values["industry"] and runs the CCN matcher on that value, not on the raw visitor message.
  3. On a clear-winner match, the resolver writes collected_values["convention_collective"] with the canonical CCN name.

Two query-source modes coexist per table:

  • Profiling-driven (query_field set) — the resolver matches on the upstream extracted value. Use when the visitor's framing is noisy (industry mentioned inside a question about something else) and the extractor can pull out the concept reliably.
  • Raw-input (query_field unset) — the resolver matches on the visitor's verbatim message and applies the third-person guard. Use when the visitor will literally name the entry (country, integration, plan tier) and the deterministic tiers can lift it straight from the text.

When query_field is set but the value is empty, the table is skipped for the turn — the resolver re-runs next turn once the extractor fills the upstream field.

Implicit lookup ownership

Any field that appears as populates_field on a bound lookup table is implicitly lookup-owned. early_profile_extractor queries lookup_owned_fields_for_domain(domain) and strips those fields from its unfilled-fields list before prompting the LLM. The lookup resolver becomes the single writer, so the canonical name stored in collected_values always matches a real entry in lookup.entries — never a paraphrase the LLM invented.

The clear-winner gate matters: ambiguous matches (e.g. industry → both Strategy Consulting and IT Consulting) DO NOT write profiling. The resolver can't tell which one to commit to, so the field stays empty and either lookup_disambiguation or a regular 👉 collects it.

Composition

On one visitor turn:

  1. Resolver runs → produces signal (or keeps the latched one).
  2. Signal carries qualifies (CTA gate input) + optional populates_field write (qualification input).
  3. demo_offer skill checks: any blocking field unsatisfied? any qualifies=false signal active? → if either, no 👇.
  4. Voice path: lookup_result_authoritative (deterministic tier) or lookup_result_candidate (semantic tier) skill renders canonical name + verdict in the answer prompt.

Lookup is therefore both a voice grounding source AND a qualification participant, in a single pass.

Multi-entity render

When the substring tier returns several distinct entries, the block prepends a multi-entity sub-block:

MULTIPLE ENTRIES MATCHED (visitor mentioned several):
  - Healthcare (industry_code HEALTH_002) — verdict covered
  - Education (industry_code EDU_004) — verdict covered
Ground on EACH entry above by name and industry_code. Do NOT drop any.
If a visitor question covers all of them, answer for every one —
concatenating the per-entry verdict.

The clear-winner profiling write is suppressed in this case (the resolver only writes when there's exactly one dominant entry), so qualification stays unfilled and a downstream 👉 can disambiguate if needed.

Configuration

Tables live in their own Postgres schema (lookup.tables + lookup.entries), not in config.client_configs. Manage them through the rose-config lookup CLI:

# Inline yaml fixture (hand-curated, small tables)
rose-config lookup import \
    --site <domain> \
    --mapping backend/apps/shared_data/lookup_mappings/<table>.yaml

# Tenant xlsx + sheet/column mapping yaml
rose-config lookup import \
    --site <domain> \
    --mapping <mapping.yaml> \
    --source <data.xlsx>

# List, debug-match
rose-config lookup list  --site <domain>
rose-config lookup match --site <domain> --table-id <table_id> --query "<text>"

Each table carries:

Field Purpose
table_id Stable handle, snake_case (e.g. supported_industries).
entry_term Data-model label (industry, country, integration).
visitor_topic Visitor-facing label rendered in the prompt (their industry or sector).
key_label What entry_id means to the visitor (industry_code, IDCC, ISO).
verdicts[] Allowed outcomes — value, label, response_hint, qualifies, disqualification_reason.
query_field Profiling field id whose extracted value drives the match query (optional). When set, the resolver matches on collected_values[query_field] instead of the raw visitor message. Leave unset when the visitor's verbatim mention IS the match query (e.g. they type the country name directly).
populates_field Qualification field to write on clear-winner match (optional). The named field becomes implicitly lookup-owned: early_profile_extractor strips it from its unfilled-fields list so the LLM extractor never produces it, leaving the lookup as the single source of truth.
disambiguation_enabled Fire the lookup_disambiguation skill on ambiguous semantic matches.

See backend/apps/shared_data/lookup_mappings/ for full examples (CCN xlsx import, testfeatures yaml inline).

Entry metadata — what NOT to put there

Per-entry metadata: {key: value} is freeform but reserved for entry-level descriptors (region tags, internal labels, locale codes). It does not carry landing-page URLs:

  • Sector / topic landing pages flow through content_recommendation (curated_content_groups + curated_content_items in Supabase). That subsystem is the single source of truth for "if visitor talks about X, surface this page".
  • Storing URLs in lookup metadata creates concept overlap with content_recommendation and primes the answer LLM to fabricate plausible paths when the field is unset. Don't.
  • If a tenant wants a sector landing page surfaced on lookup match, they configure it via curated content keyed to the same topic vocabulary. Lookup stays focused on verdict + canonical_name.

When to add a lookup table

Three conditions to meet:

  1. The client provides a structured list of "supported X" vs "not supported X" rows.
  2. Visitors will mention the topic (industry, country, integration, plan) and expect a verdict.
  3. The list is long (>10 entries) or expected to change over time.

If all three hold, a lookup table is strictly better than encoding the same information in a hand-written skill body: the matcher catches mentions the LLM would miss, the verdict is authoritative, and the data is editable without redeploying prompts.

Taxonomy

Vocabulary used throughout this page and across the lookup codebase.

Data shapes

Term What it is
Lookup table One tenant-scoped row in lookup.tables — metadata for a coverage concept (supported_industries, country_coverage, ccn_coverage). Identified by table_id.
Entry One row in lookup.entries — a specific value the table can return (e.g. Healthcare, Gambling). Identified by entry_id (e.g. HEALTH_002).
Alias Alternate phrasing a visitor might use for an entry (hospital, clinicHealthcare). Stored both raw and normalized.
Verdict Declared outcome the matcher can return for an entry (covered, uncovered, …). Each verdict carries a response_hint (voice) and qualification semantics (qualifies, disqualification_reason).

Naming surface

Term Used for Example
entry_term Data-model label for one row, used in tool descriptions. industry, country, convention collective
visitor_topic Visitor-facing label rendered in the prompt, framing what the visitor mentioned. their industry or business sector
key_label What entry_id represents to the visitor; surfaced verbatim. industry_code, IDCC, ISO code
canonical_name The entry's official name as stored. The agent must use it verbatim. Healthcare, Strategy Consulting

Match tiers

Term Meaning
id_exact Query contains the literal entry_id (e.g. visitor pastes HEALTH_002). Confidence 1.00.
name_exact / alias_exact Normalized query equals the canonical name or one of its aliases. Confidence 0.95.
name_substring / alias_substring Canonical name or alias appears as a substring of the query. Confidence 0.85 – 0.95. Multi-entity capable.
semantic pgvector cosine similarity against the entry's Q/A embedding. Confidence 0.85 – 0.95 after rescaling above the cutoff.
Deterministic tier Tiers 1–3. The match is literally in the query, so the verdict is authoritative.
Semantic tier Tier 4. The match is a paraphrase / self-ID hint; the answer LLM verifies before grounding.

Resolver output

Term Meaning
LookupSignal Resolver-produced summary written to RoseChatState.lookup_signal. Carries the top-1 fields PLUS the full candidates list, qualification semantics, and matched_query. One signal per turn — the runtime surface the answer node sees.
LookupCandidate One row inside LookupSignal.candidates. Used to render alternates and to detect multi-entity / ambiguity.
LookupMatchResult Per-tool RPC return value, shaped for an LLM tool-binding flow that is not currently wired into the answer node. Kept for a possible future where lookup tools are bound directly on the answer LLM; today's runtime path goes resolver → state → skill, not tools.

Render modes

Term When it fires Behavior
AUTHORITATIVE block Match came from a deterministic tier (1–3). Rendered from lookup_result_authoritative/SKILL.md. LLM must ground on the verdict; canonical name + key_label used verbatim.
CANDIDATE block Match came from the semantic tier (4). Rendered from lookup_result_candidate/SKILL.md. LLM verifies the match is the visitor's real topic before grounding. Default-reject on third-person / research framing.
Clear winner top.confidence ≥ 0.85 AND top - runner_up ≥ 0.01. Singular grounding + optional profiling write.
Multi-entity 2+ deterministic candidates with distinct entry_ids (e.g. "hospitals AND schools"). Every match rendered in the prompt; LLM grounds on each. No profiling write.
Ambiguous Several semantic candidates, no clear winner. When disambiguation_enabled=true on the table, lookup_disambiguation skill fires — agent lists candidates and asks 👉.

Qualification semantics

Term Meaning
qualifies Per-verdict CTA gate signal. true = demo allowed, false = demo blocked, null = neutral.
disqualification_reason Token stored when qualifies=false, fed into cta_blocking_fields / interest signals.
Signal latching A qualifies=false signal persists on state across follow-up turns until something explicitly overwrites it. Prevents a generic follow-up from accidentally re-enabling the demo CTA.
query_field Table-level setting. When set, the resolver builds its match query from ProfilingState.collected_values[query_field] (extracted upstream by early_profile_extractor) instead of the raw visitor message. When the value is empty, the table is skipped this turn.
populates_field Table-level setting. When a clear-winner match resolves, canonical_name is written into ProfilingState.collected_values[populates_field], satisfying that field for qualification without asking 👉.
Lookup-owned field Any profiling field appearing as populates_field on a bound lookup table. early_profile_extractor strips lookup-owned fields from its prompt so the LLM never writes them — the resolver is the single source of truth for the canonical value.

Behavioral guards

Term Meaning
Third-person guard Pre-resolver heuristic (looks_like_third_person) that short-circuits the matcher when the visitor uses research / curiosity / topical framing ("how do casinos use chatbots", "for a blog post I'm writing"). Prevents incidental mentions from producing a coverage verdict.
CANDIDATE verification Defensive prompt directive on semantic-tier hits. Default-reject unless the visitor is speaking AS / FROM the topic in first-person.
  • docs/src/backend/skills.md — generic skill-system primer (frontmatter, categories, selection).
  • Data skill: backend/apps/shared_data/prompts/website-agent/skills/system_context/lookup_data/SKILL.md (pure placeholder render).
  • Per-client decision skills (examples): clients/payfit.com/response_handling/industry_coverage/SKILL.md, clients/testfeatures.com/response_handling/industry_coverage/SKILL.md.
  • docs/src/agents/website-agent/qualification-flow.md — how qualifies / disqualification_reason flow into CTA gating.
  • docs/src/unified-config-system.md — config vs lookup boundary.
  • Source: backend/packages/ixchat/ixchat/tools/lookup/ (factory, models, normalize, renderer).
  • E2E eval: feature-lookup dataset, rose-eval features run lookup.