Lookup Tables¶
Overview¶
Lookup tables let the Website Agent answer structured-coverage questions against tenant-supplied data: supported industries, covered countries, available integrations, plan tiers, French CCNs, and so on. The agent matches what the visitor said against entries in the table, surfaces a verdict (covered, uncovered, …), and feeds qualification semantics into demo CTA gating.
Two production surfaces consume the same signal:
- Voice — the answer node grounds its reply on the canonical entry name + verdict + response hint.
- Qualification —
qualifies=true|falsewrites intocta_blocking_fields/ interest signals so the demo CTA respects deliberate exclusions (e.g. "we don't sell to Defense").
End-to-end flow¶
The resolver runs after early_profile_extractor (serial edge in the graph). The extractor is the upstream stage that produces industry, country, etc.; the resolver consumes those values as match queries via query_field. This decouples the lookup signal from the visitor's conversational framing ("can you support my paye?") and grounds it on a qualified upstream value ("éditeur de logiciel").
Match-tier hierarchy¶
The Postgres function lookup.match_entry walks four tiers top-to-bottom and returns the first non-empty result. All surviving matches have confidence >= 0.85.
| Tier | Type | Confidence | Limit | matched_via value |
|---|---|---|---|---|
| 1 | id_exact |
1.00 | 1 | id_exact |
| 2 | name_exact, alias_exact |
0.95 | 1 | name_exact / alias_exact |
| 3 | name_substring, alias_substring |
0.85 – 0.95 | 3 | name_substring / alias_substring |
| 4 | semantic (pgvector) |
0.85 – 0.95 | 3 | semantic |
Tiers 1–3 are deterministic — the matched name or alias is literally in the visitor message. Tier 4 is loose-recall semantic similarity. Both surface through the same lookup_data block; per-client response_handling skills decide whether to ground (typically first-person identification) or treat the match as background (typically third-person / curiosity framing). matched_via is exposed as a placeholder so client skills can branch on tier if needed.
Signal post-processing¶
After the matcher returns rows, the resolver builds one LookupSignal and classifies it into one of three render modes:
- Clear winner → one entry dominates. Singular fields drive the prompt;
populates_fieldwritescanonical_nameintoProfilingState.collected_valuesso downstream qualification can use it. - Multi-entity → the visitor mentioned several distinct entries on the substring tier (e.g. "we run hospitals AND schools" → Healthcare + Education). Every match is rendered inside the
lookup_datablock; per-client response_handling decides whether to ground on each. - Ambiguous → multiple semantic candidates, no clear top. The candidates list is rendered inside the
lookup_datablock under "alternates (no clear winner):"; per-client response_handling decides whether to ask the visitor to pick (typically whendisambiguation_enabled=trueon the table).
Voice surface — what the LLM sees¶
Lookup is a pure data layer. One global skill lookup_data renders the matched entry, verdict, candidates, and metadata as structured facts — no decision prose. Per-client response_handling skills (e.g. clients/payfit.com/response_handling/industry_coverage/SKILL.md) own the decisions: when to ground vs ask, opening-word mapping, CTA gating, tone, disambiguation framing.
| Layer | Where | What it does |
|---|---|---|
| Data | system_context/lookup_data/SKILL.md |
Renders {{canonical_name}}, {{verdict}}, {{confidence}}, {{candidates_list}}, {{multi_entity_block}}, … as facts. One data-integrity hint: "use canonical_name and key_label verbatim when grounding". |
| Decision | clients/<domain>/response_handling/industry_coverage/SKILL.md (or equivalent) |
Verdict↔opening-word mapping, first-person vs third-person framing, demo CTA gating, disambiguation prompt shape, language. |
_inject_lookup_data in skill_applier_rules.py fires the data skill whenever state.lookup_signal is present; per-tenant clients/<domain>/system_context/lookup_data/SKILL.md can override the data render itself, though clients normally only need a response_handling decision skill.
CTA gating happens through the qualification subsystem (qualifies / cta_blocking_fields), never through prompt wording in the data block.
Example rendered block:
=== LOOKUP TABLE DATA ===
Tool: lookup_supported_industries
Visitor message: "we're a small hospital, would your product work for us?"
matched entry: Healthcare (industry_code HEALTH_002)
verdict: covered
confidence: 0.910
matched_via: alias_substring
response_hint: Confirm coverage. Mention the canonical industry name and
industry_code. Then propose the demo CTA.
When grounding the reply on this data, use ``canonical_name`` and
``industry_code`` verbatim — paraphrasing makes the verdict ambiguous.
=== END LOOKUP TABLE DATA ===
When the resolver returns multiple candidates with no clear winner, the block adds an "alternates (no clear winner):" footer listing them; the per-client decision skill decides whether to ask the visitor to pick.
Qualification interplay¶
Lookup feeds two decisions on every turn it fires: how the agent talks about the entry (voice) AND whether the demo CTA is allowed (qualification). Three concepts compose to produce the qualification side.
1. qualifies flag¶
Every verdict declares whether matching it makes the visitor demo-eligible:
verdicts:
- value: covered
qualifies: true # Healthcare / SaaS / Retail → demo OK
- value: uncovered
qualifies: false # Gambling / Defense / Crypto → demo blocked
disqualification_reason: industry_not_supported
| Value | Effect |
|---|---|
true |
Positive signal — lookup confirms the visitor is in scope. |
false |
Blocker — disqualification_reason joins the visitor's blocking-fields set. Demo CTA suppressed even on explicit BOOKING intent later in the conversation. |
null |
Neutral — lookup hit but no opinion on demo gating. |
Concretely: visitor says "we run a casino" → uncovered → qualifies=false, disqualification_reason=industry_not_supported. The demo_offer skill sees the blocker and refuses to fire 👇.
2. Signal latching¶
The resolver writes LookupSignal into RoseChatState.lookup_signal. On the next turn it runs again — but if the new visitor message doesn't match anything, it leaves the prior signal in place.
| Turn | Visitor says | Resolver | state.lookup_signal |
|---|---|---|---|
| 1 | "we're a sports betting platform" | Match → Gambling, qualifies=false |
Gambling |
| 2 | "what integrations do you support?" | No match | Gambling (latched) |
| 3 | "ok book a demo" | No match | Gambling (latched) — demo refused |
Without latching, turn 2's "no match" would clear the signal and turn 3 could push the demo CTA — defeating the deliberate exclusion. Latching means a disqualification you collected ONCE survives until the visitor explicitly says something that changes it.
3. query_field and populates_field — lookup as a profiling stage¶
Qualification has 👉 questions that collect specific fields (team_size, use_case, industry, convention_collective, …). When cta_blocking_fields lists those, the demo CTA waits until they're collected. Fields normally get filled by:
- 👉 question → visitor answers →
profile_extractor(LLM) pulls the answer intoProfilingState.collected_values[field].
A lookup table can plug into that pipeline as either an input consumer (query_field) or an output producer (populates_field), or both. The two-stage pattern is:
table_id: ccn_coverage
query_field: industry # input — what the resolver matches on
populates_field: convention_collective # output — what a clear-winner match writes
early_profile_extractorruns first (LLM). It extractsindustryfrom the conversation.lookup_resolverruns next (serial edge). It readscollected_values["industry"]and runs the CCN matcher on that value, not on the raw visitor message.- On a clear-winner match, the resolver writes
collected_values["convention_collective"]with the canonical CCN name.
Two query-source modes coexist per table:
- Profiling-driven (
query_fieldset) — the resolver matches on the upstream extracted value. Use when the visitor's framing is noisy (industry mentioned inside a question about something else) and the extractor can pull out the concept reliably. - Raw-input (
query_fieldunset) — the resolver matches on the visitor's verbatim message and applies the third-person guard. Use when the visitor will literally name the entry (country, integration, plan tier) and the deterministic tiers can lift it straight from the text.
When query_field is set but the value is empty, the table is skipped for the turn — the resolver re-runs next turn once the extractor fills the upstream field.
Implicit lookup ownership¶
Any field that appears as populates_field on a bound lookup table is implicitly lookup-owned. early_profile_extractor queries lookup_owned_fields_for_domain(domain) and strips those fields from its unfilled-fields list before prompting the LLM. The lookup resolver becomes the single writer, so the canonical name stored in collected_values always matches a real entry in lookup.entries — never a paraphrase the LLM invented.
The clear-winner gate matters: ambiguous matches (e.g. industry → both Strategy Consulting and IT Consulting) DO NOT write profiling. The resolver can't tell which one to commit to, so the field stays empty and either lookup_disambiguation or a regular 👉 collects it.
Composition¶
On one visitor turn:
- Resolver runs → produces signal (or keeps the latched one).
- Signal carries
qualifies(CTA gate input) + optionalpopulates_fieldwrite (qualification input). demo_offerskill checks: any blocking field unsatisfied? anyqualifies=falsesignal active? → if either, no 👇.- Voice path:
lookup_result_authoritative(deterministic tier) orlookup_result_candidate(semantic tier) skill renders canonical name + verdict in the answer prompt.
Lookup is therefore both a voice grounding source AND a qualification participant, in a single pass.
Multi-entity render¶
When the substring tier returns several distinct entries, the block prepends a multi-entity sub-block:
MULTIPLE ENTRIES MATCHED (visitor mentioned several):
- Healthcare (industry_code HEALTH_002) — verdict covered
- Education (industry_code EDU_004) — verdict covered
Ground on EACH entry above by name and industry_code. Do NOT drop any.
If a visitor question covers all of them, answer for every one —
concatenating the per-entry verdict.
The clear-winner profiling write is suppressed in this case (the resolver only writes when there's exactly one dominant entry), so qualification stays unfilled and a downstream 👉 can disambiguate if needed.
Configuration¶
Tables live in their own Postgres schema (lookup.tables + lookup.entries), not in config.client_configs. Manage them through the rose-config lookup CLI:
# Inline yaml fixture (hand-curated, small tables)
rose-config lookup import \
--site <domain> \
--mapping backend/apps/shared_data/lookup_mappings/<table>.yaml
# Tenant xlsx + sheet/column mapping yaml
rose-config lookup import \
--site <domain> \
--mapping <mapping.yaml> \
--source <data.xlsx>
# List, debug-match
rose-config lookup list --site <domain>
rose-config lookup match --site <domain> --table-id <table_id> --query "<text>"
Each table carries:
| Field | Purpose |
|---|---|
table_id |
Stable handle, snake_case (e.g. supported_industries). |
entry_term |
Data-model label (industry, country, integration). |
visitor_topic |
Visitor-facing label rendered in the prompt (their industry or sector). |
key_label |
What entry_id means to the visitor (industry_code, IDCC, ISO). |
verdicts[] |
Allowed outcomes — value, label, response_hint, qualifies, disqualification_reason. |
query_field |
Profiling field id whose extracted value drives the match query (optional). When set, the resolver matches on collected_values[query_field] instead of the raw visitor message. Leave unset when the visitor's verbatim mention IS the match query (e.g. they type the country name directly). |
populates_field |
Qualification field to write on clear-winner match (optional). The named field becomes implicitly lookup-owned: early_profile_extractor strips it from its unfilled-fields list so the LLM extractor never produces it, leaving the lookup as the single source of truth. |
disambiguation_enabled |
Fire the lookup_disambiguation skill on ambiguous semantic matches. |
See backend/apps/shared_data/lookup_mappings/ for full examples (CCN xlsx import, testfeatures yaml inline).
Entry metadata — what NOT to put there¶
Per-entry metadata: {key: value} is freeform but reserved for entry-level descriptors (region tags, internal labels, locale codes). It does not carry landing-page URLs:
- Sector / topic landing pages flow through
content_recommendation(curated_content_groups + curated_content_items in Supabase). That subsystem is the single source of truth for "if visitor talks about X, surface this page". - Storing URLs in lookup metadata creates concept overlap with
content_recommendationand primes the answer LLM to fabricate plausible paths when the field is unset. Don't. - If a tenant wants a sector landing page surfaced on lookup match, they configure it via curated content keyed to the same topic vocabulary. Lookup stays focused on
verdict+canonical_name.
When to add a lookup table¶
Three conditions to meet:
- The client provides a structured list of "supported X" vs "not supported X" rows.
- Visitors will mention the topic (industry, country, integration, plan) and expect a verdict.
- The list is long (>10 entries) or expected to change over time.
If all three hold, a lookup table is strictly better than encoding the same information in a hand-written skill body: the matcher catches mentions the LLM would miss, the verdict is authoritative, and the data is editable without redeploying prompts.
Taxonomy¶
Vocabulary used throughout this page and across the lookup codebase.
Data shapes¶
| Term | What it is |
|---|---|
| Lookup table | One tenant-scoped row in lookup.tables — metadata for a coverage concept (supported_industries, country_coverage, ccn_coverage). Identified by table_id. |
| Entry | One row in lookup.entries — a specific value the table can return (e.g. Healthcare, Gambling). Identified by entry_id (e.g. HEALTH_002). |
| Alias | Alternate phrasing a visitor might use for an entry (hospital, clinic → Healthcare). Stored both raw and normalized. |
| Verdict | Declared outcome the matcher can return for an entry (covered, uncovered, …). Each verdict carries a response_hint (voice) and qualification semantics (qualifies, disqualification_reason). |
Naming surface¶
| Term | Used for | Example |
|---|---|---|
entry_term |
Data-model label for one row, used in tool descriptions. | industry, country, convention collective |
visitor_topic |
Visitor-facing label rendered in the prompt, framing what the visitor mentioned. | their industry or business sector |
key_label |
What entry_id represents to the visitor; surfaced verbatim. |
industry_code, IDCC, ISO code |
canonical_name |
The entry's official name as stored. The agent must use it verbatim. | Healthcare, Strategy Consulting |
Match tiers¶
| Term | Meaning |
|---|---|
id_exact |
Query contains the literal entry_id (e.g. visitor pastes HEALTH_002). Confidence 1.00. |
name_exact / alias_exact |
Normalized query equals the canonical name or one of its aliases. Confidence 0.95. |
name_substring / alias_substring |
Canonical name or alias appears as a substring of the query. Confidence 0.85 – 0.95. Multi-entity capable. |
semantic |
pgvector cosine similarity against the entry's Q/A embedding. Confidence 0.85 – 0.95 after rescaling above the cutoff. |
| Deterministic tier | Tiers 1–3. The match is literally in the query, so the verdict is authoritative. |
| Semantic tier | Tier 4. The match is a paraphrase / self-ID hint; the answer LLM verifies before grounding. |
Resolver output¶
| Term | Meaning |
|---|---|
LookupSignal |
Resolver-produced summary written to RoseChatState.lookup_signal. Carries the top-1 fields PLUS the full candidates list, qualification semantics, and matched_query. One signal per turn — the runtime surface the answer node sees. |
LookupCandidate |
One row inside LookupSignal.candidates. Used to render alternates and to detect multi-entity / ambiguity. |
LookupMatchResult |
Per-tool RPC return value, shaped for an LLM tool-binding flow that is not currently wired into the answer node. Kept for a possible future where lookup tools are bound directly on the answer LLM; today's runtime path goes resolver → state → skill, not tools. |
Render modes¶
| Term | When it fires | Behavior |
|---|---|---|
| AUTHORITATIVE block | Match came from a deterministic tier (1–3). Rendered from lookup_result_authoritative/SKILL.md. |
LLM must ground on the verdict; canonical name + key_label used verbatim. |
| CANDIDATE block | Match came from the semantic tier (4). Rendered from lookup_result_candidate/SKILL.md. |
LLM verifies the match is the visitor's real topic before grounding. Default-reject on third-person / research framing. |
| Clear winner | top.confidence ≥ 0.85 AND top - runner_up ≥ 0.01. |
Singular grounding + optional profiling write. |
| Multi-entity | 2+ deterministic candidates with distinct entry_ids (e.g. "hospitals AND schools"). |
Every match rendered in the prompt; LLM grounds on each. No profiling write. |
| Ambiguous | Several semantic candidates, no clear winner. | When disambiguation_enabled=true on the table, lookup_disambiguation skill fires — agent lists candidates and asks 👉. |
Qualification semantics¶
| Term | Meaning |
|---|---|
qualifies |
Per-verdict CTA gate signal. true = demo allowed, false = demo blocked, null = neutral. |
disqualification_reason |
Token stored when qualifies=false, fed into cta_blocking_fields / interest signals. |
| Signal latching | A qualifies=false signal persists on state across follow-up turns until something explicitly overwrites it. Prevents a generic follow-up from accidentally re-enabling the demo CTA. |
query_field |
Table-level setting. When set, the resolver builds its match query from ProfilingState.collected_values[query_field] (extracted upstream by early_profile_extractor) instead of the raw visitor message. When the value is empty, the table is skipped this turn. |
populates_field |
Table-level setting. When a clear-winner match resolves, canonical_name is written into ProfilingState.collected_values[populates_field], satisfying that field for qualification without asking 👉. |
| Lookup-owned field | Any profiling field appearing as populates_field on a bound lookup table. early_profile_extractor strips lookup-owned fields from its prompt so the LLM never writes them — the resolver is the single source of truth for the canonical value. |
Behavioral guards¶
| Term | Meaning |
|---|---|
| Third-person guard | Pre-resolver heuristic (looks_like_third_person) that short-circuits the matcher when the visitor uses research / curiosity / topical framing ("how do casinos use chatbots", "for a blog post I'm writing"). Prevents incidental mentions from producing a coverage verdict. |
| CANDIDATE verification | Defensive prompt directive on semantic-tier hits. Default-reject unless the visitor is speaking AS / FROM the topic in first-person. |
Related¶
docs/src/backend/skills.md— generic skill-system primer (frontmatter, categories, selection).- Data skill:
backend/apps/shared_data/prompts/website-agent/skills/system_context/lookup_data/SKILL.md(pure placeholder render). - Per-client decision skills (examples):
clients/payfit.com/response_handling/industry_coverage/SKILL.md,clients/testfeatures.com/response_handling/industry_coverage/SKILL.md. docs/src/agents/website-agent/qualification-flow.md— howqualifies/disqualification_reasonflow into CTA gating.docs/src/unified-config-system.md— config vs lookup boundary.- Source:
backend/packages/ixchat/ixchat/tools/lookup/(factory, models, normalize, renderer). - E2E eval:
feature-lookupdataset,rose-eval features run lookup.