IXRag Package

RAG (Retrieval-Augmented Generation) with LightRAG integration, document limiting, and reranking.

Overview

The ixrag package provides the retrieval layer for the Rose platform. It integrates with LightRAG to provide hybrid retrieval combining:

  • Text Chunks: Traditional vector similarity search
  • Entities: Knowledge graph entities extracted from documents
  • Relationships: Connections between entities in the knowledge graph

Key Components

  • LangGraph Retriever: Main retrieval orchestration with reranking
  • Document Limiter: Intelligent document allocation and limiting
  • Document Processor: Converts LightRAG responses to LangChain Documents
  • Multi-Tenant Support: Shared singleton with per-request tenant isolation via contextvars

Reranking & Limiting Configuration

Configuration is defined in environment config files (e.g., development.toml, staging.toml, production.toml).

Config Options

| Option | Type | Default | Description |
|---|---|---|---|
| mode | string | "mix" | RAG mode: global, local, hybrid, mix |
| rerank_enabled | bool | false | Enable Cohere/Jina reranking |
| limiter_enabled | bool | true | Enable type-based limiting (fallback when reranking disabled) |
| rerank_top_k | int | 20 | Max documents returned (used by both reranking and limiting) |
| rerank_provider | string | "cohere" | Reranking provider: cohere or jina |
| rerank_model | string | "rerank-v3.5" | Model name for reranking |
| relationships_enabled | bool | true | Include relationship documents in results |
| graph_top_k | int | 60 | Initial retrieval limit for entities/relationships |
| chunk_top_k | int | 20 | Initial retrieval limit for chunks |

Example Configuration

[lightrag]
mode = "mix"
rerank_enabled = true
limiter_enabled = true
rerank_top_k = 12
rerank_provider = "cohere"
rerank_model = "rerank-v3.5"
relationships_enabled = true
graph_top_k = 10
chunk_top_k = 15

Behavior Matrix

| rerank_enabled | limiter_enabled | Behavior |
|---|---|---|
| true | - | Cohere/Jina reranks all documents together, returns top N by relevance |
| false | true | Type-based allocation by RAG mode (see below) |
| false | false | All documents returned without limiting |
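
The matrix above reduces to a small decision function. The following is a minimal sketch of that precedence only; `PostProcessing` and `select_post_processing` are hypothetical names used for illustration and are not part of ixrag.

```python
from enum import Enum


class PostProcessing(Enum):
    """Hypothetical labels for the three rows of the behavior matrix."""

    RERANK = "rerank"          # Cohere/Jina reranks all documents together
    TYPE_LIMIT = "type_limit"  # type-based allocation by RAG mode
    NONE = "none"              # all documents returned without limiting


def select_post_processing(rerank_enabled: bool, limiter_enabled: bool) -> PostProcessing:
    """Reranking takes precedence; limiter_enabled only matters when it is off."""
    if rerank_enabled:
        return PostProcessing.RERANK
    if limiter_enabled:
        return PostProcessing.TYPE_LIMIT
    return PostProcessing.NONE
```

Note that `limiter_enabled` is ignored whenever `rerank_enabled` is true, which is what the "-" in the first row expresses.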

Type-Based Allocation

When rerank_enabled=false and limiter_enabled=true, documents are allocated by type based on the RAG mode:

| Mode | Chunks | Entities | Relationships |
|---|---|---|---|
| global | 30% | 35% | 35% |
| local | 60% | 20% | 20% |
| hybrid / mix | 30% | 30% | 40% |

The allocation percentages determine how the rerank_top_k budget is distributed across document types.
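
As an illustration of how a budget could be split, the sketch below applies the percentages from the table to a given rerank_top_k. The function name `allocate_budget` and the rounding rule (floor each share, then give leftover slots to the largest remainders) are assumptions; the actual limiter in document_limiter.py may round differently.

```python
# Percentages copied from the allocation table; kept as integers to avoid float rounding.
ALLOCATION_PERCENT: dict[str, dict[str, int]] = {
    "global": {"chunk": 30, "entity": 35, "relationship": 35},
    "local": {"chunk": 60, "entity": 20, "relationship": 20},
    "hybrid": {"chunk": 30, "entity": 30, "relationship": 40},
    "mix": {"chunk": 30, "entity": 30, "relationship": 40},
}


def allocate_budget(mode: str, top_k: int) -> dict[str, int]:
    """Split top_k slots across document types (hypothetical rounding rule)."""
    slots: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for doc_type, pct in ALLOCATION_PERCENT[mode].items():
        # Whole slots and the leftover fraction (in hundredths of a slot).
        slots[doc_type], remainders[doc_type] = divmod(top_k * pct, 100)

    # Hand out slots lost to flooring, largest remainder first.
    leftover = top_k - sum(slots.values())
    for doc_type in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        slots[doc_type] += 1
    return slots
```

With the default rerank_top_k of 20 in local mode, this yields 12 chunks, 4 entities, and 4 relationships.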

Chunking Strategy

Knowledge base documents are split into chunks before they are embedded and indexed in Mongo (chunks_vdb) and Neo4j. The chunker that runs during ingestion is configured globally on every LightRAG instance and is the same one for both the API server's shared singleton and the document loader's per-tenant instances.

Default: heading-aware markdown chunking

The default chunker is markdown_aware_chunking (ixrag/lightrag/markdown_chunking.py). It replaces LightRAG's plain token-based splitter with markdown-structure awareness.

Behaviour summary:

  1. Heading-bounded sections. A markdown heading (# through ######) ends the current section and starts a new one. The heading hierarchy is tracked as a stack: a level-N heading pops the stack until the top is shallower than N, then pushes the new heading.
  2. Heading-context prefix. Each chunk is prepended with [Section: H1 > H2 > …]\n\n so heading words participate in embedding similarity. Without this, a chunk that starts mid-section has no topical signal.
  3. Long-section subdivision. When a section's body exceeds chunk_token_size, it is windowed with overlap (same math as upstream chunking_by_token_size). The prefix is repeated on every sub-chunk and the window size is shrunk by the prefix cost so the final chunk stays under the limit.
  4. Code-fence safety. Lines inside ``` / ~~~ blocks are never treated as headings.
  5. Fallback to upstream. If the content has no markdown headings — or the caller passed an explicit split_by_character — the chunker delegates to chunking_by_token_size. Plain-text docs behave exactly as before.
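
Items 1 and 2 above can be sketched as follows. This is an illustration of the stack rule and the prefix format, not the code in markdown_chunking.py: `push_heading` and `format_prefix` are hypothetical names, and returning an empty prefix for content before the first heading is an assumption.

```python
Heading = tuple[int, str]  # (level, text)


def push_heading(stack: list[Heading], level: int, text: str) -> None:
    """Pop until the top of the stack is shallower than `level`, then push."""
    while stack and stack[-1][0] >= level:
        stack.pop()
    stack.append((level, text))


def format_prefix(stack: list[Heading]) -> str:
    """Render the heading chain as a context prefix (levels dropped, text only)."""
    if not stack:
        return ""  # assumption: preamble content gets no prefix
    return "[Section: " + " > ".join(text for _, text in stack) + "]\n\n"


if __name__ == "__main__":
    stack: list[Heading] = []
    for level, text in [(1, "Billing"), (2, "Plans"), (3, "Enterprise"), (2, "Refunds")]:
        push_heading(stack, level, text)
    # "Refunds" (h2) replaced both "Plans" (h2) and "Enterprise" (h3).
    print(repr(format_prefix(stack)))
```

The `>=` comparison is what makes a sibling heading replace the previous one at the same level instead of nesting under it.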

Anti-fragmentation guards

Pure heading splitting over-fragments on real-world KB docs (h4+ structural noise, near-empty sections). Two constants in markdown_chunking.py keep the output sane:

| Constant | Default | Purpose |
|---|---|---|
| MAX_SPLIT_DEPTH | 3 | Headings deeper than h3 (####+) stay as plain body lines instead of starting a new chunk. Headings below h3 tend to be implementation noise (param tables, sub-notes). |
| MIN_SECTION_TOKENS | 16 | Sections whose body is shorter than 16 tokens back-merge into the previous chunk. The merged section's heading is preserved inline as a ### text body line so its context is not lost. The surviving chunk's prefix collapses to the longest common ancestor of the two heading chains. |

Tuning the chunker

The chunker is registered in two places — both LightRAG(...) constructor sites pass chunking_func=markdown_aware_chunking:

| File | Path used by |
|---|---|
| ixrag/lightrag/rag_instance_manager.py (_create_rag_instance) | API server's shared singleton + legacy per-tenant CLI path |
| ixrag/document_pipeline/storage_manager.py (DocumentProcessor._initialize_rag) | rose-document-loader ingestion path |

To change the chunking method:

  1. Tweak existing thresholds. Adjust MAX_SPLIT_DEPTH or MIN_SECTION_TOKENS at the top of markdown_chunking.py. Re-run poetry run pytest packages/ixrag/tests/test_markdown_chunking.py -v and the KB sample sweep (see below) before merging.
  2. Swap the algorithm entirely. Write a new function with the same signature:

    def my_chunker(
        tokenizer: Tokenizer,
        content: str,
        split_by_character: str | None = None,
        split_by_character_only: bool = False,
        chunk_overlap_token_size: int = 100,
        chunk_token_size: int = 512,
    ) -> list[dict[str, Any]]:
        # return [{"content": str, "tokens": int, "chunk_order_index": int}, ...]
        ...
    

    Then replace chunking_func=markdown_aware_chunking with chunking_func=my_chunker in both call sites above. Forgetting one causes the API server and the loader to disagree on chunk shape, which silently corrupts retrieval until the next full re-ingest.

  3. Revert to upstream behaviour. Remove the chunking_func=… kwarg from both call sites; LightRAG falls back to its default chunking_by_token_size.

After any change, force-reingest a representative tenant to refresh the chunks:

cd backend
rose-document-loader update-tenant <tenant>.com --env test \
    --force-update-docs <doc-id>

The chunks_vdb collection in Mongo for that tenant + env will be repopulated with the new chunk shapes. See rose-document-loader for the full CLI reference.

Validating the chunker on real KB samples

When tuning thresholds, sanity-check the effect on multiple KBs at once. Existing *-kb-dump directories under data/ in sibling Conductor workspaces are good fixtures. A small script:

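import re
from pathlib import Path

from lightrag.utils import TiktokenTokenizer
from ixrag.lightrag.markdown_chunking import markdown_aware_chunking

KB_DIRS: list[str] = []  # fill in: paths to kb-dump dirs
# Assumes frontmatter is a leading block delimited by --- lines
FRONTMATTER_RE = re.compile(r"\A---\r?\n.*?\r?\n---\r?\n", re.DOTALL)

tok = TiktokenTokenizer()
for kb in KB_DIRS:
    for f in Path(kb).rglob("*.md"):
        # strip frontmatter (--- ... ---) before chunking
        text = FRONTMATTER_RE.sub("", f.read_text(encoding="utf-8"), count=1)
        chunks = markdown_aware_chunking(
            tok, text, chunk_token_size=512, chunk_overlap_token_size=100
        )
        # inspect len(chunks), token-size buckets, tiny-chunk ratio
        print(f, len(chunks))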

Two metrics that matter:

  • Chunk count vs. file count — should grow with structural density, not explode (target ≤4× chunks per file on average).
  • Sub-32-token chunk ratio — orphan tiny chunks pollute retrieval. Target <5%.
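
Both metrics can be computed from the chunk dicts the chunker returns. A minimal sketch, assuming the per-file chunk lists have been collected into one list; `chunking_metrics` is a hypothetical helper, not part of the package.

```python
from typing import Any


def chunking_metrics(
    chunks_per_file: list[list[dict[str, Any]]],
    tiny_threshold: int = 32,
) -> dict[str, float]:
    """Summarise a sweep: average chunks per file and share of tiny chunks.

    Each inner list holds the chunk dicts for one file; only the "tokens"
    key is read.
    """
    all_chunks = [chunk for chunks in chunks_per_file for chunk in chunks]
    n_files = len(chunks_per_file)
    n_chunks = len(all_chunks)
    n_tiny = sum(1 for chunk in all_chunks if chunk["tokens"] < tiny_threshold)
    return {
        "chunks_per_file": n_chunks / n_files if n_files else 0.0,
        "tiny_ratio": n_tiny / n_chunks if n_chunks else 0.0,
    }
```

Compare `chunks_per_file` against the ≤4 target and `tiny_ratio` against the <0.05 target before and after changing a threshold.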

Method reference

All in backend/packages/ixrag/ixrag/lightrag/markdown_chunking.py. Pipeline: parse → merge → emit. Only markdown_aware_chunking is exported.

Constants

Name Value Purpose
_HEADING_RE r"^(#{1,6})\s+(.+?)\s*#*\s*$" ATX heading matcher, strips trailing closing hashes.
_CODE_FENCE_RE r"^\s*(```|~~~)" Fence delimiter, toggles in_fence.
MAX_SPLIT_DEPTH 3 Headings deeper stay inline as body.
MIN_SECTION_TOKENS 16 Sections smaller back-merge into previous.

_Section — dataclass with heading_chain: list[(level, text)] (hierarchy snapshot, empty for preamble) and body_lines: list[str].

_parse_sections(content, max_split_depth) -> list[_Section] — single-pass walker. Tracks fence state + heading stack. Headings inside fences or deeper than max_split_depth stay as body. On a split heading: pops stack until top level < N, pushes new, starts fresh section. Returns splits + 1 sections.

_format_prefix(heading_chain) -> str, e.g. [(1,"A"),(3,"B")] → "[Section: A > B]\n\n". Levels dropped, text only.

_common_prefix(a, b) — longest shared head of two chains. Used when back-merging collapses the surviving chain to the common ancestor so the prefix doesn't claim a heading it no longer owns.
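
The longest-shared-head computation is short enough to show in full. This is an illustrative re-implementation under the chain representation described above (a list of (level, text) pairs), not a copy of the private helper.

```python
Heading = tuple[int, str]  # (level, text)


def common_prefix(a: list[Heading], b: list[Heading]) -> list[Heading]:
    """Return the longest leading run of headings shared by both chains."""
    shared: list[Heading] = []
    for left, right in zip(a, b):
        if left != right:
            break
        shared.append(left)
    return shared
```

For example, merging a section under A > B > D into a section under A > B > C leaves the surviving chunk with the chain A > B.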

_merge_undersized(sections, tokenizer, min_section_tokens) — folds tiny sections into prev. Per section: skip empty bodies; if body_tokens < min and merged non-empty → append blank line + reconstructed # heading + body to prev, reduce prev's chain via _common_prefix. Otherwise append. First section can't back-merge.

_has_markdown_headings(content) -> bool — cheap pre-flight, scans for first non-fenced heading. Lets the entrypoint short-circuit on plain text.
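
A fence-aware pre-flight along these lines could look like the sketch below. It reuses the heading pattern from the constants table; the fence pattern is written with `{3}` quantifiers, which match the same delimiters as the documented one. The function body is an assumption about how the scan works, not the package's code.

```python
import re

# Same heading pattern as _HEADING_RE in the constants table.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
# Three backticks or three tildes at the start of a line (after optional whitespace).
CODE_FENCE_RE = re.compile(r"^\s*(`{3}|~{3})")


def has_markdown_headings(content: str) -> bool:
    """Return True at the first heading that is not inside a code fence."""
    in_fence = False
    for line in content.splitlines():
        if CODE_FENCE_RE.match(line):
            in_fence = not in_fence  # any fence delimiter toggles the state
            continue
        if not in_fence and HEADING_RE.match(line):
            return True
    return False
```

Because the heading pattern requires whitespace after the hashes, a line such as "#hashtag" is not treated as a heading.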

markdown_aware_chunking(...) — public entrypoint, matches LightRAG chunking_func signature:

def markdown_aware_chunking(
    tokenizer: Tokenizer,
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    chunk_overlap_token_size: int = 100,
    chunk_token_size: int = 512,
) -> list[dict[str, Any]]:

Returns [{"content", "tokens", "chunk_order_index"}, ...]. Control flow:

  1. split_by_character set → delegate to chunking_by_token_size.
  2. No headings → delegate to chunking_by_token_size.
  3. Otherwise → _parse_sections → _merge_undersized → emit.

Emit per section: window = max(1, chunk_token_size - prefix_tokens). Short body → one chunk. Long body → slide window with step = window - overlap, decode each slice, repeat prefix, break when slice reaches end.

Note: tokens in the returned dict is recomputed via tokenizer.encode(chunk_content) to reflect the final string including prefix, not just the body slice length.
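
The windowing arithmetic can be checked in isolation by computing slice boundaries over token indices. This is a sketch of the math described above, not the emit code itself: `window_slices` is a hypothetical name, and clamping the step to at least 1 is an added safeguard for the case where the overlap is not smaller than the window.

```python
def window_slices(
    n_body_tokens: int,
    prefix_tokens: int,
    chunk_token_size: int = 512,
    chunk_overlap_token_size: int = 100,
) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for one section's body."""
    # Shrink the window by the prefix cost so prefix + slice fits the limit.
    window = max(1, chunk_token_size - prefix_tokens)
    if n_body_tokens <= window:
        return [(0, n_body_tokens)]  # short body: one chunk

    # Assumption: clamp the step so the loop always advances.
    step = max(1, window - chunk_overlap_token_size)
    slices: list[tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, n_body_tokens)
        slices.append((start, end))
        if end >= n_body_tokens:  # slice reached the end of the body
            break
        start += step
    return slices
```

For a 1000-token body with a 12-token prefix and the default sizes, the window is 500 and the step is 400, giving the slices (0, 500), (400, 900), and (800, 1000).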

Technical Details

Why We Call Cohere Ourselves

LightRAG has built-in reranking support, but it's bypassed when using only_need_context=True (which we use to get raw context without LLM generation). The retrieval flow is:

  1. LightRAG Query: Call with only_need_context=True to get raw documents
  2. Document Processing: Convert LightRAG response to LangChain Documents
  3. Reranking (if enabled): Call Cohere/Jina to rerank all documents by query relevance
  4. Limiting (fallback): Apply type-based allocation if reranking unavailable

This approach gives us:

  • Full control over the reranking process
  • Unified reranking across all document types (chunks, entities, relationships)
  • Ability to use the latest Cohere/Jina models

Document Types

Each document returned has a document_type in its metadata:

| Type | Source | Description |
|---|---|---|
| chunk | Vector search | Text chunks from indexed documents |
| entity | Knowledge graph | Extracted entities (people, companies, concepts) |
| relationship | Knowledge graph | Connections between entities |

Key Files

File Description
ixrag/lightrag/langgraph_retriever.py Main retriever with reranking logic
ixrag/lightrag/document_limiter.py Document limiting and allocation strategies
ixrag/lightrag/document_processor.py Converts LightRAG responses to Documents
ixrag/lightrag/lightrag_llm.py LLM and reranking function factories
ixrag/lightrag/rag_instance_manager.py Shared singleton + per-tenant instance management

Integration Points

System Purpose
LightRAG Hybrid retrieval (vector + graph)
MongoDB Vector storage backend
Neo4j Graph storage backend
Cohere/Jina Document reranking
LangFuse Observability and tracing

Usage

The retriever is typically accessed through ixchat, but can be used directly:

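import asyncio

from ixrag.lightrag.langgraph_retriever import LangGraphRetriever


async def main() -> None:
    # Create retriever
    retriever = LangGraphRetriever(
        site_name="example-site",
        rag_mode="mix",
    )

    # Retrieve documents (ainvoke is a coroutine, so it must be awaited)
    documents = await retriever.ainvoke("What are your pricing plans?")

    # Each document has:
    # - page_content: The text content
    # - metadata: {document_type, source, rerank_score (if reranked)}
    for doc in documents:
        print(doc.metadata["document_type"], doc.page_content[:80])


asyncio.run(main())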

Shared Singleton Architecture (IX-1578)

Problem

Before IX-1578, every tenant got its own LightRAG instance. LightRAG initialization is expensive: it creates Neo4j drivers, MongoDB clients, loads embedding functions, and calls initialize_storages(). With more clients onboarding:

  • Slow cold-start: Each new tenant required full LightRAG init, adding seconds to TTFT on the first request.
  • Connection explosion: N tenants = N sets of connection pools.
  • Memory pressure: N large LightRAG objects cached in memory.

Solution

One shared LightRAG instance, with tenant identity injected per-request via ContextVar.

The key insight: LightRAG itself is stateless with respect to tenant identity. All tenant filtering happens in the storage layer. So instead of one instance per tenant, there's a single shared instance, and storage classes resolve tenant_id dynamically from the current async task's context.

API request for "hexa.com"
    |
    +-- get_tenant_context("hexa.com")      # pure string op, cached
    |       -> TenantContext(mongo="tenant_hexa_com", neo4j="hexa_com")
    |
    +-- set_tenant(ctx)                      # bind to current async task
    |
    +-- get_shared_rag_instance()            # return the one singleton
    |
    +-- rag.aquery(...)
            |
            +-- Neo4jStorage.tenant_id -> get_tenant() -> "hexa_com"
            |       -> WHERE n.tenantId = "hexa_com"
            |
            +-- MongoStorage.tenant_id -> get_tenant() -> "tenant_hexa_com"
                    -> {"tenantId": "tenant_hexa_com"}

Components

TenantContext (ixinfra/tenant_context.py): A ContextVar providing task-local tenant identity in async code. Each asyncio.Task gets its own copy. Lives in ixinfra because it's a pure dataclass with zero storage imports.

@dataclass(frozen=True, slots=True)
class TenantContext:
    company_name: str
    mongo_tenant_id: str   # e.g., "tenant_hexa_com"
    neo4j_tenant_id: str   # e.g., "hexa_com"

Singleton Factory (ixrag/lightrag/rag_instance_manager.py): Two code paths:

| Path | Function | Use Case |
|---|---|---|
| API server | get_shared_rag_instance() | Returns the single shared instance. Double-checked locking. |
| CLI / doc loader | get_rag_instance(working_dir, key) | Per-tenant instances for write operations (document indexing). |

get_tenant_context() replaces old DB round-trips: previously computing a tenant ID required opening a synchronous Neo4j connection just to sanitize a string. Now it's a pure in-process string operation, cached in _tenant_context_cache.

Dual-Mode tenant_id Property: Both TenantAwareNeo4JStorage and BaseTenantAwareStorage (MongoDB) resolve tenant identity dynamically. The Neo4j variant is shown below; the MongoDB variant returns ctx.mongo_tenant_id instead:

@property
def tenant_id(self) -> str:
    ctx = get_tenant()           # check contextvar
    if ctx is not None:
        return ctx.neo4j_tenant_id  # API server: contextvar wins
    return self._tenant_id          # CLI/test: instance attribute fallback

Write Guard: The shared singleton is read-only. If a write is attempted through it, upsert_node and upsert_edge in TenantAwareNeo4JStorage raise RuntimeError when the contextvar tenant mismatches the instance tenant ("__shared__").

Neo4j Driver Manager (ixneo4j/driver_manager.py): A ref-counted singleton AsyncDriver. Even with multiple LightRAG instances (CLI path), there's only one Neo4j connection pool per process.
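
Reference counting of a shared resource can be sketched generically. The class below is a synchronous illustration with a placeholder resource; `RefCounted`, its method names, and the use of `threading.Lock` are assumptions, and the real manager wraps an AsyncDriver.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class RefCounted(Generic[T]):
    """Create a resource on first acquire, close it on last release."""

    def __init__(self, factory: Callable[[], T], close: Callable[[T], None]) -> None:
        self._factory = factory
        self._close = close
        self._lock = threading.Lock()
        self._resource = None  # holds a T once created
        self._refs = 0

    def acquire(self) -> T:
        with self._lock:
            if self._resource is None:
                self._resource = self._factory()
            self._refs += 1
            return self._resource

    def release(self) -> None:
        with self._lock:
            if self._refs == 0:
                raise RuntimeError("release() without matching acquire()")
            self._refs -= 1
            if self._refs == 0:
                # Last user gone: close and forget the resource.
                self._close(self._resource)
                self._resource = None
```

Every holder calls `acquire()` once and `release()` once; the underlying pool exists only while at least one holder is active.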

Service Orchestration (ixchat/service.py): IXChatbotService ties everything together — calls get_tenant_context(), set_tenant(), then get_shared_rag_instance(). The eviction system only evicts lightweight per-tenant chatbot wrappers; the shared singleton is never evicted.

Tenant Isolation

Both Neo4j and MongoDB enforce isolation at every query:

  • Neo4j: All MATCH queries inject WHERE n.tenantId = $_tenant_id via _add_tenant_filter_to_query(). Write operations buffer nodes/edges then flush with UNWIND MERGE ... tenantId = $tenant_id.
  • MongoDB: All queries add {"tenantId": self.tenant_id} filter. Document IDs are composite ("tenant_hexa_com:original-chunk-id") to prevent collisions across tenants.

Key Files

File Description
ixinfra/tenant_context.py TenantContext dataclass + ContextVar
ixrag/lightrag/rag_instance_manager.py Shared singleton + per-tenant factory
ixneo4j/tenant_storage.py Dual-mode tenant_id, write guard
ixmongo/tenant_storage.py Dual-mode tenant_id for MongoDB
ixneo4j/driver_manager.py Ref-counted Neo4j AsyncDriver singleton
ixchat/service.py IXChatbotService orchestration + eviction

Data Consistency

The package includes tools for monitoring and maintaining consistency between MongoDB and Neo4j storage backends. See the ixrag/lightrag/ directory for:

  • cli_consistency_check.py - Check data consistency
  • reconcile_entities.py - Sync missing entities
  • diagnose_entity_mapping.py - Diagnose entity name issues