IXRag Package¶

RAG (Retrieval-Augmented Generation) with LightRAG integration, document limiting, and reranking.

Overview¶

The ixrag package provides the retrieval layer for the Rose platform. It integrates with LightRAG to provide hybrid retrieval combining:

Text Chunks: Traditional vector similarity search
Entities: Knowledge graph entities extracted from documents
Relationships: Connections between entities in the knowledge graph

Key Components¶

LangGraph Retriever: Main retrieval orchestration with reranking
Document Limiter: Intelligent document allocation and limiting
Document Processor: Converts LightRAG responses to LangChain Documents
Multi-Tenant Support: Shared singleton with per-request tenant isolation via contextvars

Reranking & Limiting Configuration¶

Configuration is defined in environment config files (e.g., development.toml, staging.toml, production.toml).

Config Options¶

Option	Type	Default	Description
`mode`	string	`"mix"`	RAG mode: `global`, `local`, `hybrid`, `mix`
`rerank_enabled`	bool	`false`	Enable Cohere/Jina reranking
`limiter_enabled`	bool	`true`	Enable type-based limiting (fallback when reranking disabled)
`rerank_top_k`	int	`20`	Max documents returned (used by both reranking and limiting)
`rerank_provider`	string	`"cohere"`	Reranking provider: `cohere` or `jina`
`rerank_model`	string	`"rerank-v3.5"`	Model name for reranking
`relationships_enabled`	bool	`true`	Include relationship documents in results
`graph_top_k`	int	`60`	Initial retrieval limit for entities/relationships
`chunk_top_k`	int	`20`	Initial retrieval limit for chunks

Example Configuration¶

[lightrag]
mode = "mix"
rerank_enabled = true
limiter_enabled = true
rerank_top_k = 12
rerank_provider = "cohere"
rerank_model = "rerank-v3.5"
relationships_enabled = true
graph_top_k = 10
chunk_top_k = 15

Behavior Matrix¶

`rerank_enabled`	`limiter_enabled`	Behavior
`true`	-	Cohere/Jina reranks all documents together, returns top N by relevance
`false`	`true`	Type-based allocation by RAG mode (see below)
`false`	`false`	All documents returned without limiting
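
The matrix above reduces to a small decision function. This is a minimal sketch of that logic only; the function name and the returned labels are hypothetical and do not come from the ixrag code.

```python
def select_strategy(rerank_enabled: bool, limiter_enabled: bool) -> str:
    """Return which post-retrieval step applies, following the behavior matrix."""
    if rerank_enabled:
        # Reranking takes precedence; limiter_enabled is ignored in this case.
        return "rerank"
    if limiter_enabled:
        return "type_allocation"
    return "passthrough"


for flags in [(True, False), (True, True), (False, True), (False, False)]:
    print(flags, "->", select_strategy(*flags))
```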

Type-Based Allocation¶

When rerank_enabled=false and limiter_enabled=true, documents are allocated by type based on the RAG mode:

Mode	Chunks	Entities	Relationships
`global`	30%	35%	35%
`local`	60%	20%	20%
`hybrid` / `mix`	30%	30%	40%

The allocation percentages determine how the rerank_top_k budget is distributed across document types.
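
To make the budget split concrete, here is a sketch that turns the percentages into per-type slot counts. The rounding rule (largest remainder, so the slots always sum to `rerank_top_k`) is an assumption; this document does not say how the real limiter rounds.

```python
# Percentages copied from the allocation table.
ALLOCATION_PERCENT = {
    "global": {"chunk": 30, "entity": 35, "relationship": 35},
    "local": {"chunk": 60, "entity": 20, "relationship": 20},
    "hybrid": {"chunk": 30, "entity": 30, "relationship": 40},
    "mix": {"chunk": 30, "entity": 30, "relationship": 40},
}


def allocate_budget(mode: str, top_k: int) -> dict[str, int]:
    """Split top_k across document types using integer math."""
    budget: dict[str, int] = {}
    remainders: list[tuple[int, str]] = []
    for doc_type, percent in ALLOCATION_PERCENT[mode].items():
        whole, remainder = divmod(top_k * percent, 100)
        budget[doc_type] = whole
        remainders.append((remainder, doc_type))
    leftover = top_k - sum(budget.values())
    # Assumed rounding: give leftover slots to the largest fractional parts.
    for _, doc_type in sorted(remainders, key=lambda item: -item[0])[:leftover]:
        budget[doc_type] += 1
    return budget


print(allocate_budget("local", 20))
print(allocate_budget("mix", 12))
```

With `rerank_top_k = 20` in `local` mode the split is exact (12 chunks, 4 entities, 4 relationships); budgets that do not divide evenly depend on the rounding rule chosen.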

Chunking Strategy¶

Knowledge base documents are split into chunks before they are embedded and indexed in Mongo (chunks_vdb) and Neo4j. The chunker that runs during ingestion is configured globally on every LightRAG instance and is the same one for both the API server's shared singleton and the document loader's per-tenant instances.

Default: heading-aware markdown chunking¶

The default chunker is markdown_aware_chunking (ixrag/lightrag/markdown_chunking.py). It replaces LightRAG's plain token-based splitter with markdown-structure awareness.

Behaviour summary:

Heading-bounded sections. A markdown heading (#…######) ends the current section and starts a new one. The heading hierarchy is tracked as a stack — a level-N heading pops the stack until the top is shallower than N, then pushes the new heading.
Heading-context prefix. Each chunk is prepended with [Section: H1 > H2 > …]\n\n so heading words participate in embedding similarity. Without this, a chunk that starts mid-section has no topical signal.
Long-section subdivision. When a section's body exceeds chunk_token_size, it is windowed with overlap (same math as upstream chunking_by_token_size). The prefix is repeated on every sub-chunk and the window size is shrunk by the prefix cost so the final chunk stays under the limit.
Code-fence safety. Lines inside ``` / ~~~ blocks are never treated as headings.
Fallback to upstream. If the content has no markdown headings — or the caller passed an explicit split_by_character — the chunker delegates to chunking_by_token_size. Plain-text docs behave exactly as before.
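
The heading stack, the prefix format, and the code-fence rule can be sketched together in a few lines. This is an illustration of the described behaviour, not the code in markdown_chunking.py: the regular expressions and the function name `heading_prefixes` are assumptions, and the depth limit and section bodies are left out.

```python
import re

# Assumed patterns: an ATX heading with optional closing hashes, and a fence line.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
FENCE_RE = re.compile(r"^\s*(`{3}|~{3})")


def heading_prefixes(content: str) -> list[str]:
    """Return the [Section: ...] prefix in effect after each real heading."""
    stack: list[tuple[int, str]] = []
    prefixes: list[str] = []
    in_fence = False
    for line in content.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # every fence delimiter toggles the state
            continue
        match = None if in_fence else HEADING_RE.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        # Pop until the top of the stack is shallower than the new heading.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
        chain = " > ".join(text for _, text in stack)
        prefixes.append(f"[Section: {chain}]\n\n")
    return prefixes


sample = "# A\nintro\n## B\n~~~\n# not a heading\n~~~\n### C\n## D\n"
for prefix in heading_prefixes(sample):
    print(prefix.strip())
```

The `# not a heading` line sits inside a `~~~` block and is skipped, and `## D` pops both `C` and `B` before being pushed under `A`.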

Anti-fragmentation guards¶

Pure heading splitting over-fragments on real-world KB docs (h4+ structural noise, near-empty sections). Two constants in markdown_chunking.py keep the output sane:

Constant	Default	Purpose
`MAX_SPLIT_DEPTH`	`3`	Headings deeper than h3 (`####+`) stay as plain body lines instead of starting a new chunk. Headings below h3 tend to be implementation noise (param tables, sub-notes).
`MIN_SECTION_TOKENS`	`16`	Sections whose body is shorter than 16 tokens back-merge into the previous chunk. The merged section's heading is preserved inline as a `### text` body line so its context is not lost. The surviving chunk's prefix collapses to the longest common ancestor of the two heading chains.

Tuning the chunker¶

The chunker is registered in two places — both LightRAG(...) constructor sites pass chunking_func=markdown_aware_chunking:

File (function)	Used by
`ixrag/lightrag/rag_instance_manager.py` (`_create_rag_instance`)	API server's shared singleton + legacy per-tenant CLI path
`ixrag/document_pipeline/storage_manager.py` (`DocumentProcessor._initialize_rag`)	`rose-document-loader` ingestion path

To change the chunking method:

Tweak existing thresholds. Adjust MAX_SPLIT_DEPTH or MIN_SECTION_TOKENS at the top of markdown_chunking.py. Re-run poetry run pytest packages/ixrag/tests/test_markdown_chunking.py -v and the KB sample sweep (see below) before merging.
Swap the algorithm entirely. Write a new function with the same signature:
```
from typing import Any

from lightrag.utils import Tokenizer


def my_chunker(
    tokenizer: Tokenizer,
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    chunk_overlap_token_size: int = 100,
    chunk_token_size: int = 512,
) -> list[dict[str, Any]]:
    # return [{"content": str, "tokens": int, "chunk_order_index": int}, ...]
    ...
```
Then replace chunking_func=markdown_aware_chunking with chunking_func=my_chunker in both call sites above. Forgetting one causes the API server and the loader to disagree on chunk shape, which silently corrupts retrieval until the next full re-ingest.
Revert to upstream behaviour. Remove the chunking_func=… kwarg from both call sites; LightRAG falls back to its default chunking_by_token_size.

After any change, force-reingest a representative tenant to refresh the chunks:

cd backend
rose-document-loader update-tenant <tenant>.com --env test \
    --force-update-docs <doc-id>

The chunks_vdb collection in Mongo for that tenant + env will be repopulated with the new chunk shapes. See rose-document-loader for the full CLI reference.

Validating the chunker on real KB samples¶

When tuning thresholds, sanity-check the effect on multiple KBs at once. Existing *-kb-dump directories under data/ in sibling Conductor workspaces are good fixtures. A small script:

from pathlib import Path
from lightrag.utils import TiktokenTokenizer
from ixrag.lightrag.markdown_chunking import markdown_aware_chunking

tok = TiktokenTokenizer()
for kb in [...]:  # paths to kb-dump dirs
    for f in Path(kb).rglob("*.md"):
        text = f.read_text()
        # strip frontmatter (--- ... ---) before chunking
        chunks = markdown_aware_chunking(
            tok, text, chunk_token_size=512, chunk_overlap_token_size=100
        )
        # inspect len(chunks), token-size buckets, tiny-chunk ratio

Two metrics that matter:

Chunk count vs. file count — should grow with structural density, not explode (target ≤4× chunks per file on average).
Sub-32-token chunk ratio — orphan tiny chunks pollute retrieval. Target <5%.
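
Both metrics can be computed from the chunk dicts the chunker returns. The sketch below assumes the chunks have been collected per file into a dict; the helper name `chunk_metrics` and that collection shape are choices made for this example.

```python
def chunk_metrics(
    chunks_by_file: dict[str, list[dict]], tiny_threshold: int = 32
) -> dict[str, float]:
    """Average chunks per file and the share of chunks below tiny_threshold tokens."""
    all_chunks = [c for chunks in chunks_by_file.values() for c in chunks]
    if not all_chunks:
        return {"chunks_per_file": 0.0, "tiny_ratio": 0.0}
    tiny = sum(1 for c in all_chunks if c["tokens"] < tiny_threshold)
    return {
        "chunks_per_file": len(all_chunks) / len(chunks_by_file),
        "tiny_ratio": tiny / len(all_chunks),
    }


example = {
    "a.md": [{"tokens": 10}, {"tokens": 100}],
    "b.md": [{"tokens": 200}, {"tokens": 300}],
}
metrics = chunk_metrics(example)
print(metrics)
```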

Method reference¶

All in backend/packages/ixrag/ixrag/lightrag/markdown_chunking.py. Pipeline: parse → merge → emit. Only markdown_aware_chunking is exported.

Constants

Name	Value	Purpose
`_HEADING_RE`	`r"^(#{1,6})\s+(.+?)\s*#*\s*$"`	ATX heading matcher, strips trailing closing hashes.
`_CODE_FENCE_RE`	r"^\s*(```|~~~)"	Fence delimiter, toggles `in_fence`.
`MAX_SPLIT_DEPTH`	`3`	Headings deeper stay inline as body.
`MIN_SECTION_TOKENS`	`16`	Sections smaller back-merge into previous.

_Section — dataclass with heading_chain: list[(level, text)] (hierarchy snapshot, empty for preamble) and body_lines: list[str].

_parse_sections(content, max_split_depth) -> list[_Section] — single-pass walker. Tracks fence state + heading stack. Headings inside fences or deeper than max_split_depth stay as body. On a split heading: pops stack until top level < N, pushes new, starts fresh section. Returns splits + 1 sections.

_format_prefix(heading_chain) -> str — [(1,"A"),(3,"B")] → "[Section: A > B]\n\n". Levels dropped, text only.

_common_prefix(a, b) — longest shared head of two chains. Used when back-merging collapses the surviving chain to the common ancestor so the prefix doesn't claim a heading it no longer owns.

_merge_undersized(sections, tokenizer, min_section_tokens) — folds tiny sections into prev. Per section: skip empty bodies; if body_tokens < min and merged non-empty → append blank line + reconstructed # heading + body to prev, reduce prev's chain via _common_prefix. Otherwise append. First section can't back-merge.
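
The back-merge and the common-ancestor collapse described above could look roughly like the following. It is a simplified sketch: `Section` stands in for the private `_Section` dataclass, and `count_tokens` is a hypothetical callable used here in place of the real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable

Chain = list[tuple[int, str]]


@dataclass
class Section:
    heading_chain: Chain
    body_lines: list[str] = field(default_factory=list)


def common_prefix(a: Chain, b: Chain) -> Chain:
    """Longest shared head of two heading chains."""
    shared: Chain = []
    for left, right in zip(a, b):
        if left != right:
            break
        shared.append(left)
    return shared


def merge_undersized(
    sections: list[Section],
    count_tokens: Callable[[str], int],
    min_section_tokens: int = 16,
) -> list[Section]:
    merged: list[Section] = []
    for section in sections:
        body = "\n".join(section.body_lines).strip()
        if not body:
            continue  # empty bodies are skipped
        if merged and count_tokens(body) < min_section_tokens:
            prev = merged[-1]
            if section.heading_chain:
                level, text = section.heading_chain[-1]
                # Keep the merged heading inline so its context is not lost.
                prev.body_lines += ["", f"{'#' * level} {text}"]
            prev.body_lines += section.body_lines
            prev.heading_chain = common_prefix(prev.heading_chain, section.heading_chain)
        else:
            merged.append(Section(list(section.heading_chain), list(section.body_lines)))
    return merged


sections = [
    Section([(1, "A"), (2, "B")], ["word " * 20]),
    Section([(1, "A"), (2, "C")], ["tiny"]),
]
result = merge_undersized(sections, count_tokens=lambda text: len(text.split()))
print(result[0].heading_chain, result[0].body_lines[-2:])
```

After the merge the surviving section claims only `A` as its heading chain, since `B` and `C` diverge at the second level.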

_has_markdown_headings(content) -> bool — cheap pre-flight, scans for first non-fenced heading. Lets the entrypoint short-circuit on plain text.

markdown_aware_chunking(...) — public entrypoint, matches LightRAG chunking_func signature:

def markdown_aware_chunking(
    tokenizer: Tokenizer,
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    chunk_overlap_token_size: int = 100,
    chunk_token_size: int = 512,
) -> list[dict[str, Any]]:
    ...

Returns [{"content", "tokens", "chunk_order_index"}, ...]. Control flow:

split_by_character set → delegate to chunking_by_token_size.
No headings → delegate to chunking_by_token_size.
Otherwise → _parse_sections → _merge_undersized → emit.

Emit per section: window = max(1, chunk_token_size - prefix_tokens). Short body → one chunk. Long body → slide a window with step = window - overlap, decode each slice, prepend the prefix, and stop once a slice reaches the end of the body.

Note: tokens in the returned dict is recomputed via tokenizer.encode(chunk_content) to reflect the final string including prefix, not just the body slice length.
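
The windowing arithmetic in the emit step can be shown on a plain list of token IDs. This sketch leaves out decoding and the prefix itself, and the `max(1, ...)` guard on the step is an assumption added so the loop always advances when the overlap is as large as the window.

```python
def window_token_ids(
    token_ids: list[int], chunk_token_size: int, overlap: int, prefix_tokens: int
) -> list[list[int]]:
    """Slice a section body into overlapping windows that leave room for the prefix."""
    window = max(1, chunk_token_size - prefix_tokens)
    if len(token_ids) <= window:
        return [token_ids]  # short body: a single chunk
    step = max(1, window - overlap)  # assumed guard against a non-positive step
    slices: list[list[int]] = []
    for start in range(0, len(token_ids), step):
        slices.append(token_ids[start : start + window])
        if start + window >= len(token_ids):
            break  # this slice already reaches the end of the body
    return slices


slices = window_token_ids(list(range(10)), chunk_token_size=6, overlap=2, prefix_tokens=2)
print(slices)
```

Here the prefix costs 2 of the 6 available tokens, so each window holds 4 body tokens and consecutive windows share 2 of them.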

Technical Details¶

Why We Call Cohere Ourselves¶

LightRAG has built-in reranking support, but it's bypassed when using only_need_context=True (which we use to get raw context without LLM generation). The retrieval flow is:

LightRAG Query: Call with only_need_context=True to get raw documents
Document Processing: Convert LightRAG response to LangChain Documents
Reranking (if enabled): Call Cohere/Jina to rerank all documents by query relevance
Limiting (fallback): Apply type-based allocation if reranking unavailable

This approach gives us:

Full control over the reranking process
Unified reranking across all document types (chunks, entities, relationships)
Ability to use the latest Cohere/Jina models
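
A minimal sketch of the unified reranking step, in which chunks, entities, and relationships compete in one ranked list. The `Doc` class is a stand-in for a LangChain Document, and `word_overlap` is a toy scorer used in place of a Cohere or Jina call, so no real provider API is shown here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rerank_documents(
    query: str, docs: list[Doc], score_fn: Callable[[str, str], float], top_k: int
) -> list[Doc]:
    """Score every document against the query and keep the top_k, regardless of type."""
    scored = [(score_fn(query, doc.page_content), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        Doc(doc.page_content, {**doc.metadata, "rerank_score": score})
        for score, doc in scored[:top_k]
    ]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / max(1, len(query_words))


docs = [
    Doc("pricing plans start at ten dollars", {"document_type": "chunk"}),
    Doc("Acme is a company", {"document_type": "entity"}),
    Doc("Acme offers pricing", {"document_type": "relationship"}),
]
top = rerank_documents("pricing plans", docs, word_overlap, top_k=2)
print([(d.metadata["document_type"], d.metadata["rerank_score"]) for d in top])
```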

Document Types¶

Each document returned has a document_type in its metadata:

Type	Source	Description
`chunk`	Vector search	Text chunks from indexed documents
`entity`	Knowledge graph	Extracted entities (people, companies, concepts)
`relationship`	Knowledge graph	Connections between entities

Key Files¶

File	Description
`ixrag/lightrag/langgraph_retriever.py`	Main retriever with reranking logic
`ixrag/lightrag/document_limiter.py`	Document limiting and allocation strategies
`ixrag/lightrag/document_processor.py`	Converts LightRAG responses to Documents
`ixrag/lightrag/lightrag_llm.py`	LLM and reranking function factories
`ixrag/lightrag/rag_instance_manager.py`	Shared singleton + per-tenant instance management

Integration Points¶

System	Purpose
LightRAG	Hybrid retrieval (vector + graph)
MongoDB	Vector storage backend
Neo4j	Graph storage backend
Cohere/Jina	Document reranking
LangFuse	Observability and tracing

Usage¶

The retriever is typically accessed through ixchat, but can be used directly:

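import asyncio

from ixrag.lightrag.langgraph_retriever import LangGraphRetriever


async def main() -> None:
    # Create retriever
    retriever = LangGraphRetriever(
        site_name="example-site",
        rag_mode="mix",
    )

    # Retrieve documents (ainvoke is a coroutine, so it must be awaited)
    documents = await retriever.ainvoke("What are your pricing plans?")

    # Each document has:
    # - page_content: The text content
    # - metadata: {document_type, source, rerank_score (if reranked)}
    for doc in documents:
        print(doc.metadata["document_type"], doc.page_content[:80])


asyncio.run(main())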

Shared Singleton Architecture (IX-1578)¶

Problem¶

Before IX-1578, every tenant got its own LightRAG instance. LightRAG initialization is expensive: it creates Neo4j drivers, MongoDB clients, loads embedding functions, and calls initialize_storages(). With more clients onboarding:

Slow cold-start: Each new tenant required full LightRAG init, adding seconds to TTFT on the first request.
Connection explosion: N tenants = N sets of connection pools.
Memory pressure: N large LightRAG objects cached in memory.

Solution¶

One shared LightRAG instance, with tenant identity injected per-request via ContextVar.

The key insight: LightRAG itself is stateless with respect to tenant identity. All tenant filtering happens in the storage layer. So instead of one instance per tenant, there's a single shared instance, and storage classes resolve tenant_id dynamically from the current async task's context.

API request for "hexa.com"
    |
    +-- get_tenant_context("hexa.com")      # pure string op, cached
    |       -> TenantContext(mongo="tenant_hexa_com", neo4j="hexa_com")
    |
    +-- set_tenant(ctx)                      # bind to current async task
    |
    +-- get_shared_rag_instance()            # return the one singleton
    |
    +-- rag.aquery(...)
            |
            +-- Neo4jStorage.tenant_id -> get_tenant() -> "hexa_com"
            |       -> WHERE n.tenantId = "hexa_com"
            |
            +-- MongoStorage.tenant_id -> get_tenant() -> "tenant_hexa_com"
                    -> {"tenantId": "tenant_hexa_com"}

Components¶

TenantContext (ixinfra/tenant_context.py): A frozen dataclass held in a ContextVar, which provides task-local tenant identity in async code. Each asyncio.Task gets its own copy of the context. It lives in ixinfra because it's a pure dataclass with zero storage imports.

@dataclass(frozen=True, slots=True)
class TenantContext:
    company_name: str
    mongo_tenant_id: str   # e.g., "tenant_hexa_com"
    neo4j_tenant_id: str   # e.g., "hexa_com"

Singleton Factory (ixrag/lightrag/rag_instance_manager.py): Two code paths:

Path	Function	Use Case
API server	`get_shared_rag_instance()`	Returns the single shared instance. Double-checked locking.
CLI / doc loader	`get_rag_instance(working_dir, key)`	Per-tenant instances for write operations (document indexing).
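
Double-checked locking, as mentioned for the API server path, follows the shape below. This is a generic sketch using `threading.Lock`; whether the real manager uses a threading lock or an asyncio lock is not stated here, and `SharedInstance` is a hypothetical name.

```python
import threading


class SharedInstance:
    """Create one expensive object lazily, even under concurrent first calls."""

    _instance = None
    _lock = threading.Lock()
    init_calls = 0

    @classmethod
    def get(cls):
        if cls._instance is None:  # first check: lock-free fast path
            with cls._lock:
                if cls._instance is None:  # second check: another caller may have won
                    cls.init_calls += 1
                    cls._instance = object()  # stands in for expensive initialization
        return cls._instance


first = SharedInstance.get()
second = SharedInstance.get()
print(first is second, SharedInstance.init_calls)
```

The unlocked first check keeps the common case cheap once the instance exists, while the second check inside the lock prevents two callers from both initializing it.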

get_tenant_context() replaces old DB round-trips: previously computing a tenant ID required opening a synchronous Neo4j connection just to sanitize a string. Now it's a pure in-process string operation, cached in _tenant_context_cache.

Dual-Mode tenant_id Property: Both TenantAwareNeo4JStorage and BaseTenantAwareStorage (MongoDB) resolve tenant identity dynamically:

@property
def tenant_id(self) -> str:
    ctx = get_tenant()           # check contextvar
    if ctx is not None:
        return ctx.neo4j_tenant_id  # API server: contextvar wins
    return self._tenant_id          # CLI/test: instance attribute fallback

Write Guard: The shared singleton is read-only. upsert_node and upsert_edge in TenantAwareNeo4JStorage raise RuntimeError when the contextvar tenant does not match the instance tenant, which is "__shared__" for the singleton, so any write attempted through the shared instance fails.

Neo4j Driver Manager (ixneo4j/driver_manager.py): A ref-counted singleton AsyncDriver. Even with multiple LightRAG instances (CLI path), there's only one Neo4j connection pool per process.
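
Reference counting for a shared resource follows a simple acquire/release pattern. The sketch below is synchronous and generic: `RefCounted`, its `factory`, and its `closer` are hypothetical, whereas the real manager wraps a Neo4j `AsyncDriver`.

```python
import threading
from typing import Any, Callable


class RefCounted:
    """Share one resource among many users; close it when the last one leaves."""

    def __init__(self, factory: Callable[[], Any], closer: Callable[[Any], None]) -> None:
        self._factory = factory
        self._closer = closer
        self._lock = threading.Lock()
        self._count = 0
        self._resource: Any = None

    def acquire(self) -> Any:
        with self._lock:
            if self._count == 0:
                self._resource = self._factory()  # first user creates it
            self._count += 1
            return self._resource

    def release(self) -> None:
        with self._lock:
            if self._count == 0:
                raise RuntimeError("release() called more often than acquire()")
            self._count -= 1
            if self._count == 0:
                self._closer(self._resource)  # last user closes it
                self._resource = None


events: list[str] = []
pool = RefCounted(
    factory=lambda: events.append("open") or "driver",
    closer=lambda resource: events.append("close"),
)
first, second = pool.acquire(), pool.acquire()
pool.release()
pool.release()
print(events)
```

Two acquisitions open the resource once, and it is closed only after the second release.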

Service Orchestration (ixchat/service.py): IXChatbotService ties everything together — calls get_tenant_context(), set_tenant(), then get_shared_rag_instance(). The eviction system only evicts lightweight per-tenant chatbot wrappers; the shared singleton is never evicted.

Tenant Isolation¶

Both Neo4j and MongoDB enforce isolation at every query:

Neo4j: All MATCH queries inject WHERE n.tenantId = $_tenant_id via _add_tenant_filter_to_query(). Write operations buffer nodes/edges then flush with UNWIND MERGE ... tenantId = $tenant_id.
MongoDB: All queries add {"tenantId": self.tenant_id} filter. Document IDs are composite ("tenant_hexa_com:original-chunk-id") to prevent collisions across tenants.
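
On the MongoDB side, the two mechanisms amount to a filter merge and an ID prefix. The helper names below are hypothetical; only the `tenantId` field and the `tenant:original-id` format come from the description above.

```python
def with_tenant_filter(query: dict, tenant_id: str) -> dict:
    """Return a copy of the query that is restricted to one tenant."""
    return {**query, "tenantId": tenant_id}


def composite_id(tenant_id: str, original_id: str) -> str:
    """Prefix a document ID with its tenant so IDs cannot collide across tenants."""
    return f"{tenant_id}:{original_id}"


query = with_tenant_filter({"file_path": "pricing.md"}, "tenant_hexa_com")
doc_id = composite_id("tenant_hexa_com", "original-chunk-id")
print(query, doc_id)
```

Because the tenant filter is applied last, a caller-supplied `tenantId` in the query cannot override the tenant resolved for the request.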

Key Files¶

File	Description
`ixinfra/tenant_context.py`	`TenantContext` dataclass + `ContextVar`
`ixrag/lightrag/rag_instance_manager.py`	Shared singleton + per-tenant factory
`ixneo4j/tenant_storage.py`	Dual-mode `tenant_id`, write guard
`ixmongo/tenant_storage.py`	Dual-mode `tenant_id` for MongoDB
`ixneo4j/driver_manager.py`	Ref-counted Neo4j `AsyncDriver` singleton
`ixchat/service.py`	`IXChatbotService` orchestration + eviction

Data Consistency¶

The package includes tools for monitoring and maintaining consistency between MongoDB and Neo4j storage backends. See the ixrag/lightrag/ directory for:

cli_consistency_check.py - Check data consistency
reconcile_entities.py - Sync missing entities
diagnose_entity_mapping.py - Diagnose entity name issues