IXRag Package¶
RAG (Retrieval-Augmented Generation) with LightRAG integration, document limiting, and reranking.
Overview¶
The ixrag package provides the retrieval layer for the Rose platform. It integrates with LightRAG to provide hybrid retrieval combining:
- Text Chunks: Traditional vector similarity search
- Entities: Knowledge graph entities extracted from documents
- Relationships: Connections between entities in the knowledge graph
Key Components¶
- LangGraph Retriever: Main retrieval orchestration with reranking
- Document Limiter: Intelligent document allocation and limiting
- Document Processor: Converts LightRAG responses to LangChain Documents
- Multi-Tenant Support: Shared singleton with per-request tenant isolation via contextvars
Reranking & Limiting Configuration¶
Configuration is defined in environment config files (e.g., development.toml, staging.toml, production.toml).
Config Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| mode | string | "mix" | RAG mode: global, local, hybrid, mix |
| rerank_enabled | bool | false | Enable Cohere/Jina reranking |
| limiter_enabled | bool | true | Enable type-based limiting (fallback when reranking disabled) |
| rerank_top_k | int | 20 | Max documents returned (used by both reranking and limiting) |
| rerank_provider | string | "cohere" | Reranking provider: cohere or jina |
| rerank_model | string | "rerank-v3.5" | Model name for reranking |
| relationships_enabled | bool | true | Include relationship documents in results |
| graph_top_k | int | 60 | Initial retrieval limit for entities/relationships |
| chunk_top_k | int | 20 | Initial retrieval limit for chunks |
Example Configuration¶
[lightrag]
mode = "mix"
rerank_enabled = true
limiter_enabled = true
rerank_top_k = 12
rerank_provider = "cohere"
rerank_model = "rerank-v3.5"
relationships_enabled = true
graph_top_k = 10
chunk_top_k = 15
Behavior Matrix¶
| rerank_enabled | limiter_enabled | Behavior |
|---|---|---|
| true | - | Cohere/Jina reranks all documents together, returns top N by relevance |
| false | true | Type-based allocation by RAG mode (see below) |
| false | false | All documents returned without limiting |
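The matrix reduces to a small decision function. The sketch below is illustrative only: `select_strategy` and the returned labels are hypothetical names, not identifiers from the ixrag code.

```python
def select_strategy(rerank_enabled: bool, limiter_enabled: bool) -> str:
    """Map the two config flags to the behaviour described in the matrix."""
    if rerank_enabled:
        # Reranking takes precedence; limiter_enabled is ignored in this case.
        return "rerank"
    if limiter_enabled:
        # Fallback: type-based allocation by RAG mode.
        return "type_based_limit"
    # Neither enabled: every retrieved document is returned.
    return "passthrough"


print(select_strategy(rerank_enabled=False, limiter_enabled=True))
```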
Type-Based Allocation¶
When rerank_enabled=false and limiter_enabled=true, documents are allocated by type based on the RAG mode:
| Mode | Chunks | Entities | Relationships |
|---|---|---|---|
| global | 30% | 35% | 35% |
| local | 60% | 20% | 20% |
| hybrid / mix | 30% | 30% | 40% |
The allocation percentages determine how the rerank_top_k budget is distributed across document types.
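As a rough illustration of how a budget could be split by these percentages, here is a minimal sketch. The function name `allocate_budget` and the rounding policy (floor each share, then give leftover slots to the largest shares) are assumptions; the actual limiter in document_limiter.py may round differently.

```python
# Percent shares per RAG mode, copied from the allocation table.
ALLOCATION = {
    "global": {"chunk": 30, "entity": 35, "relationship": 35},
    "local": {"chunk": 60, "entity": 20, "relationship": 20},
    "hybrid": {"chunk": 30, "entity": 30, "relationship": 40},
    "mix": {"chunk": 30, "entity": 30, "relationship": 40},
}


def allocate_budget(mode: str, top_k: int) -> dict[str, int]:
    """Split a top_k budget across document types for the given mode."""
    shares = ALLOCATION[mode]
    # Floor each share using integer arithmetic (assumed rounding policy).
    slots = {doc_type: top_k * pct // 100 for doc_type, pct in shares.items()}
    leftover = top_k - sum(slots.values())
    # Hand any leftover slots to the largest shares first so the total is top_k.
    for doc_type in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        slots[doc_type] += 1
    return slots


print(allocate_budget("local", 20))
print(allocate_budget("mix", 12))
```

With the default rerank_top_k of 20 in local mode, this yields 12 chunks, 4 entities, and 4 relationships.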
Chunking Strategy¶
Knowledge base documents are split into chunks before they are embedded and indexed in Mongo (chunks_vdb) and Neo4j. The chunker that runs during ingestion is configured globally on every LightRAG instance and is the same one for both the API server's shared singleton and the document loader's per-tenant instances.
Default: heading-aware markdown chunking¶
The default chunker is markdown_aware_chunking (ixrag/lightrag/markdown_chunking.py). It replaces LightRAG's plain token-based splitter with markdown-structure awareness.
Behaviour summary:
- Heading-bounded sections. A markdown heading (# … ######) ends the current section and starts a new one. The heading hierarchy is tracked as a stack: a level-N heading pops the stack until the top is shallower than N, then pushes the new heading.
- Heading-context prefix. Each chunk is prepended with [Section: H1 > H2 > …]\n\n so heading words participate in embedding similarity. Without this, a chunk that starts mid-section has no topical signal.
- Long-section subdivision. When a section's body exceeds chunk_token_size, it is windowed with overlap (same math as upstream chunking_by_token_size). The prefix is repeated on every sub-chunk and the window size is shrunk by the prefix cost so the final chunk stays under the limit.
- Code-fence safety. Lines inside ```/~~~ blocks are never treated as headings.
- Fallback to upstream. If the content has no markdown headings, or the caller passed an explicit split_by_character, the chunker delegates to chunking_by_token_size. Plain-text docs behave exactly as before.
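The heading-stack rule in the first bullet can be sketched in a few lines. This is a simplified illustration, not the code in markdown_chunking.py; `push_heading` is a hypothetical helper name.

```python
def push_heading(stack: list[tuple[int, str]], level: int, text: str) -> list[tuple[int, str]]:
    """Update the heading stack in place for a newly seen level-N heading."""
    # Pop until the top of the stack is shallower than the new heading.
    while stack and stack[-1][0] >= level:
        stack.pop()
    stack.append((level, text))
    return stack


stack: list[tuple[int, str]] = []
push_heading(stack, 1, "Guide")
push_heading(stack, 2, "Install")
push_heading(stack, 3, "Linux")
# A new h2 closes both "Linux" (h3) and "Install" (h2) before being pushed.
push_heading(stack, 2, "Usage")
print(stack)
```

After the last call the stack holds only "Guide" and "Usage", which is the chain a chunk under "Usage" would carry in its prefix.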
Anti-fragmentation guards¶
Pure heading splitting over-fragments on real-world KB docs (h4+ structural noise, near-empty sections). Two constants in markdown_chunking.py keep the output sane:
| Constant | Default | Purpose |
|---|---|---|
| MAX_SPLIT_DEPTH | 3 | Headings deeper than h3 (####+) stay as plain body lines instead of starting a new chunk. Headings below h3 tend to be implementation noise (param tables, sub-notes). |
| MIN_SECTION_TOKENS | 16 | Sections whose body is shorter than 16 tokens back-merge into the previous chunk. The merged section's heading is preserved inline as a ### text body line so its context is not lost. The surviving chunk's prefix collapses to the longest common ancestor of the two heading chains. |
Tuning the chunker¶
The chunker is registered in two places — both LightRAG(...) constructor sites pass chunking_func=markdown_aware_chunking:
| File | Path used by |
|---|---|
| ixrag/lightrag/rag_instance_manager.py (_create_rag_instance) | API server's shared singleton + legacy per-tenant CLI path |
| ixrag/document_pipeline/storage_manager.py (DocumentProcessor._initialize_rag) | rose-document-loader ingestion path |
To change the chunking method:
1. Tweak existing thresholds. Adjust MAX_SPLIT_DEPTH or MIN_SECTION_TOKENS at the top of markdown_chunking.py. Re-run poetry run pytest packages/ixrag/tests/test_markdown_chunking.py -v and the KB sample sweep (see below) before merging.
2. Swap the algorithm entirely. Write a new function with the same signature:

       def my_chunker(
           tokenizer: Tokenizer,
           content: str,
           split_by_character: str | None = None,
           split_by_character_only: bool = False,
           chunk_overlap_token_size: int = 100,
           chunk_token_size: int = 512,
       ) -> list[dict[str, Any]]:
           # return [{"content": str, "tokens": int, "chunk_order_index": int}, ...]
           ...

   Then replace chunking_func=markdown_aware_chunking with chunking_func=my_chunker in both call sites above. Forgetting one causes the API server and the loader to disagree on chunk shape, which silently corrupts retrieval until the next full re-ingest.
3. Revert to upstream behaviour. Remove the chunking_func=… kwarg from both call sites; LightRAG falls back to its default chunking_by_token_size.
After any change, force-reingest a representative tenant to refresh the chunks:
cd backend
rose-document-loader update-tenant <tenant>.com --env test \
--force-update-docs <doc-id>
The chunks_vdb collection in Mongo for that tenant + env will be repopulated with the new chunk shapes. See rose-document-loader for the full CLI reference.
Validating the chunker on real KB samples¶
When tuning thresholds, sanity-check the effect on multiple KBs at once. Existing *-kb-dump directories under data/ in sibling Conductor workspaces are good fixtures. A small script:
from pathlib import Path
from lightrag.utils import TiktokenTokenizer
from ixrag.lightrag.markdown_chunking import markdown_aware_chunking
tok = TiktokenTokenizer()
for kb in [...]:  # paths to kb-dump dirs
    for f in Path(kb).rglob("*.md"):
        text = f.read_text()
        # strip frontmatter (--- ... ---) before chunking
        chunks = markdown_aware_chunking(
            tok, text, chunk_token_size=512, chunk_overlap_token_size=100
        )
        # inspect len(chunks), token-size buckets, tiny-chunk ratio
Two metrics that matter:
- Chunk count vs. file count — should grow with structural density, not explode (target ≤4× chunks per file on average).
- Sub-32-token chunk ratio — orphan tiny chunks pollute retrieval. Target <5%.
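Both metrics can be computed from the chunk dicts the chunker returns. The helper below is a sketch under stated assumptions: `chunk_metrics` is a hypothetical name, and its input is assumed to be one list of chunk dicts per file, each dict carrying the "tokens" key from the chunker's return shape.

```python
def chunk_metrics(chunks_per_file: list[list[dict]], tiny_threshold: int = 32) -> dict[str, float]:
    """Compute average chunks per file and the share of chunks below tiny_threshold tokens."""
    all_chunks = [chunk for chunks in chunks_per_file for chunk in chunks]
    tiny = sum(1 for chunk in all_chunks if chunk["tokens"] < tiny_threshold)
    return {
        # max(..., 1) avoids division by zero on an empty sweep.
        "avg_chunks_per_file": len(all_chunks) / max(len(chunks_per_file), 1),
        "tiny_chunk_ratio": tiny / max(len(all_chunks), 1),
    }


# Two files with two chunks each; one of the four chunks is under 32 tokens.
sample = [
    [{"tokens": 100}, {"tokens": 20}],
    [{"tokens": 300}, {"tokens": 250}],
]
metrics = chunk_metrics(sample)
print(metrics)
```

In this sample the tiny-chunk ratio is 25%, well above the 5% target, which would suggest raising MIN_SECTION_TOKENS or lowering MAX_SPLIT_DEPTH.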
Method reference¶
All in backend/packages/ixrag/ixrag/lightrag/markdown_chunking.py. Pipeline: parse → merge → emit. Only markdown_aware_chunking is exported.
Constants
| Name | Value | Purpose |
|---|---|---|
| _HEADING_RE | r"^(#{1,6})\s+(.+?)\s*#*\s*$" | ATX heading matcher, strips trailing closing hashes. |
| _CODE_FENCE_RE | r"^\s*(```\|~~~)" | Fence delimiter, toggles in_fence. |
| MAX_SPLIT_DEPTH | 3 | Headings deeper stay inline as body. |
| MIN_SECTION_TOKENS | 16 | Sections smaller back-merge into previous. |
_Section — dataclass with heading_chain: list[(level, text)] (hierarchy snapshot, empty for preamble) and body_lines: list[str].
_parse_sections(content, max_split_depth) -> list[_Section] — single-pass walker. Tracks fence state + heading stack. Headings inside fences or deeper than max_split_depth stay as body. On a split heading: pops stack until top level < N, pushes new, starts fresh section. Returns splits + 1 sections.
_format_prefix(heading_chain) -> str — [(1,"A"),(3,"B")] → "[Section: A > B]\n\n". Levels dropped, text only.
_common_prefix(a, b) — longest shared head of two chains. Used when back-merging collapses the surviving chain to the common ancestor so the prefix doesn't claim a heading it no longer owns.
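The two prefix helpers are small enough to sketch directly from their descriptions. The versions below are illustrative re-implementations, not the module's source; returning an empty string for an empty chain (the preamble case) is an assumption.

```python
HeadingChain = list[tuple[int, str]]


def format_prefix(chain: HeadingChain) -> str:
    """Render a heading chain as the chunk prefix; levels are dropped, text only."""
    if not chain:
        # Assumed behaviour for preamble sections, which have no headings.
        return ""
    return "[Section: " + " > ".join(text for _, text in chain) + "]\n\n"


def common_prefix(a: HeadingChain, b: HeadingChain) -> HeadingChain:
    """Return the longest shared head of two heading chains."""
    shared: HeadingChain = []
    for left, right in zip(a, b):
        if left != right:
            break
        shared.append(left)
    return shared


survivor = common_prefix([(1, "A"), (2, "B")], [(1, "A"), (2, "C")])
print(repr(format_prefix(survivor)))
```

When section "A > C" back-merges into "A > B", the surviving chunk is labelled with just "A", since it now holds content from both subsections.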
_merge_undersized(sections, tokenizer, min_section_tokens) — folds tiny sections into prev. Per section: skip empty bodies; if body_tokens < min and merged non-empty → append blank line + reconstructed # heading + body to prev, reduce prev's chain via _common_prefix. Otherwise append. First section can't back-merge.
_has_markdown_headings(content) -> bool — cheap pre-flight, scans for first non-fenced heading. Lets the entrypoint short-circuit on plain text.
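A fence-aware pre-flight check along these lines might look as follows. This is a sketch, not the module's code: it reuses the two patterns from the constants table (the fence pattern is written with {3} quantifiers, which matches the same strings) and simplifies fence handling by toggling on any fence delimiter.

```python
import re

# Same ATX heading pattern as _HEADING_RE in the constants table.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
# Equivalent to _CODE_FENCE_RE: three backticks or three tildes after optional indent.
CODE_FENCE_RE = re.compile(r"^\s*(`{3}|~{3})")


def has_markdown_headings(content: str) -> bool:
    """Return True as soon as a heading outside a code fence is found."""
    in_fence = False
    for line in content.splitlines():
        if CODE_FENCE_RE.match(line):
            # Simplification: any fence delimiter toggles the fence state.
            in_fence = not in_fence
            continue
        if not in_fence and HEADING_RE.match(line):
            return True
    return False


print(has_markdown_headings("# Title\nbody"))
print(has_markdown_headings("~~~\n# comment in code\n~~~\nplain text"))
```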
markdown_aware_chunking(...) — public entrypoint, matches LightRAG chunking_func signature:
def markdown_aware_chunking(
    tokenizer: Tokenizer,
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    chunk_overlap_token_size: int = 100,
    chunk_token_size: int = 512,
) -> list[dict[str, Any]]:
    ...
Returns [{"content", "tokens", "chunk_order_index"}, ...]. Control flow:
- split_by_character set → delegate to chunking_by_token_size.
- No headings → delegate to chunking_by_token_size.
- Otherwise → _parse_sections → _merge_undersized → emit.
Emit per section: window = max(1, chunk_token_size - prefix_tokens). Short body → one chunk. Long body → slide window with step = window - overlap, decode each slice, repeat prefix, break when slice reaches end.
Note: tokens in the returned dict is recomputed via tokenizer.encode(chunk_content) to reflect the final string including prefix, not just the body slice length.
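The emit step can be sketched with a toy tokenizer. Everything here is illustrative: `WhitespaceTokenizer` and `emit_section` are hypothetical names, the real tokenizer is LightRAG's, the `max(1, ...)` guard on the step is an added assumption, and chunk_order_index is omitted because it is assigned across the whole document.

```python
class WhitespaceTokenizer:
    """Toy tokenizer for illustration: one token per whitespace-separated word."""

    def encode(self, text: str) -> list[str]:
        return text.split()

    def decode(self, tokens: list[str]) -> str:
        return " ".join(tokens)


def emit_section(tokenizer, prefix: str, body: str, chunk_token_size: int, overlap: int) -> list[dict]:
    """Emit one or more prefixed chunks for a single section body."""
    prefix_tokens = len(tokenizer.encode(prefix))
    # Shrink the window by the prefix cost so prefix + slice fits the limit.
    window = max(1, chunk_token_size - prefix_tokens)
    body_tokens = tokenizer.encode(body)
    if len(body_tokens) <= window:
        slices = [body_tokens]
    else:
        step = max(1, window - overlap)  # guard against a non-positive step (assumption)
        slices = []
        for start in range(0, len(body_tokens), step):
            slices.append(body_tokens[start:start + window])
            if start + window >= len(body_tokens):
                break  # this slice already reaches the end of the body
    chunks = []
    for piece in slices:
        content = prefix + tokenizer.decode(piece)
        # tokens is recomputed on the final string, prefix included.
        chunks.append({"content": content, "tokens": len(tokenizer.encode(content))})
    return chunks


tok = WhitespaceTokenizer()
body = " ".join(f"w{i}" for i in range(10))
chunks = emit_section(tok, "[Section: A]\n\n", body, chunk_token_size=6, overlap=1)
print([c["content"] for c in chunks])
```

With a 2-token prefix and a limit of 6, the window is 4 tokens and the step is 3, so the 10-word body yields three chunks that each overlap the previous one by one word.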
Technical Details¶
Why We Call Cohere Ourselves¶
LightRAG has built-in reranking support, but it's bypassed when using only_need_context=True (which we use to get raw context without LLM generation). The retrieval flow is:
- LightRAG Query: Call with only_need_context=True to get raw documents
- Document Processing: Convert LightRAG response to LangChain Documents
- Reranking (if enabled): Call Cohere/Jina to rerank all documents by query relevance
- Limiting (fallback): Apply type-based allocation if reranking unavailable
This approach gives us:
- Full control over the reranking process
- Unified reranking across all document types (chunks, entities, relationships)
- Ability to use the latest Cohere/Jina models
Document Types¶
Each document returned has a document_type in its metadata:
| Type | Source | Description |
|---|---|---|
| chunk | Vector search | Text chunks from indexed documents |
| entity | Knowledge graph | Extracted entities (people, companies, concepts) |
| relationship | Knowledge graph | Connections between entities |
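When debugging a retrieval result, it can help to count how many documents of each type came back. The sketch below uses a minimal stand-in `Doc` class (hypothetical) with the same page_content/metadata shape as a LangChain Document, so it runs without any dependencies.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the page_content/metadata shape of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_type(docs: list[Doc]) -> Counter:
    """Count documents per document_type; missing types are bucketed as 'unknown'."""
    return Counter(doc.metadata.get("document_type", "unknown") for doc in docs)


docs = [
    Doc("Pricing starts at ...", {"document_type": "chunk"}),
    Doc("Hexa", {"document_type": "entity"}),
    Doc("Hexa -> offers -> Pro plan", {"document_type": "relationship"}),
    Doc("Plans are billed monthly ...", {"document_type": "chunk"}),
]
counts = count_by_type(docs)
print(counts)
```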
Key Files¶
| File | Description |
|---|---|
| ixrag/lightrag/langgraph_retriever.py | Main retriever with reranking logic |
| ixrag/lightrag/document_limiter.py | Document limiting and allocation strategies |
| ixrag/lightrag/document_processor.py | Converts LightRAG responses to Documents |
| ixrag/lightrag/lightrag_llm.py | LLM and reranking function factories |
| ixrag/lightrag/rag_instance_manager.py | Shared singleton + per-tenant instance management |
Integration Points¶
| System | Purpose |
|---|---|
| LightRAG | Hybrid retrieval (vector + graph) |
| MongoDB | Vector storage backend |
| Neo4j | Graph storage backend |
| Cohere/Jina | Document reranking |
| LangFuse | Observability and tracing |
Usage¶
The retriever is typically accessed through ixchat, but can be used directly:
import asyncio

from ixrag.lightrag.langgraph_retriever import LangGraphRetriever


async def main() -> None:
    # Create retriever
    retriever = LangGraphRetriever(
        site_name="example-site",
        rag_mode="mix",
    )
    # Retrieve documents (ainvoke is a coroutine, so it must be awaited)
    documents = await retriever.ainvoke("What are your pricing plans?")
    # Each document has:
    # - page_content: The text content
    # - metadata: {document_type, source, rerank_score (if reranked)}


asyncio.run(main())
Shared Singleton Architecture (IX-1578)¶
Problem¶
Before IX-1578, every tenant got its own LightRAG instance. LightRAG initialization is expensive: it creates Neo4j drivers, MongoDB clients, loads embedding functions, and calls initialize_storages(). With more clients onboarding:
- Slow cold-start: Each new tenant required full LightRAG init, adding seconds to TTFT on the first request.
- Connection explosion: N tenants = N sets of connection pools.
- Memory pressure: N large LightRAG objects cached in memory.
Solution¶
One shared LightRAG instance, with tenant identity injected per-request via ContextVar.
The key insight: LightRAG itself is stateless with respect to tenant identity. All tenant filtering happens in the storage layer. So instead of one instance per tenant, there's a single shared instance, and storage classes resolve tenant_id dynamically from the current async task's context.
API request for "hexa.com"
  |
  +-- get_tenant_context("hexa.com")     # pure string op, cached
  |     -> TenantContext(mongo="tenant_hexa_com", neo4j="hexa_com")
  |
  +-- set_tenant(ctx)                    # bind to current async task
  |
  +-- get_shared_rag_instance()          # return the one singleton
  |
  +-- rag.aquery(...)
        |
        +-- Neo4jStorage.tenant_id -> get_tenant() -> "hexa_com"
        |     -> WHERE n.tenantId = "hexa_com"
        |
        +-- MongoStorage.tenant_id -> get_tenant() -> "tenant_hexa_com"
              -> {"tenantId": "tenant_hexa_com"}
Components¶
TenantContext (ixinfra/tenant_context.py): A frozen dataclass held in a ContextVar, providing task-local tenant identity in async code. Each asyncio.Task gets its own copy of the context, so one request's tenant never leaks into another. Lives in ixinfra because it's a pure dataclass with zero storage imports.
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class TenantContext:
    company_name: str
    mongo_tenant_id: str  # e.g., "tenant_hexa_com"
    neo4j_tenant_id: str  # e.g., "hexa_com"
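To see why a ContextVar gives per-request isolation, the following self-contained sketch runs two concurrent tasks that each bind a different tenant. It is an illustration only: the names mirror the document's set_tenant/get_tenant, but the bodies and the slug derivation are assumptions rather than the ixinfra implementation.

```python
import asyncio
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantContext:
    company_name: str
    mongo_tenant_id: str
    neo4j_tenant_id: str


_current_tenant: ContextVar[Optional[TenantContext]] = ContextVar("current_tenant", default=None)


def set_tenant(ctx: TenantContext) -> None:
    _current_tenant.set(ctx)


def get_tenant() -> Optional[TenantContext]:
    return _current_tenant.get()


async def handle_request(company: str, delay: float) -> str:
    slug = company.replace(".", "_")  # hypothetical sanitisation, for the demo only
    set_tenant(TenantContext(company, f"tenant_{slug}", slug))
    # Yield to the event loop so the other task sets its own tenant in between.
    await asyncio.sleep(delay)
    return get_tenant().neo4j_tenant_id


async def main() -> list[str]:
    # gather() wraps each coroutine in a Task, and each Task copies the context.
    return await asyncio.gather(
        handle_request("hexa.com", 0.02),
        handle_request("acme.io", 0.01),
    )


results = asyncio.run(main())
print(results)
```

Each task reads back its own tenant even though the other task called set_tenant while it was suspended, and the value set inside the tasks does not leak into the caller's context.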
Singleton Factory (ixrag/lightrag/rag_instance_manager.py): Two code paths:
| Path | Function | Use Case |
|---|---|---|
| API server | get_shared_rag_instance() | Returns the single shared instance. Double-checked locking. |
| CLI / doc loader | get_rag_instance(working_dir, key) | Per-tenant instances for write operations (document indexing). |
get_tenant_context() replaces old DB round-trips: previously computing a tenant ID required opening a synchronous Neo4j connection just to sanitize a string. Now it's a pure in-process string operation, cached in _tenant_context_cache.
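A cached, pure-string derivation could look like the sketch below. The sanitisation rule (lowercase, non-alphanumerics collapsed to underscores) is an assumption chosen to reproduce the "hexa.com" example, and functools.lru_cache stands in for the module's own _tenant_context_cache; `get_tenant_ids` is a hypothetical name.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_tenant_ids(company_name: str) -> tuple[str, str]:
    """Return (mongo_tenant_id, neo4j_tenant_id) without touching any database."""
    # Assumed rule: lowercase, then collapse runs of non-alphanumerics to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", company_name.lower()).strip("_")
    return f"tenant_{slug}", slug


print(get_tenant_ids("hexa.com"))
get_tenant_ids("hexa.com")  # second call is served from the cache
print(get_tenant_ids.cache_info().hits)
```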
Dual-Mode tenant_id Property: Both TenantAwareNeo4JStorage and BaseTenantAwareStorage (MongoDB) resolve tenant identity dynamically:
@property
def tenant_id(self) -> str:
    ctx = get_tenant()  # check contextvar
    if ctx is not None:
        return ctx.neo4j_tenant_id  # API server: contextvar wins
    return self._tenant_id  # CLI/test: instance attribute fallback
Write Guard: The shared singleton is read-only. If a write is attempted through it, upsert_node and upsert_edge in TenantAwareNeo4JStorage raise RuntimeError when the contextvar tenant mismatches the instance tenant ("__shared__").
Neo4j Driver Manager (ixneo4j/driver_manager.py): A ref-counted singleton AsyncDriver. Even with multiple LightRAG instances (CLI path), there's only one Neo4j connection pool per process.
Service Orchestration (ixchat/service.py): IXChatbotService ties everything together — calls get_tenant_context(), set_tenant(), then get_shared_rag_instance(). The eviction system only evicts lightweight per-tenant chatbot wrappers; the shared singleton is never evicted.
Tenant Isolation¶
Both Neo4j and MongoDB enforce isolation at every query:
- Neo4j: All MATCH queries inject WHERE n.tenantId = $_tenant_id via _add_tenant_filter_to_query(). Write operations buffer nodes/edges then flush with UNWIND MERGE ... tenantId = $tenant_id.
- MongoDB: All queries add a {"tenantId": self.tenant_id} filter. Document IDs are composite ("tenant_hexa_com:original-chunk-id") to prevent collisions across tenants.
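The MongoDB side of this can be illustrated with two small helpers. These are sketches with hypothetical names (`composite_id`, `with_tenant_filter`), not functions from ixmongo; they only show the shape of the composite ID and of the filter merge.

```python
def composite_id(tenant_id: str, doc_id: str) -> str:
    """Prefix a document ID with its tenant so IDs cannot collide across tenants."""
    return f"{tenant_id}:{doc_id}"


def with_tenant_filter(query: dict, tenant_id: str) -> dict:
    """Return a copy of the query with the tenant filter added."""
    # The tenant key is written last so a caller-supplied tenantId cannot override it.
    return {**query, "tenantId": tenant_id}


user_query = {"status": "processed"}
scoped = with_tenant_filter(user_query, "tenant_hexa_com")
print(composite_id("tenant_hexa_com", "original-chunk-id"))
print(scoped)
```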
Key Files¶
| File | Description |
|---|---|
| ixinfra/tenant_context.py | TenantContext dataclass + ContextVar |
| ixrag/lightrag/rag_instance_manager.py | Shared singleton + per-tenant factory |
| ixneo4j/tenant_storage.py | Dual-mode tenant_id, write guard |
| ixmongo/tenant_storage.py | Dual-mode tenant_id for MongoDB |
| ixneo4j/driver_manager.py | Ref-counted Neo4j AsyncDriver singleton |
| ixchat/service.py | IXChatbotService orchestration + eviction |
Data Consistency¶
The package includes tools for monitoring and maintaining consistency between MongoDB and Neo4j storage backends. See the ixrag/lightrag/ directory for:
- cli_consistency_check.py - Check data consistency
- reconcile_entities.py - Sync missing entities
- diagnose_entity_mapping.py - Diagnose entity name issues