rose-document-loader (CLI)¶
The rose-document-loader CLI tool processes documents and builds or updates the knowledge graph for RAG (Retrieval-Augmented Generation). It supports both incremental updates and full rebuilds.
Installation¶
Activate the Poetry shell in the backend directory (for example, by running poetry shell there). Once in the Poetry shell, the rose-document-loader command is available.
Usage¶
Environments¶
| Environment | Description |
|---|---|
| test | Test environment |
| staging | Staging environment |
| production | Production environment |
Operation Modes¶
The CLI supports four mutually exclusive operation modes:
| Mode | Description |
|---|---|
| --update-tenant <tenant> | Smart update for single tenant (safe, incremental) |
| --update-all | Smart update for ALL tenants in Supabase |
| --reprocess-tenant <tenant> | Rebuild single tenant from scratch (DELETES graph) |
| --reprocess-all | Rebuild ALL tenants from scratch (DELETES all graphs) |
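The mutual exclusivity of the four modes can be pictured with a small argparse sketch. This is an illustration only, not the tool's actual parser: the `build_parser` helper is hypothetical, and whether the real CLI uses argparse at all is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the modes table (not the real implementation)."""
    parser = argparse.ArgumentParser(prog="rose-document-loader")
    parser.add_argument(
        "--env", required=True, choices=["test", "staging", "production"]
    )
    # argparse rejects any command line that combines two options of one group.
    modes = parser.add_mutually_exclusive_group()
    modes.add_argument("--update-tenant", metavar="TENANT")
    modes.add_argument("--update-all", action="store_true")
    modes.add_argument("--reprocess-tenant", metavar="TENANT")
    modes.add_argument("--reprocess-all", action="store_true")
    return parser
```

With this sketch, passing `--update-all` together with `--reprocess-all` ends with a usage error (exit code 2) instead of running both modes.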
Command Reference¶
Smart Update (Single Tenant)¶
Safe, incremental update that detects changes via SHA1 hashing. Creates the knowledge base if it doesn't exist.
rose-document-loader --env staging --update-tenant example.com
With explicit source:
# Local directory
rose-document-loader --env staging --update-tenant example.com /path/to/docs
# Google Drive folder
rose-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
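The SHA1-based change detection behind the smart update can be sketched as follows. This is a minimal illustration under stated assumptions: the `classify` helper and the shape of the stored hashes (a mapping from document name to hex digest) are hypothetical, and the real loader may track state differently.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Return the hex SHA1 digest of a document's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def classify(current: dict[str, bytes], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Sort documents into new / updated / unchanged by comparing digests.

    `stored_hashes` stands in for whatever the loader persisted on its
    previous run (an assumption made for this sketch).
    """
    result: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": []}
    for name, content in sorted(current.items()):
        digest = sha1_of(content)
        if name not in stored_hashes:
            result["new"].append(name)
        elif stored_hashes[name] != digest:
            result["updated"].append(name)
        else:
            result["unchanged"].append(name)
    return result
```

Only documents in the new and updated buckets need re-ingestion, which is what makes the update incremental.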
Smart Update (All Tenants)¶
Batch processing for all tenants configured in Supabase.
rose-document-loader --env staging --update-all
Force Update Specific Documents¶
Delete and reinsert specific documents within a tenant.
# Single document
rose-document-loader --env staging --update-tenant example.com --force-update-docs unfacts
# Multiple documents (repeated flag)
rose-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2
# Multiple documents (comma-separated)
rose-document-loader --env staging --update-tenant example.com --force-update-docs "doc1,doc2,doc3"
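Both spellings shown above (repeated flag and comma-separated list) can be reduced to one list of document names. The sketch below shows one way to do that with argparse; `parse_force_update_docs` is a hypothetical helper, and the real CLI's parsing and de-duplication behaviour are assumptions.

```python
import argparse


def parse_force_update_docs(argv: list[str]) -> list[str]:
    """Collect document names from repeated and comma-separated flag values."""
    parser = argparse.ArgumentParser(add_help=False)
    # action="append" gathers one list entry per occurrence of the flag.
    parser.add_argument("--force-update-docs", action="append", default=[])
    args, _unknown = parser.parse_known_args(argv)

    names: list[str] = []
    for value in args.force_update_docs:
        for name in value.split(","):
            name = name.strip()
            # Drop empty fragments and duplicates while keeping first-seen order.
            if name and name not in names:
                names.append(name)
    return names
```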
Full Rebuild (Single Tenant)¶
Warning: This deletes the existing knowledge graph and rebuilds from scratch.
rose-document-loader --env staging --reprocess-tenant example.com
Full Rebuild (All Tenants)¶
Warning: This deletes ALL knowledge graphs and rebuilds from scratch.
rose-document-loader --env staging --reprocess-all
Options¶
| Option | Default | Description |
|---|---|---|
| --env | (required) | Environment to run in |
| --limit | None | Limit number of files to process (useful for testing) |
| --chunk-size | 512 | Token size for document chunking |
| --enable-validation | false | Enable tenant isolation validation checks |
| --ignore-timestamps | false | Reload all files, ignoring last run timestamp (Google Drive only) |
Examples¶
# Smart update single tenant (DEFAULT - safe, incremental)
rose-document-loader --env staging --update-tenant example.com
# Smart update with explicit Google Drive source
rose-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
# Smart update ALL tenants
rose-document-loader --env staging --update-all
# Force update single document
rose-document-loader --env staging --update-tenant example.com --force-update-docs unfacts
# Force update multiple documents
rose-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2
# Rebuild single tenant (NUCLEAR - deletes graph)
rose-document-loader --env staging --reprocess-tenant example.com
# Rebuild ALL tenants (NUCLEAR - deletes all graphs)
rose-document-loader --env staging --reprocess-all
# Limit processing for testing
rose-document-loader --env staging --update-tenant example.com --limit 10
Source Resolution¶
The document loader resolves sources in the following priority:
- Explicit argument: If a path or URL is provided as a positional argument
- Supabase configuration: Custom Google Drive URL from knowledge.google_drive_folder in site config
- Default folder structure: Subfolder in GDRIVE_KNOWLEDGE_BASE_URL matching the tenant ID
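The priority order above can be expressed as a short fallback chain. This is a sketch, not the loader's code: the `Source` dataclass, the `resolve_source` function, and the assumption that the site config is a nested dictionary with a `knowledge` key are all hypothetical. Locating the default subfolder by tenant ID would need a Google Drive lookup, which is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    kind: str                        # "explicit", "site_config", or "default"
    location: str                    # path or URL to read from
    subfolder: Optional[str] = None  # only set for the default folder structure


def resolve_source(
    tenant_id: str,
    explicit_source: Optional[str],
    site_config: dict,
    base_url: str,
) -> Source:
    """Pick the document source in the documented priority order."""
    # 1. A positional path or URL always wins.
    if explicit_source:
        return Source("explicit", explicit_source)
    # 2. Otherwise use a custom Drive folder from the tenant's site config.
    configured = (site_config.get("knowledge") or {}).get("google_drive_folder")
    if configured:
        return Source("site_config", configured)
    # 3. Fall back to the tenant-named subfolder of the base folder.
    return Source("default", base_url, subfolder=tenant_id)
```

Here `base_url` stands for the value of GDRIVE_KNOWLEDGE_BASE_URL.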
Progress Display¶
The CLI shows a live progress display during processing:
- Current operation status
- Elapsed time
- API call count
- Slow API call count
- Recent warnings (rate limiting, retries)
Notifications¶
Slack notifications are sent on completion or failure, including:
- Success/failure status
- Tenant ID and environment
- Elapsed time and API call count
- Document statistics (new, updated, unchanged, failed)
Log Files¶
All operations are logged to a file. The log file path is displayed when running any command.
Troubleshooting¶
"Google Drive Authentication Error"¶
Ensure:
- The GOOGLE_DRIVE_CREDENTIALS_PATH environment variable is set
- The credentials file exists and is a valid service account JSON key
- The service account has access to the Google Drive folder
Processing is slow¶
- Check for rate-limiting messages in the progress display
- Consider processing in smaller batches using --limit
- The dedicated Azure endpoint for processing (AZURE_OPENAI_ENDPOINT_PROCESSING) is used automatically
Interrupted processing¶
Processing can be safely interrupted with Ctrl+C. Progress is saved, and you can resume by running the same command again.
How documents are chunked¶
During ingestion the loader hands each document to LightRAG, which then calls the configured chunking_func. The default is markdown_aware_chunking — it splits on markdown headings (#, ##, ###), prepends [Section: …] context to each chunk, inlines deeper headings as body, and back-merges undersized sections to avoid orphan chunks.
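To make the behaviour described above concrete, here is a heavily simplified sketch of heading-aware chunking. It is not the markdown_aware_chunking implementation: the threshold values, the whitespace-based token count, and the exact `[Section: ...]` prefix format are assumptions, and the sketch ignores details such as headings inside code fences.

```python
import re

MAX_SPLIT_DEPTH = 3      # split on #, ##, ### (value assumed for this sketch)
MIN_SECTION_TOKENS = 5   # illustrative threshold, not the real default

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def chunk_markdown(doc: str) -> list[str]:
    # Pass 1: cut the document into (title, body lines) at shallow headings.
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= MAX_SPLIT_DEPTH:
            sections.append((match.group(2).strip(), []))
        else:
            # Deeper headings and ordinary text stay inline as body.
            sections[-1][1].append(line)

    # Pass 2: emit chunks, merging undersized sections into the previous chunk.
    chunks: list[str] = []
    for title, body_lines in sections:
        body = "\n".join(body_lines).strip()
        if not body:
            continue  # heading with no body: dropped in this sketch
        if chunks and count_tokens(body) < MIN_SECTION_TOKENS:
            heading_line = f"{title}\n" if title else ""
            chunks[-1] += "\n" + heading_line + body
        else:
            prefix = f"[Section: {title}]\n" if title else ""
            chunks.append(prefix + body)
    return chunks
```

In this sketch a two-word section under a `##` heading is folded into the preceding chunk rather than becoming an orphan chunk of its own.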
If you are onboarding a client whose knowledge base does not benefit from heading-aware splitting, or you want to tune the MAX_SPLIT_DEPTH / MIN_SECTION_TOKENS thresholds, see Chunking Strategy in the ixrag docs for behaviour, tuning knobs, and the procedure for swapping the chunker.
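After changing the chunker, force-reingest the tenant so existing chunks are rebuilt under the new shape. The full rebuild mode (--reprocess-tenant <tenant>) rebuilds a tenant from scratch, at the cost of deleting its existing graph first.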
KB content management — nullify and restore¶
nullify-documents¶
Replace the content of specific files in a tenant's Google Drive folder with an empty placeholder. Use it to remove outdated or incorrect website pages from RAG retrieval.
# Dry run — always preview first
rose-document-loader nullify-documents <tenant> --env staging --dry-run <name1> <name2>
# Apply
rose-document-loader nullify-documents <tenant> --env staging <name1> <name2>
# Skip confirmation
rose-document-loader nullify-documents <tenant> --env staging --yes <name1> <name2>
Matching: by filename stem (no extension), glob, or normalized full path. Examples:
- linkedin-lead-generation matches linkedin-lead-generation.md
- pricing-* matches all pricing-* files
- blog-linkedin-lead-generation matches website/blog/linkedin-lead-generation.md
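One possible reading of these matching rules is sketched below. The normalization used here (drop the extension, join path components with hyphens, and allow leading directories to be omitted) is an assumption inferred from the three examples, and the `matches` helper is hypothetical. The real matcher may behave differently, so always confirm with --dry-run.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def normalized_forms(path: str) -> set[str]:
    """All hyphen-joined suffixes of the extension-less path.

    "website/blog/post.md" -> {"website-blog-post", "blog-post", "post"}
    """
    parts = PurePosixPath(path).with_suffix("").parts
    return {"-".join(parts[i:]) for i in range(len(parts))}


def matches(pattern: str, path: str) -> bool:
    """True if the pattern (plain name or glob) matches any normalized form."""
    return any(fnmatchcase(form, pattern) for form in normalized_forms(path))
```

Because the shortest form is the filename stem, plain stem matches and stem globs fall out of the same rule.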
Audit before nullify. Nullify wipes the entire file, not selected lines. Open each candidate in the KB dump and confirm it is 100% stale. If the page contains other useful content (testimonials, founder story, refund policy, FAQ entries, etc.), use a corrective FAQ + override skill instead.
After nullifying, run a smart update (--update-tenant <tenant>) to re-process the tenant.
restore-documents¶
Restore Google Drive file contents to their pre-nullify state via GDrive's native revision history. Counterpart to nullify-documents. No local KB dump required.
# Dry run — shows recovered revisions per file
rose-document-loader restore-documents <tenant> --env staging --dry-run <name1> <name2>
# Apply
rose-document-loader restore-documents <tenant> --env staging <name1> <name2>
How it works: for each matched file, walks GDrive's revision history newest-first and uploads the first revision whose content is not the nullified marker. If no clean revision exists for a file, that file is skipped (restore manually via GDrive UI → Manage versions).
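The revision walk described above can be sketched without any Google Drive calls. Everything named here is hypothetical: the `Revision` dataclass, the `NULLIFIED_MARKER` value, and the assumption that revisions arrive oldest-first. The real command fetches revisions from Drive and compares against its own placeholder content.

```python
from dataclasses import dataclass
from typing import Optional

NULLIFIED_MARKER = "<!-- nullified -->"  # placeholder value invented for this sketch


@dataclass(frozen=True)
class Revision:
    revision_id: str
    content: str


def pick_restorable(revisions_oldest_first: list[Revision]) -> Optional[Revision]:
    """Return the newest revision whose content is not the nullified marker."""
    for revision in reversed(revisions_oldest_first):  # walk newest-first
        if revision.content.strip() != NULLIFIED_MARKER:
            return revision
    # No clean revision: the file is skipped and must be restored manually.
    return None
```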
Matching: same semantics as nullify-documents (stem / glob / normalized path).
After restoring, run a smart update with timestamps ignored (--update-tenant <tenant> --ignore-timestamps) to re-ingest the restored chunks.
| Option | Default | Description |
|---|---|---|
| --env | (required) | Environment to run in (test, staging, production) |
| --dry-run | false | Show matching files + recovered revisions without modifying anything |
| --yes / -y | false | Skip confirmation prompt |
Related Documentation¶
- CLI Tenant - Tenant management across databases
- IXRag Package - RAG system documentation
- Chunking Strategy - Markdown-aware chunker behaviour and how to change it