rose-document-loader (CLI)

The rose-document-loader CLI tool processes documents and builds or updates the knowledge graph for RAG (Retrieval-Augmented Generation). It supports both incremental updates and full rebuilds.

Installation

The CLI tool is available after activating the Poetry shell in the backend directory:

cd backend
poetry shell

Once in the Poetry shell, the rose-document-loader command is available.

Usage

rose-document-loader --env <environment> <mode> [source] [OPTIONS]

Environments

Environment   Description
test          Test environment
staging       Staging environment
production    Production environment

Operation Modes

The CLI supports four mutually exclusive operation modes:

Mode                          Description
--update-tenant <tenant>      Smart update for single tenant (safe, incremental)
--update-all                  Smart update for ALL tenants in Supabase
--reprocess-tenant <tenant>   Rebuild single tenant from scratch (DELETES graph)
--reprocess-all               Rebuild ALL tenants from scratch (DELETES all graphs)

Command Reference

Smart Update (Single Tenant)

Safe, incremental update that detects changed documents via SHA-1 hashing. Creates the knowledge base if it doesn't exist.

rose-document-loader --env staging --update-tenant example.com

With explicit source:

# Local directory
rose-document-loader --env staging --update-tenant example.com /path/to/docs

# Google Drive folder
rose-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
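
The hash-based change detection behind a smart update can be pictured with a minimal Python sketch. It assumes that one SHA-1 digest per document ID was stored on the previous run; the function names and the stored_hashes mapping are hypothetical and do not describe the loader's actual internals.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Return the hex SHA-1 digest of a document's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def classify(documents: dict[str, bytes], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Sort document IDs into new / updated / unchanged by comparing digests.

    `stored_hashes` is a hypothetical mapping of document ID to the digest
    recorded on the previous run.
    """
    result: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": []}
    for doc_id, content in documents.items():
        digest = sha1_of(content)
        if doc_id not in stored_hashes:
            result["new"].append(doc_id)        # never seen before
        elif stored_hashes[doc_id] != digest:
            result["updated"].append(doc_id)    # content changed since last run
        else:
            result["unchanged"].append(doc_id)  # identical bytes, can be skipped
    return result
```

Only the new and updated groups would need re-processing, which is what makes an incremental run cheaper than a full rebuild.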

Smart Update (All Tenants)

Batch processing for all tenants configured in Supabase.

rose-document-loader --env staging --update-all

Force Update Specific Documents

Delete and reinsert specific documents within a tenant.

# Single document
rose-document-loader --env staging --update-tenant example.com --force-update-docs unfacts

# Multiple documents (repeated flag)
rose-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2

# Multiple documents (comma-separated)
rose-document-loader --env staging --update-tenant example.com --force-update-docs "doc1,doc2,doc3"
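
Both spellings above (repeated flag and comma-separated list) can be reduced to one list of document IDs. The sketch below shows one way to do that with Python's argparse; it is an illustration only, and the function name is hypothetical rather than taken from the loader's code.

```python
import argparse


def parse_force_update_docs(argv: list[str]) -> list[str]:
    """Collect document IDs from repeated and/or comma-separated flag values."""
    parser = argparse.ArgumentParser(prog="sketch")
    # action="append" gathers every occurrence of the flag into a list.
    parser.add_argument("--force-update-docs", action="append", default=None)
    args = parser.parse_args(argv)

    doc_ids: list[str] = []
    for value in args.force_update_docs or []:
        for part in value.split(","):
            part = part.strip()
            if part and part not in doc_ids:  # drop empties and duplicates, keep order
                doc_ids.append(part)
    return doc_ids
```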

Full Rebuild (Single Tenant)

Warning: This deletes the existing knowledge graph and rebuilds from scratch.

rose-document-loader --env staging --reprocess-tenant example.com

Full Rebuild (All Tenants)

Warning: This deletes ALL knowledge graphs and rebuilds from scratch.

rose-document-loader --env staging --reprocess-all

Options

Option               Default      Description
--env                (required)   Environment to run in
--limit              None         Limit number of files to process (useful for testing)
--chunk-size         512          Token size for document chunking
--enable-validation  false        Enable tenant isolation validation checks
--ignore-timestamps  false        Reload all files, ignoring last run timestamp (Google Drive only)

Examples

# Smart update single tenant (DEFAULT - safe, incremental)
rose-document-loader --env staging --update-tenant example.com

# Smart update with explicit Google Drive source
rose-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz

# Smart update ALL tenants
rose-document-loader --env staging --update-all

# Force update single document
rose-document-loader --env staging --update-tenant example.com --force-update-docs unfacts

# Force update multiple documents
rose-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2

# Rebuild single tenant (NUCLEAR - deletes graph)
rose-document-loader --env staging --reprocess-tenant example.com

# Rebuild ALL tenants (NUCLEAR - deletes all graphs)
rose-document-loader --env staging --reprocess-all

# Limit processing for testing
rose-document-loader --env staging --update-tenant example.com --limit 10

Source Resolution

The document loader resolves the source in the following order of priority, using the first one that applies:

  1. Explicit argument: A path or URL provided as a positional argument
  2. Supabase configuration: Custom Google Drive URL from knowledge.google_drive_folder in site config
  3. Default folder structure: Subfolder in GDRIVE_KNOWLEDGE_BASE_URL matching the tenant ID
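
The priority order above can be sketched as a small Python function. This is an illustration under stated assumptions: the site config is modelled as a nested dict with a knowledge.google_drive_folder key, and ResolvedSource and resolve_source are hypothetical names, not the loader's API.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ResolvedSource:
    origin: str                      # "explicit", "site_config", or "default"
    location: str                    # local path or Google Drive URL
    subfolder: Optional[str] = None  # folder name to look up under `location`


def resolve_source(
    tenant_id: str,
    explicit: Optional[str],
    site_config: Mapping[str, Any],
    knowledge_base_url: str,
) -> ResolvedSource:
    """Return the first source that applies, in priority order."""
    # 1. A positional path or URL always wins.
    if explicit:
        return ResolvedSource("explicit", explicit)
    # 2. A custom Drive folder from the tenant's site config (assumed dict shape).
    configured = (site_config.get("knowledge") or {}).get("google_drive_folder")
    if configured:
        return ResolvedSource("site_config", configured)
    # 3. Fall back to the tenant-named subfolder of the shared base folder.
    return ResolvedSource("default", knowledge_base_url, subfolder=tenant_id)
```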

Progress Display

The CLI shows a live progress display during processing:

  • Current operation status
  • Elapsed time
  • API call count
  • Slow API call count
  • Recent warnings (rate limiting, retries)

Notifications

Slack notifications are sent on completion or failure, including:

  • Success/failure status
  • Tenant ID and environment
  • Elapsed time and API call count
  • Document statistics (new, updated, unchanged, failed)

Log Files

All operations are logged to a file. The log file path is displayed when running any command:

Log file: /path/to/log/file.log

Troubleshooting

"Google Drive Authentication Error"

Ensure:

  1. GOOGLE_DRIVE_CREDENTIALS_PATH environment variable is set
  2. The credentials file exists and is a valid service account JSON key
  3. The service account has access to the Google Drive folder
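
The first two checks can be run locally before invoking the CLI. The Python sketch below is a hypothetical helper, not part of rose-document-loader; it relies only on the fact that a Google service account JSON key carries "type": "service_account" together with client_email and private_key fields. The third check (folder access) cannot be verified offline, but client_email is the address the Drive folder needs to be shared with.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def check_drive_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems found; an empty list means the key looks usable."""
    env = os.environ if env is None else env

    path = env.get("GOOGLE_DRIVE_CREDENTIALS_PATH")
    if not path:
        return ["GOOGLE_DRIVE_CREDENTIALS_PATH is not set"]

    key_file = Path(path)
    if not key_file.is_file():
        return [f"credentials file not found: {path}"]

    try:
        key = json.loads(key_file.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        return [f"credentials file is not valid JSON: {exc}"]
    if not isinstance(key, dict):
        return ["credentials file does not contain a JSON object"]

    problems: list[str] = []
    if key.get("type") != "service_account":
        problems.append('key "type" is not "service_account"')
    for field in ("client_email", "private_key"):
        if not key.get(field):
            problems.append(f"missing field: {field}")
    return problems
```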

Processing is slow

  • Check for rate-limiting messages in the progress display
  • Consider processing in smaller batches using --limit
  • The dedicated Azure endpoint for processing (AZURE_OPENAI_ENDPOINT_PROCESSING) is used automatically

Interrupted processing

Processing can be safely interrupted with Ctrl+C. Progress is saved, and you can resume by running the same command again.

How documents are chunked

During ingestion, the loader hands each document to LightRAG, which calls the configured chunking_func. The default, markdown_aware_chunking, splits on markdown headings (#, ##, ###), prepends [Section: …] context to each chunk, keeps deeper headings inline as body text, and back-merges undersized sections to avoid orphan chunks.
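
To make the behaviour concrete, here is a deliberately simplified Python sketch of heading-aware chunking. It is not the real markdown_aware_chunking: it counts whitespace-separated words instead of tokens, always merges an undersized section into the preceding chunk, and uses a hypothetical MIN_SECTION_WORDS threshold in place of the actual tuning knobs.

```python
import re

# Levels 1-3 only; "####" and deeper do not match and stay in the body.
HEADING = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")
MIN_SECTION_WORDS = 5  # hypothetical stand-in for a token threshold


def chunk_markdown(text: str, min_words: int = MIN_SECTION_WORDS) -> list[str]:
    """Split on #/##/### headings, prefix context, back-merge tiny sections."""
    sections: list[tuple[str, list[str]]] = []  # (heading title, body lines)
    title, body = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if title or any(l.strip() for l in body):
                sections.append((title, body))
            title, body = match.group(2), []
        else:
            body.append(line)
    if title or any(l.strip() for l in body):
        sections.append((title, body))

    chunks: list[str] = []
    for title, body in sections:
        content = "\n".join(body).strip()
        if chunks and len(content.split()) < min_words:
            # Back-merge: fold the undersized section into the previous chunk.
            merged = f"{title}\n{content}".strip()
            chunks[-1] = f"{chunks[-1]}\n\n{merged}"
        else:
            prefix = f"[Section: {title}]\n" if title else ""
            chunks.append(f"{prefix}{content}")
    return chunks
```

In this sketch a one-line section such as "## Tiny" followed by "Short." ends up appended to the chunk before it instead of becoming a chunk of its own.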

If you are onboarding a client whose knowledge base does not benefit from heading-aware splitting, or you want to tune the MAX_SPLIT_DEPTH / MIN_SECTION_TOKENS thresholds, see Chunking Strategy in the ixrag docs for behaviour, tuning knobs, and the procedure for swapping the chunker.

After changing the chunker, force-reingest the tenant's documents so that existing chunks are rebuilt under the new shape:

rose-document-loader update-tenant <tenant>.com --env <env> \
    --force-update-docs <doc-id>

KB content management — nullify and restore

nullify-documents

Replace the content of specific files in a tenant's Google Drive folder with an empty placeholder. Use it to remove outdated or incorrect website pages from RAG retrieval.

# Dry run — always preview first
rose-document-loader nullify-documents <tenant> --env staging --dry-run <name1> <name2>

# Apply
rose-document-loader nullify-documents <tenant> --env staging <name1> <name2>

# Skip confirmation
rose-document-loader nullify-documents <tenant> --env staging --yes <name1> <name2>

Matching: by filename stem (no extension), glob, or normalized full path. Examples:

  • linkedin-lead-generation matches linkedin-lead-generation.md
  • pricing-* matches all pricing-* files
  • blog-linkedin-lead-generation matches website/blog/linkedin-lead-generation.md
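
The three matching rules can be approximated in Python as below. The normalization rule is an assumption inferred from the last example (path components joined with "-", extension dropped, leading folders optional); the helper names are hypothetical and the real matcher may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def normalized_candidates(path: str) -> set[str]:
    """'website/blog/post.md' -> {'website-blog-post', 'blog-post', 'post'}."""
    parts = PurePosixPath(path).with_suffix("").parts
    return {"-".join(parts[i:]) for i in range(len(parts))}


def matches(pattern: str, path: str) -> bool:
    """True if `pattern` selects `path` by stem, glob, or normalized path."""
    stem = PurePosixPath(path).stem
    # Stem and glob: a pattern without wildcards is just an exact stem match.
    if fnmatchcase(stem, pattern):
        return True
    # Normalized path: folders joined with "-" (assumed rule, see lead-in).
    return pattern in normalized_candidates(path)
```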

Audit before nullify. Nullify wipes the entire file, not selected lines. Open each candidate in the KB dump and confirm it is 100% stale. If the page contains other useful content (testimonials, founder story, refund policy, FAQ entries, etc.), use a corrective FAQ + override skill instead.

After nullifying, run update-tenant to re-process.

restore-documents

Restore Google Drive file contents to their pre-nullify state via GDrive's native revision history. Counterpart to nullify-documents. No local KB dump required.

# Dry run — shows recovered revisions per file
rose-document-loader restore-documents <tenant> --env staging --dry-run <name1> <name2>

# Apply
rose-document-loader restore-documents <tenant> --env staging <name1> <name2>

How it works: for each matched file, the command walks GDrive's revision history newest-first and uploads the first revision whose content is not the nullified marker. If no clean revision exists for a file, that file is skipped; restore it manually via GDrive UI → Manage versions.
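
The revision walk can be sketched in Python over plain in-memory data. The Revision class, the NULLIFIED_MARKER value, and the function name are all hypothetical; the real command reads revisions from Google Drive, and the actual placeholder text is not documented here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

NULLIFIED_MARKER = "<nullified placeholder>"  # hypothetical marker text


@dataclass(frozen=True)
class Revision:
    revision_id: str
    modified_time: str  # timestamp, assumed UTC in one uniform format so it sorts as text
    content: str


def pick_restore_revision(
    revisions: Iterable[Revision], marker: str = NULLIFIED_MARKER
) -> Optional[Revision]:
    """Return the newest revision that is not the nullified placeholder."""
    for rev in sorted(revisions, key=lambda r: r.modified_time, reverse=True):
        if rev.content.strip() != marker:
            return rev  # first clean revision, walking newest-first
    return None  # nothing to restore: the file would be skipped
```

A None result corresponds to the skipped-file case, where manual recovery is the only option.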

Matching: same semantics as nullify-documents (stem / glob / normalized path).

After restoring, run update-tenant --ignore-timestamps to re-ingest the restored documents.

Option       Default      Description
--env        (required)   Environment to run in (test, staging, production)
--dry-run    false        Show matching files + recovered revisions without modifying anything
--yes / -y   false        Skip confirmation prompt