CLI Document Loader¶
The cli-document-loader command processes documents and builds or updates the knowledge graph for RAG (Retrieval-Augmented Generation). It supports both incremental updates and full rebuilds.
Installation¶
Activate the Poetry shell in the backend directory; the cli-document-loader command is then available.
Usage¶
Environments¶
| Environment | Description |
|---|---|
| test | Test environment |
| staging | Staging environment |
| production | Production environment |
Operation Modes¶
The CLI supports four mutually exclusive operation modes:
| Mode | Description |
|---|---|
| --update-tenant <tenant> | Smart update for single tenant (safe, incremental) |
| --update-all | Smart update for ALL tenants in Supabase |
| --reprocess-tenant <tenant> | Rebuild single tenant from scratch (DELETES graph) |
| --reprocess-all | Rebuild ALL tenants from scratch (DELETES all graphs) |
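To illustrate what "mutually exclusive" means here, the following is a minimal sketch of how such modes could be declared with Python's argparse. It is not the tool's actual implementation: the `build_parser` function is hypothetical, and making a mode mandatory (`required=True`) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose four operation modes cannot be combined (hypothetical)."""
    parser = argparse.ArgumentParser(prog="cli-document-loader")
    parser.add_argument("--env", required=True,
                        choices=["test", "staging", "production"])

    # argparse rejects any command line that names more than one mode.
    # Assumption: exactly one mode must always be given (required=True).
    modes = parser.add_mutually_exclusive_group(required=True)
    modes.add_argument("--update-tenant", metavar="TENANT")
    modes.add_argument("--update-all", action="store_true")
    modes.add_argument("--reprocess-tenant", metavar="TENANT")
    modes.add_argument("--reprocess-all", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--env", "staging", "--update-tenant", "example.com"]
)
print(args.update_tenant)  # example.com
```

With a parser like this, passing `--update-all` together with `--reprocess-all` exits with a usage error before any processing starts, which is the behavior you want for modes that delete data.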
Command Reference¶
Smart Update (Single Tenant)¶
Safe, incremental update that detects changes via SHA1 hashing. Creates the knowledge base if it doesn't exist.
With explicit source:
# Local directory
cli-document-loader --env staging --update-tenant example.com /path/to/docs
# Google Drive folder
cli-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
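The SHA1-based change detection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the loader's real code: the `sha1_of` and `classify` helpers and the idea of a stored name-to-hash mapping are hypothetical.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Return the hex SHA1 digest of a document's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def classify(documents: dict[str, bytes],
             stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Sort documents into new, updated, and unchanged by comparing hashes.

    `stored_hashes` stands in for whatever the loader recorded on its
    previous run (hypothetical: document name -> SHA1 hex digest).
    """
    result: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": []}
    for name, content in sorted(documents.items()):
        previous = stored_hashes.get(name)
        if previous is None:
            result["new"].append(name)        # never seen before
        elif previous != sha1_of(content):
            result["updated"].append(name)    # content changed since last run
        else:
            result["unchanged"].append(name)  # safe to skip
    return result


stored = {"faq.md": sha1_of(b"old text"), "intro.md": sha1_of(b"hello")}
docs = {"faq.md": b"new text", "intro.md": b"hello", "pricing.md": b"plans"}
print(classify(docs, stored))
# {'new': ['pricing.md'], 'updated': ['faq.md'], 'unchanged': ['intro.md']}
```

Only the documents in the new and updated groups need to be processed, which is what makes this mode incremental.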
Smart Update (All Tenants)¶
Batch processing for all tenants configured in Supabase.
Force Update Specific Documents¶
Delete and reinsert specific documents within a tenant.
# Single document
cli-document-loader --env staging --update-tenant example.com --force-update-docs unfacts
# Multiple documents (repeated flag)
cli-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2
# Multiple documents (comma-separated)
cli-document-loader --env staging --update-tenant example.com --force-update-docs "doc1,doc2,doc3"
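Both spellings above, the repeated flag and the comma-separated list, name the same set of documents. A minimal sketch of how the two forms could be normalized into one list, assuming argparse's `append` action; the `parse_force_update_docs` helper is hypothetical, and dropping duplicate names is an assumption rather than documented behavior.

```python
import argparse


def parse_force_update_docs(argv: list[str]) -> list[str]:
    """Collect document names from repeated and comma-separated flag values."""
    parser = argparse.ArgumentParser()
    # action="append" gathers every occurrence of the flag into a list.
    parser.add_argument("--force-update-docs", action="append", default=[])
    # Other flags (--env, --update-tenant, ...) are ignored in this sketch.
    args, _unknown = parser.parse_known_args(argv)

    names: list[str] = []
    for value in args.force_update_docs:
        for name in value.split(","):
            name = name.strip()
            # Assumption: blank entries and duplicates are dropped.
            if name and name not in names:
                names.append(name)
    return names


print(parse_force_update_docs(
    ["--force-update-docs", "doc1", "--force-update-docs", "doc2,doc3"]
))
# ['doc1', 'doc2', 'doc3']
```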
Full Rebuild (Single Tenant)¶
Warning: This deletes the existing knowledge graph and rebuilds from scratch.
Full Rebuild (All Tenants)¶
Warning: This deletes ALL knowledge graphs and rebuilds from scratch.
Options¶
| Option | Default | Description |
|---|---|---|
| --env | (required) | Environment to run in |
| --limit | None | Limit number of files to process (useful for testing) |
| --chunk-size | 512 | Token size for document chunking |
| --enable-validation | false | Enable tenant isolation validation checks |
| --ignore-timestamps | false | Reload all files, ignoring last run timestamp (Google Drive only) |
Examples¶
# Smart update single tenant (DEFAULT - safe, incremental)
cli-document-loader --env staging --update-tenant example.com
# Smart update with explicit Google Drive source
cli-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
# Smart update ALL tenants
cli-document-loader --env staging --update-all
# Force update single document
cli-document-loader --env staging --update-tenant example.com --force-update-docs unfacts
# Force update multiple documents
cli-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2
# Rebuild single tenant (NUCLEAR - deletes graph)
cli-document-loader --env staging --reprocess-tenant example.com
# Rebuild ALL tenants (NUCLEAR - deletes all graphs)
cli-document-loader --env staging --reprocess-all
# Limit processing for testing
cli-document-loader --env staging --update-tenant example.com --limit 10
Source Resolution¶
The document loader resolves sources in the following priority:
- Explicit argument: If a path or URL is provided as a positional argument
- Supabase configuration: Custom Google Drive URL from knowledge.google_drive_folder in site config
- Default folder structure: Subfolder in GDRIVE_KNOWLEDGE_BASE_URL matching the tenant ID
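The priority order above amounts to a simple fallback chain. The sketch below shows one way to express it; the `ResolvedSource` type and `resolve_source` function are hypothetical, and representing the site config as a nested dictionary is an assumption. How the tenant subfolder is actually located inside the base folder is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedSource:
    kind: str                        # "explicit", "site_config", or "default_subfolder"
    location: str                    # path or URL to read from
    subfolder: Optional[str] = None  # tenant subfolder, default case only


def resolve_source(tenant_id: str,
                   explicit: Optional[str],
                   site_config: dict,
                   base_url: str) -> ResolvedSource:
    """Pick the document source using the three-step priority order."""
    # 1. A positional path or URL always wins.
    if explicit:
        return ResolvedSource("explicit", explicit)
    # 2. Otherwise use the tenant's custom folder from its site config.
    #    Assumption: the config is a nested dict mirroring the dotted key.
    configured = site_config.get("knowledge", {}).get("google_drive_folder")
    if configured:
        return ResolvedSource("site_config", configured)
    # 3. Fall back to the tenant-named subfolder of the base folder
    #    (base_url stands in for the GDRIVE_KNOWLEDGE_BASE_URL value).
    return ResolvedSource("default_subfolder", base_url, subfolder=tenant_id)


source = resolve_source("example.com", None, {}, "<base folder URL>")
print(source.kind, source.subfolder)  # default_subfolder example.com
```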
Progress Display¶
The CLI shows a live progress display during processing:
- Current operation status
- Elapsed time
- API call count
- Slow API call count
- Recent warnings (rate limiting, retries)
Notifications¶
Slack notifications are sent on completion or failure, including:
- Success/failure status
- Tenant ID and environment
- Elapsed time and API call count
- Document statistics (new, updated, unchanged, failed)
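As a rough illustration of how these fields could be assembled into one message, here is a sketch that builds a payload in the `{"text": ...}` shape accepted by Slack incoming webhooks. The `build_notification` function, the message wording, and the use of a webhook at all are assumptions; the actual notification format may differ.

```python
def build_notification(success: bool,
                       tenant_id: str,
                       env: str,
                       elapsed_seconds: float,
                       api_calls: int,
                       stats: dict[str, int]) -> dict[str, str]:
    """Build a Slack-style message payload from a run summary (hypothetical)."""
    status = "SUCCESS" if success else "FAILURE"
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    # Missing counters are reported as zero.
    counts = ", ".join(
        f"{stats.get(key, 0)} {key}"
        for key in ("new", "updated", "unchanged", "failed")
    )
    lines = [
        f"Document loader: {status}",
        f"Tenant: {tenant_id} ({env})",
        f"Elapsed: {minutes}m {seconds:02d}s, API calls: {api_calls}",
        f"Documents: {counts}",
    ]
    return {"text": "\n".join(lines)}


payload = build_notification(
    True, "example.com", "staging", 125, 42,
    {"new": 3, "updated": 1, "unchanged": 10},
)
print(payload["text"])
```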
Log Files¶
All operations are logged to a file. The log file path is displayed when you run any command.
Troubleshooting¶
"Google Drive Authentication Error"¶
Ensure:
- The GOOGLE_DRIVE_CREDENTIALS_PATH environment variable is set
- The credentials file exists and is a valid service account JSON key
- The service account has access to the Google Drive folder
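The first two checks can be automated locally. The following is a minimal sketch, assuming a standard Google service account key file (which carries `"type": "service_account"` along with `client_email` and `private_key` fields). The `check_drive_credentials` helper is hypothetical and not part of the CLI, and it cannot verify the third point, folder access, which requires a call to Google Drive.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def check_drive_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems found; an empty list means the local checks pass."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_DRIVE_CREDENTIALS_PATH")
    if not path:
        return ["GOOGLE_DRIVE_CREDENTIALS_PATH is not set"]

    key_file = Path(path)
    if not key_file.is_file():
        return [f"credentials file not found: {path}"]

    try:
        key = json.loads(key_file.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        return [f"credentials file is not readable JSON: {exc}"]
    if not isinstance(key, dict):
        return ["credentials file is not a JSON object"]

    problems = []
    if key.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    for field in ("client_email", "private_key"):
        if not key.get(field):
            problems.append(f'missing "{field}" field')
    return problems


for problem in check_drive_credentials():
    print("Problem:", problem)
```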
Processing is slow¶
- Check for rate-limiting messages in the progress display
- Consider processing in smaller batches using --limit
- The dedicated Azure endpoint for processing (AZURE_OPENAI_ENDPOINT_PROCESSING) is used automatically
Interrupted processing¶
Processing can be safely interrupted with Ctrl+C. Progress is saved, and you can resume by running the same command again.
Related Documentation¶
- CLI Tenant - Tenant management across databases
- IXRag Package - RAG system documentation