CLI Document Loader

The cli-document-loader command processes documents and builds or updates the knowledge graph for RAG (Retrieval-Augmented Generation). It supports both incremental updates and full rebuilds.

Installation

The CLI tool is available after activating the Poetry shell in the backend directory:

cd backend
poetry shell

Once in the Poetry shell, the cli-document-loader command is available.

Usage

cli-document-loader --env <environment> <mode> [source] [OPTIONS]

Environments

Environment   Description
test          Test environment
staging       Staging environment
production    Production environment

Operation Modes

The CLI supports four mutually exclusive operation modes:

Mode                          Description
--update-tenant <tenant>      Smart update for single tenant (safe, incremental)
--update-all                  Smart update for ALL tenants in Supabase
--reprocess-tenant <tenant>   Rebuild single tenant from scratch (DELETES graph)
--reprocess-all               Rebuild ALL tenants from scratch (DELETES all graphs)
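
The "mutually exclusive" rule for the four modes can be illustrated with a small parser. This is only a sketch built on Python's standard argparse module; the loader's real argument handling is not shown in this document and may use a different library. The build_parser function is hypothetical, and only the flag names come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the tool's real code)."""
    parser = argparse.ArgumentParser(prog="cli-document-loader")
    parser.add_argument("--env", required=True, choices=["test", "staging", "production"])

    # Exactly one of the four modes must be supplied; argparse rejects combinations.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--update-tenant", metavar="TENANT")
    mode.add_argument("--update-all", action="store_true")
    mode.add_argument("--reprocess-tenant", metavar="TENANT")
    mode.add_argument("--reprocess-all", action="store_true")

    # Optional positional source: a local path or a Google Drive folder URL.
    parser.add_argument("source", nargs="?")
    return parser


args = build_parser().parse_args(["--env", "staging", "--update-tenant", "example.com"])
# args.update_tenant is "example.com"; args.source is None because no source was given.
```

With a parser shaped like this, passing two modes at once (for example --update-all together with --reprocess-all) exits with a usage error instead of running either operation.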

Command Reference

Smart Update (Single Tenant)

A safe, incremental update that detects changes via SHA-1 hashing. It creates the knowledge base if it doesn't exist.

cli-document-loader --env staging --update-tenant example.com

With explicit source:

# Local directory
cli-document-loader --env staging --update-tenant example.com /path/to/docs

# Google Drive folder
cli-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz
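
The SHA-1 change detection behind the smart update can be pictured as follows. This is a minimal sketch, not the loader's actual code: the classify function, the in-memory documents mapping, and the stored_hashes mapping are all assumptions made for illustration. The category names match the document statistics the tool reports.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Return the hex SHA-1 digest of a document's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def classify(documents: dict[str, bytes], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Split documents into new, updated, and unchanged by comparing digests.

    `stored_hashes` stands in for whatever the loader persisted on its last run
    (hypothetical; the real storage location is not documented here).
    """
    result: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": []}
    for name, content in documents.items():
        previous = stored_hashes.get(name)
        if previous is None:
            result["new"].append(name)        # never seen before
        elif previous != sha1_of(content):
            result["updated"].append(name)    # content changed since last run
        else:
            result["unchanged"].append(name)  # safe to skip
    return result
```

Only the new and updated entries would need to be processed, which is what makes an incremental run cheaper than a full rebuild.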

Smart Update (All Tenants)

Batch processing for all tenants configured in Supabase.

cli-document-loader --env staging --update-all

Force Update Specific Documents

Delete and reinsert specific documents within a tenant.

# Single document
cli-document-loader --env staging --update-tenant example.com --force-update-docs unfacts

# Multiple documents (repeated flag)
cli-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2

# Multiple documents (comma-separated)
cli-document-loader --env staging --update-tenant example.com --force-update-docs "doc1,doc2,doc3"
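
Both multi-document forms above describe the same list of documents. A sketch of how repeated and comma-separated values might be normalized into one list is shown below; normalize_doc_names is a hypothetical helper, and the loader's own parsing may differ (for example in how it treats duplicates).

```python
def normalize_doc_names(values: list[str]) -> list[str]:
    """Flatten repeated and comma-separated --force-update-docs values.

    Hypothetical helper: strips whitespace, drops empty entries, and keeps the
    first occurrence of each name in order.
    """
    names: list[str] = []
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if part and part not in names:
                names.append(part)
    return names


# Repeated flag: each occurrence arrives as a separate value.
repeated = normalize_doc_names(["doc1", "doc2"])
# Comma-separated: one value carrying several names.
combined = normalize_doc_names(["doc1,doc2,doc3"])
```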

Full Rebuild (Single Tenant)

Warning: This deletes the existing knowledge graph and rebuilds from scratch.

cli-document-loader --env staging --reprocess-tenant example.com

Full Rebuild (All Tenants)

Warning: This deletes ALL knowledge graphs and rebuilds from scratch.

cli-document-loader --env staging --reprocess-all

Options

Option                Default      Description
--env                 (required)   Environment to run in
--limit               None         Limit number of files to process (useful for testing)
--chunk-size          512          Token size for document chunking
--enable-validation   false        Enable tenant isolation validation checks
--ignore-timestamps   false        Reload all files, ignoring last run timestamp (Google Drive only)

Examples

# Smart update single tenant (DEFAULT - safe, incremental)
cli-document-loader --env staging --update-tenant example.com

# Smart update with explicit Google Drive source
cli-document-loader --env staging --update-tenant example.com https://drive.google.com/drive/folders/xyz

# Smart update ALL tenants
cli-document-loader --env staging --update-all

# Force update single document
cli-document-loader --env staging --update-tenant example.com --force-update-docs unfacts

# Force update multiple documents
cli-document-loader --env staging --update-tenant example.com --force-update-docs doc1 --force-update-docs doc2

# Rebuild single tenant (NUCLEAR - deletes graph)
cli-document-loader --env staging --reprocess-tenant example.com

# Rebuild ALL tenants (NUCLEAR - deletes all graphs)
cli-document-loader --env staging --reprocess-all

# Limit processing for testing
cli-document-loader --env staging --update-tenant example.com --limit 10

Source Resolution

The document loader resolves the source in the following order of priority:

  1. Explicit argument: a path or URL provided as the positional argument
  2. Supabase configuration: a custom Google Drive URL from knowledge.google_drive_folder in the site config
  3. Default folder structure: the subfolder in GDRIVE_KNOWLEDGE_BASE_URL matching the tenant ID
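
The priority order above amounts to a simple fallback chain. The sketch below assumes the site config is available as a nested dictionary with a "knowledge" section; that shape, the resolve_source function, and the returned labels are hypothetical and only illustrate the ordering.

```python
from typing import Optional


def resolve_source(explicit: Optional[str], site_config: dict, tenant_id: str) -> tuple[str, str]:
    """Return (origin, value) for the first source that is available.

    Hypothetical sketch of the documented priority; not the loader's real code.
    """
    # 1. An explicit positional argument always wins.
    if explicit:
        return ("explicit", explicit)

    # 2. Otherwise use the custom Google Drive URL from the site config, if set.
    knowledge = site_config.get("knowledge") or {}
    configured = knowledge.get("google_drive_folder")
    if configured:
        return ("site_config", configured)

    # 3. Otherwise fall back to the subfolder named after the tenant ID inside
    #    the folder given by GDRIVE_KNOWLEDGE_BASE_URL (lookup not shown here).
    return ("default_subfolder", tenant_id)
```

Returning the origin alongside the value makes it easy to log which rule was applied when a tenant loads from an unexpected folder.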

Progress Display

The CLI shows a live progress display during processing:

  • Current operation status
  • Elapsed time
  • API call count
  • Slow API call count
  • Recent warnings (rate limiting, retries)

Notifications

Slack notifications are sent on completion or failure, including:

  • Success/failure status
  • Tenant ID and environment
  • Elapsed time and API call count
  • Document statistics (new, updated, unchanged, failed)
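
To make the notification contents concrete, here is a sketch of how such a summary could be assembled. The build_summary function and the exact message layout are assumptions; the real message format is not documented here. The {"text": ...} payload is the basic JSON shape accepted by Slack incoming webhooks, though whether the loader uses a webhook is also an assumption.

```python
import json


def build_summary(success: bool, tenant: str, env: str,
                  elapsed_s: float, api_calls: int, stats: dict[str, int]) -> str:
    """Format a run summary covering the fields listed above (hypothetical layout)."""
    status = "SUCCESS" if success else "FAILURE"
    # Missing categories are reported as zero rather than omitted.
    counts = ", ".join(
        f"{key}={stats.get(key, 0)}" for key in ("new", "updated", "unchanged", "failed")
    )
    return (
        f"[{status}] {tenant} ({env})\n"
        f"elapsed: {elapsed_s:.1f}s, API calls: {api_calls}\n"
        f"documents: {counts}"
    )


text = build_summary(True, "example.com", "staging", 42.0, 17,
                     {"new": 2, "updated": 1, "unchanged": 5})
payload = json.dumps({"text": text})  # body for a Slack incoming webhook POST
```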

Log Files

All operations are logged to a file. The log file path is displayed when running any command:

Log file: /path/to/log/file.log

Troubleshooting

"Google Drive Authentication Error"

Ensure:

  1. GOOGLE_DRIVE_CREDENTIALS_PATH environment variable is set
  2. The credentials file exists and is a valid service account JSON key
  3. The service account has access to the Google Drive folder
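
The first two checks can be run locally before starting a long job. The sketch below is a hypothetical pre-flight helper, not part of the CLI; it relies only on the fact that Google service account key files are JSON objects with "type" set to "service_account" and with client_email and private_key fields. The third check (folder access) needs a real Drive API call and is not covered.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def check_drive_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return a list of problems found; an empty list means checks 1 and 2 pass."""
    env = os.environ if env is None else env

    path = env.get("GOOGLE_DRIVE_CREDENTIALS_PATH")
    if not path:
        return ["GOOGLE_DRIVE_CREDENTIALS_PATH is not set"]

    key_file = Path(path)
    if not key_file.is_file():
        return [f"credentials file not found: {path}"]

    try:
        data = json.loads(key_file.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return [f"credentials file is not readable JSON: {exc}"]

    if not isinstance(data, dict) or data.get("type") != "service_account":
        return ['credentials file is not a service account key ("type" mismatch)']

    # A service account key must carry these fields to authenticate.
    return [f"missing field: {field}"
            for field in ("client_email", "private_key") if not data.get(field)]
```

Calling check_drive_credentials() with no arguments inspects the current process environment; passing a dictionary makes the helper easy to exercise in tests.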

Processing is slow

  • Check for rate-limiting messages in the progress display
  • Consider processing in smaller batches using --limit
  • The dedicated Azure endpoint for processing (AZURE_OPENAI_ENDPOINT_PROCESSING) is used automatically

Interrupted processing

Processing can be safely interrupted with Ctrl+C. Progress is saved, and you can resume by running the same command again.