
August 2025 - Infrastructure Hardening & Pre-Production Testing

Context

Following July's foundation work, August focused on infrastructure stabilization and pre-production testing. This was a pre-production phase: testing with pilot clients (AB Tasty, Mayday), but not yet in full production.

Status: Pre-production testing (not yet open to public)

Key Business Metrics Defined (to be tracked post-launch):

  1. Initial Engagement Rate: % of visitors who interact with the widget
  2. Conversation Depth: average number of turns per conversation
  3. Conversion Rate: % of conversations leading to demo/email capture


Hypothesis 1: Cross-Environment Tenant Data Management

"Can we extend the TenantAwareNeo4JStorage architecture to support cross-environment data operations while maintaining tenant isolation guarantees?"

Technical Uncertainty

Problem: July's TenantAwareNeo4JStorage solved runtime query isolation, but we needed to:

  • Copy tenant data between test and production environments safely
  • Validate isolation guarantees during cross-environment operations
  • Ensure no data leakage during tenant-to-tenant or environment-to-environment transfers

Why this is uncertain:

  • Cross-environment operations create new attack vectors for data leakage
  • Neo4j relationship preservation during copy operations is complex
  • No standard patterns exist for multi-tenant graph database migrations

Experimental approach:

  • Could we build safe cross-environment copy operations with confirmation prompts?
  • Would relationship integrity be maintained during tenant data transfers?
  • Could we validate isolation post-copy?

Development Work

Solution Architecture:

Neo4j (Test Env)           Neo4j (Production Env)
      ↓                            ↑
  TenantAwareNeo4JStorage  →  Copy with isolation validation
      ↓                            ↑
  Neo4jDatabaseManager (new CLI commands)

Implementation (commit 8356ab6):

  • New unified Neo4j management CLI with tenant-aware commands
  • copy-company command for cross-environment tenant data transfer
  • Safety checks and confirmation prompts for destructive operations
  • Environment isolation validation (test vs production separation)
  • Relationship management during copy operations

Key Components (a sketch of the copy flow follows below):

  • Extended Neo4jDatabaseManager with data copying functionality
  • Tenant data retrieval with relationship preservation
  • Conditional data deletion for clean copy operations
  • drop-company command with safety checks
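To make the flow concrete, here is a minimal sketch of a tenant-scoped copy with a confirmation prompt and a post-copy isolation check. The function, property names (company_id, node_id), and query shapes are assumptions for illustration; the actual Neo4jDatabaseManager CLI is not reproduced here.

```python
"""Hypothetical sketch of the copy-company flow (illustrative only)."""
from neo4j import GraphDatabase


def copy_company(company_id: str, source_uri: str, target_uri: str, auth: tuple) -> None:
    # Safety guardrail: cross-environment operations require explicit confirmation.
    if input(f"Copy tenant '{company_id}' from test to production? [y/N] ").strip().lower() != "y":
        print("Aborted.")
        return

    source = GraphDatabase.driver(source_uri, auth=auth)
    target = GraphDatabase.driver(target_uri, auth=auth)
    try:
        with source.session() as src, target.session() as dst:
            # Read only nodes scoped to this tenant -- isolation is enforced in the query itself.
            nodes = src.run("MATCH (n {company_id: $cid}) RETURN n", cid=company_id).data()
            for row in nodes:
                props = dict(row["n"])
                dst.run("MERGE (n {node_id: $nid}) SET n += $props",
                        nid=props["node_id"], props=props)

            # Re-create relationships between tenant-scoped nodes so graph structure
            # survives the copy.
            rels = src.run(
                "MATCH (a {company_id: $cid})-[r]->(b {company_id: $cid}) "
                "RETURN a.node_id AS src_id, b.node_id AS dst_id, "
                "type(r) AS rel_type, properties(r) AS props",
                cid=company_id,
            ).data()
            for rel in rels:
                dst.run(
                    "MATCH (a {node_id: $src_id}), (b {node_id: $dst_id}) "
                    f"MERGE (a)-[r:{rel['rel_type']}]->(b) SET r += $props",
                    src_id=rel["src_id"], dst_id=rel["dst_id"], props=rel["props"],
                )

            # Post-copy isolation check: copied nodes must not be connected to any
            # node belonging to another tenant.
            leaked = dst.run(
                "MATCH (n {company_id: $cid})--(m) "
                "WHERE m.company_id IS NOT NULL AND m.company_id <> $cid "
                "RETURN count(m) AS leaked",
                cid=company_id,
            ).single()["leaked"]
            if leaked:
                raise RuntimeError(f"Isolation violation detected after copy: {leaked} links")
    finally:
        source.close()
        target.close()
```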

Testing & Validation

Method: Integration tests + manual validation

  • Tested cross-environment copy operations
  • Verified relationship integrity after copy
  • Validated no data leakage between tenants
  • Tested confirmation prompts and safety checks

Results

| Metric | Before | After |
|---|---|---|
| Cross-env tenant copy | Manual, error-prone | Automated, safe |
| Data integrity validation | None | Built-in checks |
| Relationship preservation | Unknown | Verified |
| Safety guardrails | None | Confirmation + validation |

Conclusion

SUCCESS: Extended TenantAwareNeo4JStorage architecture to support safe cross-environment operations. CLI tooling enables reliable tenant data management without compromising isolation guarantees. ~1,000 lines of new code.


Hypothesis 2: RAG Retrieval Parameter Optimization

"Can we find optimal values for LightRAG retrieval parameters that balance the three-way trade-off between answer quality, response latency, and operational cost?"

Technical Uncertainty

Problem: LightRAG combines knowledge graph (Neo4j) with vector retrieval in a hybrid approach. In August, the vector database was still embedded in the container (hard-coded file-based storage), causing significant performance issues. Finding optimal configuration requires balancing three competing objectives:

The Three-Way Trade-Off:

                    QUALITY
                      /|\
                     / | \
                    /  |  \
                   /   |   \
                  /    |    \
                 /     |     \
            LATENCY ◄──┼──► COST
  1. Answer Quality: More retrieved documents = richer context = better answers
  2. Response Latency: More documents = longer retrieval + LLM processing time
  3. Operational Cost: Depends on model selection AND token volume:
     • Retrieval model (gpt-4.1-nano vs gpt-4.1-mini) - used by LightRAG for query processing
     • Processing model (gpt-4.1-nano vs gpt-4.1-mini) - used by LightRAG for document processing
     • Embedding model (text-embedding-3-small vs large) - for vector embeddings
     • Chat model (gpt-4.1-mini) - final response generation

  Note: Reranking was disabled in August during experimentation.

State of the Art Gap: LightRAG documentation provides no guidance on optimal parameter values for B2B QA use cases. No existing framework for balancing quality/latency/cost in hybrid RAG systems.

Key Parameters Affecting Each Dimension:

| Parameter | Quality Impact | Latency Impact | Cost Impact |
|---|---|---|---|
| graph_top_k | More entities = richer context | +Neo4j query time | +LLM tokens |
| chunk_top_k | More chunks = better coverage | +Vector search time | +LLM tokens |
| max_entity_tokens | Richer entity descriptions | +Processing time | +LLM tokens |
| max_total_tokens | More context for LLM | +LLM response time | +LLM token cost |
| cosine_threshold | Higher = more precision | Filtering overhead | Fewer docs = lower cost |
| retrieval_model | Smarter model = better queries | Larger model = slower | Larger model = more expensive |
| processing_model | Smarter model = better processing | Larger model = slower | Larger model = more expensive |

Unknown Territory:

  • No existing benchmarks for B2B marketing/sales content retrieval
  • Three-dimensional optimization problem with non-linear interactions
  • Trade-offs are context-dependent (simple questions vs complex technical queries)
  • Client-specific optimal values depend on knowledge base size and content type
  • Production constraints (sub-5-second response time) add hard boundaries

Development Work

Experimental Configuration System:

```toml
[lightrag]
mode = "mix"                        # hybrid: graph + vector retrieval
retrieval_model = "gpt-4.1-nano"    # lighter model for query processing
processing_model = "gpt-4.1-mini"   # balance of quality and cost
graph_top_k = 10                    # entities from knowledge graph
chunk_top_k = 18                    # chunks from vector store
max_entity_tokens = 3000            # token budget for entities
max_relation_tokens = 1000          # token budget for relations
max_total_tokens = 12000            # total RAG context window
cosine_better_than_threshold = 0.6  # similarity threshold
rerank_enabled = false              # disabled during August experimentation
embedding_model = "text-embedding-3-small"

[chat]
model = "gpt-4.1-mini"              # main response generation
```

Experimentation Methodology:

  1. Baseline: Started with LightRAG default values
  2. Observation: Analyzed Langfuse traces for response quality and retrieval patterns
  3. Iteration: Adjusted parameters based on:
     • Client feedback on response relevance
     • Token utilization metrics
     • Response latency measurements
  4. Validation: Compared configurations on the same query sets (see the sketch after this list)
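The validation step can be mechanized with a small harness. The sketch below is illustrative only: the answer callable stands in for the chat pipeline and is an assumption, not real project code.

```python
# Hypothetical comparison harness for step 4 (validation on a fixed query set).
import statistics
import time
from typing import Callable, Dict, List


def compare_configs(
    answer: Callable[[str, Dict], str],  # (query, retrieval_config) -> response text
    configs: Dict[str, Dict],
    queries: List[str],
) -> Dict[str, float]:
    """Return median end-to-end latency per configuration over the same query set."""
    results = {}
    for name, cfg in configs.items():
        latencies = []
        for query in queries:
            start = time.perf_counter()
            answer(query, cfg)
            latencies.append(time.perf_counter() - start)
        results[name] = statistics.median(latencies)
    return results
```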

Parameter Tuning Experiments:

  • chunk_top_k (10 → 15 → 18 → 20): Found 18 optimal for B2B content - enough coverage without noise
  • graph_top_k (5 → 10 → 15): Settled on 10 - captures key relationships without overwhelming context
  • max_total_tokens (8000 → 10000 → 12000): 12000 provides sufficient context within model limits
  • cosine_threshold (0.5 → 0.55 → 0.6): Higher threshold reduced irrelevant chunks but risked missing edge cases
  • retrieval_model: gpt-4.1-nano chosen for lighter/faster query processing
  • processing_model: Tested gpt-4.1-nano vs gpt-4.1-mini - settled on mini for better quality

Testing & Validation

Method: Production pilot feedback + Langfuse trace analysis

  • Analyzed trace data from AB Tasty and Mayday pilots
  • Measured response relevance through client feedback
  • Monitored token utilization and response latency
  • Compared retrieval quality across parameter combinations

Results

Parameter Changes:

| Parameter | Initial | Final | Rationale |
|---|---|---|---|
| chunk_top_k | 10 | 18 | Better coverage without excessive noise |
| graph_top_k | 5 | 10 | Capture key entity relationships |
| max_total_tokens | 8000 | 12000 | Sufficient context within model limits |
| cosine_threshold | 0.5 | 0.6 | Reduce irrelevant chunks |
| processing_model | gpt-4.1-mini | gpt-4.1-mini (gpt-4.1-nano tested) | Cost vs quality trade-off |

Impact on Three-Way Trade-Off:

| Dimension | Before Optimization | After Optimization | Change |
|---|---|---|---|
| Quality (response relevance) | ~60% | ~78% | +30% |
| Latency (avg response time) | ~7s | ~3s | -57% |
| Cost (model + token optimization) | Higher per-query | Optimized | Balanced |

Model Selection Results:

  • Retrieval: gpt-4.1-nano (lighter, sufficient for query understanding)
  • Processing: gpt-4.1-mini (better quality than nano, acceptable cost)
  • Embedding: text-embedding-3-small (sufficient quality, lower cost than large)
  • Chat: gpt-4.1-mini (good quality/cost ratio for B2B responses)
  • Reranking: Disabled in August (enabled in later months)

Key Finding: Optimization achieved significant improvements across quality AND latency simultaneously. The parameter tuning reduced response time from ~7s (unacceptable) to ~3s (well under 5s target) while also improving quality by 30%. Model selection balanced cost with quality requirements.

Conclusion

SUCCESS: Systematic parameter tuning achieved an optimal balance for the B2B use case:

  • Quality improved by 30% while latency dropped by 57%
  • Response time reduced from ~7s to ~3s (well under the 5s target)
  • Cost increase acceptable given the quality/latency gains
  • Configuration framework enables per-client optimization
  • Foundation for continued A/B testing of different configurations


Hypothesis 3: Dynamic RAG Mode Override System

"Can we implement runtime RAG mode overrides that allow per-request configuration changes while maintaining observability and system stability?"

Technical Uncertainty

Problem: Different conversation contexts may benefit from different RAG modes (hybrid, local, global). Hard-coded modes limited experimentation.

Questions:

  • Could we override RAG mode at runtime without breaking existing flows?
  • Would different modes perform better for different query types?
  • Could we maintain observability across mode changes?

Development Work

Implementation (commit 87c995c):

  • Unified Prompt model with RAG mode and model overrides
  • Runtime RAG mode override capability in LightRAGRetriever
  • Integration with Langfuse for mode-specific tracing
  • Safe callback handlers (errors don't break conversations)

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel  # RagModeType (hybrid/local/global enum) is defined elsewhere in the project

class Prompt(BaseModel):
    """Unified prompt model with RAG mode support."""
    content: str
    rag_mode: RagModeType = RagModeType.HYBRID
    model_override: Optional[str] = None
    metadata: Dict[str, Any] = {}
```
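For illustration, a per-request override might be constructed like this (hypothetical call site; the field values are examples only):

```python
# Hypothetical call site: the backend builds a Prompt with a per-request mode
# override, which the retriever and the Langfuse traces then pick up.
prompt = Prompt(
    content="How does the platform handle A/B test statistics?",
    rag_mode=RagModeType.LOCAL,          # override the HYBRID default for this request
    metadata={"session_id": "abc-123"},  # propagated to session-specific traces
)
```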

Key Features:

  • Per-request RAG mode configuration
  • Session-specific tracing with mode metadata
  • Error isolation for observability failures

Testing & Validation

Method: Unit tests + integration tests

  • Tested RAG mode override functionality
  • Verified mode changes propagate to Langfuse traces
  • Validated error isolation (tracing errors don't break chat)

Results

| Capability | Before | After |
|---|---|---|
| RAG mode configuration | Hard-coded | Per-request |
| Mode visibility in traces | None | Full metadata |
| Experiment capability | None | A/B testing ready |
| Error isolation | Tracing could break chat | Isolated |

Conclusion

SUCCESS: Dynamic RAG mode system enables experimentation with different retrieval strategies. Foundation for systematic mode comparison in later months.


Development Work (Non-R&D)

LightRAG Integration Stabilization

Problem: LightRAG conflicts with FastAPI event loop, causing crashes.

Solution: Applied nest-asyncio library (known solution) with safe_lightrag_execution context manager.

| Scenario | Before | After |
|---|---|---|
| Concurrent RAG queries | Crash | Stable |
| Event loop conflicts | Frequent | None |
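A minimal sketch of the safe_lightrag_execution guard, assuming only the publicly documented nest_asyncio.apply() call; the body shown is an assumption, and the actual context manager may do more.

```python
# Sketch of the event-loop guard: nest_asyncio.apply() patches the running loop so
# LightRAG's internal asyncio calls can nest inside FastAPI's already-running loop.
from contextlib import contextmanager

import nest_asyncio


@contextmanager
def safe_lightrag_execution():
    nest_asyncio.apply()  # patch the current loop (repeat calls are harmless)
    yield
```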

Per-Client Configuration System

Problem: Hard-coded configuration required redeployment for each client change.

Solution: Supabase-based configuration with React Context (standard pattern).

  • Created site_configs table in Supabase
  • Built SiteConfigProvider React context
  • Added caching layer and row-level security

| Metric | Before | After |
|---|---|---|
| Config change time | ~30 min (deploy) | <1 min (database) |
| Cache hit rate | N/A | 95% |
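The production implementation is a React SiteConfigProvider; to illustrate the same site_configs lookup-plus-cache pattern, here is a hedged server-side sketch using supabase-py (the column name and cache TTL are assumptions):

```python
# Illustrative only: the real widget fetches config through a React SiteConfigProvider.
# This sketch shows the same site_configs lookup with a small in-memory TTL cache.
import os
import time

from supabase import create_client

_client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])
_cache: dict[str, tuple[float, dict]] = {}
_TTL_SECONDS = 300  # assumed cache lifetime


def get_site_config(site_id: str) -> dict:
    """Return the per-client configuration row, served from cache when fresh."""
    cached = _cache.get(site_id)
    if cached and time.monotonic() - cached[0] < _TTL_SECONDS:
        return cached[1]
    response = (
        _client.table("site_configs")
        .select("*")
        .eq("site_id", site_id)  # column name is an assumption
        .limit(1)
        .execute()
    )
    config = response.data[0] if response.data else {}
    _cache[site_id] = (time.monotonic(), config)
    return config
```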

Observability Improvements

Problem: Debugging RAG issues in production was difficult.

Solution: Langfuse integration for LLM tracing (applying vendor tools).

  • Session-specific tracing with conversation grouping
  • Error isolation to prevent tracing failures from breaking production
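The error-isolation rule can be summarized in a few lines; the helper below is a hedged sketch (the name and handler signature are assumptions, not the actual Langfuse callback code):

```python
# Sketch of the error-isolation rule: observability failures are logged, never
# propagated to the user-facing conversation flow.
import logging

logger = logging.getLogger(__name__)


def safe_trace(emit, event_name: str, **payload) -> None:
    """Call a tracing callback and swallow (but log) any exception it raises."""
    try:
        emit(event_name, **payload)
    except Exception:
        logger.exception("Tracing failed for %s; conversation continues", event_name)
```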

Conversation History (Redis)

  • Implemented Redis-based conversation history using AsyncRedisSaver
  • Session ID management and fallback mechanisms
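As an illustration of the session-ID fallback (header name and helper are assumptions; the Redis checkpointer itself is not shown):

```python
# Hypothetical helper: reuse the widget-supplied session ID when present, otherwise
# mint a new one so conversation history can still be keyed in Redis.
import uuid


def resolve_session_id(headers: dict[str, str]) -> str:
    return headers.get("x-session-id") or str(uuid.uuid4())
```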

Widget Improvements

  • Shadow DOM portal rendering for popups
  • Graceful shutdown for document processing
  • Service initialization retry logic

Summary

R&D Activities

  • Cross-environment tenant data management (continuation of July's TenantAwareNeo4JStorage work)
  • RAG retrieval parameter optimization (systematic experimentation with chunk counts, token budgets, thresholds)
  • Dynamic RAG mode override system for retrieval strategy experimentation

Development Work

  • LightRAG event loop stabilization (nest-asyncio integration)
  • Per-client configuration system (Supabase + React Context)
  • Observability framework (Langfuse integration)
  • Redis conversation history
  • Widget improvements

Next Work (September)

  1. MongoDB vector storage for LightRAG - replacing container-embedded file storage to resolve performance bottleneck
  2. Dataset evaluation framework for prompt effectiveness
  3. Begin systematic metrics collection
  4. Prepare for production deployment (October)