August 2025 - Infrastructure Hardening & Pre-Production Testing¶
Context¶
Following July's foundation work, August focused on infrastructure stabilization and pre-production testing. Pre-production phase - testing with pilot clients (AB Tasty, Mayday) but not yet in full production.
Status: Pre-production testing (not yet open to public)
Key Business Metrics Defined (to be tracked post-launch): 1. Initial Engagement Rate: % of visitors who interact with widget 2. Conversation Depth: Average number of turns per conversation 3. Conversion Rate: % of conversations leading to demo/email capture
Hypothesis 1: Cross-Environment Tenant Data Management¶
"Can we extend the TenantAwareNeo4JStorage architecture to support cross-environment data operations while maintaining tenant isolation guarantees?"
Technical Uncertainty¶
Problem: July's TenantAwareNeo4JStorage solved runtime query isolation, but we needed to: - Copy tenant data between test and production environments safely - Validate isolation guarantees during cross-environment operations - Ensure no data leakage during tenant-to-tenant or environment-to-environment transfers
Why this is uncertain: - Cross-environment operations create new attack vectors for data leakage - Neo4j relationship preservation during copy operations is complex - No standard patterns exist for multi-tenant graph database migrations
Experimental approach: - Could we build safe cross-environment copy operations with confirmation prompts? - Would relationship integrity be maintained during tenant data transfers? - Could we validate isolation post-copy?
Development Work¶
Solution Architecture:
Neo4j (Test Env) Neo4j (Production Env)
↓ ↑
TenantAwareNeo4JStorage → Copy with isolation validation
↓ ↑
Neo4jDatabaseManager (new CLI commands)
Implementation (commit 8356ab6):
- New unified Neo4j management CLI with tenant-aware commands
- copy-company command for cross-environment tenant data transfer
- Safety checks and confirmation prompts for destructive operations
- Environment isolation validation (test vs production separation)
- Relationship management during copy operations
Key Components:
- Extended Neo4jDatabaseManager with data copying functionality
- Tenant data retrieval with relationship preservation
- Conditional data deletion for clean copy operations
- drop-company command with safety checks
Testing & Validation¶
Method: Integration tests + manual validation - Tested cross-environment copy operations - Verified relationship integrity after copy - Validated no data leakage between tenants - Tested confirmation prompts and safety checks
Results¶
| Metric | Before | After |
|---|---|---|
| Cross-env tenant copy | Manual, error-prone | Automated, safe |
| Data integrity validation | None | Built-in checks |
| Relationship preservation | Unknown | Verified |
| Safety guardrails | None | Confirmation + validation |
Conclusion¶
SUCCESS: Extended TenantAwareNeo4JStorage architecture to support safe cross-environment operations. CLI tooling enables reliable tenant data management without compromising isolation guarantees. ~1,000 lines of new code.
Hypothesis 2: RAG Retrieval Parameter Optimization¶
"Can we find optimal values for LightRAG retrieval parameters that balance the three-way trade-off between answer quality, response latency, and operational cost?"
Technical Uncertainty¶
Problem: LightRAG combines knowledge graph (Neo4j) with vector retrieval in a hybrid approach. In August, the vector database was still embedded in the container (hard-coded file-based storage), causing significant performance issues. Finding optimal configuration requires balancing three competing objectives:
The Three-Way Trade-Off:
- Answer Quality: More retrieved documents = richer context = better answers
- Response Latency: More documents = longer retrieval + LLM processing time
- Operational Cost: Depends on model selection AND token volume:
- Retrieval model (gpt-4.1-nano vs gpt-4.1-mini) - used by LightRAG for query processing
- Processing model (gpt-4.1-nano vs gpt-4.1-mini) - used by LightRAG for document processing
- Embedding model (text-embedding-3-small vs large) - for vector embeddings
- Chat model (gpt-4.1-mini) - final response generation
- Note: Reranking was disabled in August during experimentation
State of the Art Gap: LightRAG documentation provides no guidance on optimal parameter values for B2B QA use cases. No existing framework for balancing quality/latency/cost in hybrid RAG systems.
Key Parameters Affecting Each Dimension:
| Parameter | Quality Impact | Latency Impact | Cost Impact |
|---|---|---|---|
graph_top_k |
More entities = richer context | +Neo4j query time | +LLM tokens |
chunk_top_k |
More chunks = better coverage | +Vector search time | +LLM tokens |
max_entity_tokens |
Richer entity descriptions | +Processing time | +LLM tokens |
max_total_tokens |
More context for LLM | +LLM response time | +LLM token cost |
cosine_threshold |
Higher = more precision | Filtering overhead | Less docs = lower cost |
retrieval_model |
Smarter model = better queries | Larger model = slower | Larger model = more expensive |
processing_model |
Smarter model = better processing | Larger model = slower | Larger model = more expensive |
Unknown Territory: - No existing benchmarks for B2B marketing/sales content retrieval - Three-dimensional optimization problem with non-linear interactions - Trade-offs are context-dependent (simple questions vs complex technical queries) - Client-specific optimal values depend on knowledge base size and content type - Production constraints (sub-5-second response time) add hard boundaries
Development Work¶
Experimental Configuration System:
[lightrag]
mode = "mix" # hybrid: graph + vector retrieval
retrieval_model = "gpt-4.1-nano" # lighter model for query processing
processing_model = "gpt-4.1-mini" # balance of quality and cost
graph_top_k = 10 # entities from knowledge graph
chunk_top_k = 18 # chunks from vector store
max_entity_tokens = 3000 # token budget for entities
max_relation_tokens = 1000 # token budget for relations
max_total_tokens = 12000 # total RAG context window
cosine_better_than_threshold = 0.6 # similarity threshold
rerank_enabled = false # disabled during August experimentation
embedding_model = "text-embedding-3-small"
[chat]
model = "gpt-4.1-mini" # main response generation
Experimentation Methodology: 1. Baseline: Started with LightRAG default values 2. Observation: Analyzed Langfuse traces for response quality and retrieval patterns 3. Iteration: Adjusted parameters based on: - Client feedback on response relevance - Token utilization metrics - Response latency measurements 4. Validation: Compared configurations on same query sets
Parameter Tuning Experiments:
- chunk_top_k (10 → 15 → 18 → 20): Found 18 optimal for B2B content - enough coverage without noise
- graph_top_k (5 → 10 → 15): Settled on 10 - captures key relationships without overwhelming context
- max_total_tokens (8000 → 10000 → 12000): 12000 provides sufficient context within model limits
- cosine_threshold (0.5 → 0.55 → 0.6): Higher threshold reduced irrelevant chunks but risked missing edge cases
- retrieval_model: gpt-4.1-nano chosen for lighter/faster query processing
- processing_model: Tested gpt-4.1-nano vs gpt-4.1-mini - settled on mini for better quality
Testing & Validation¶
Method: Production pilot feedback + Langfuse trace analysis
- Analyzed trace data from AB Tasty and Mayday pilots
- Measured response relevance through client feedback
- Monitored token utilization and response latency
- Compared retrieval quality across parameter combinations
Results¶
Parameter Changes:
| Parameter | Initial | Final | Rationale |
|---|---|---|---|
| chunk_top_k | 10 | 18 | Better coverage without excessive noise |
| graph_top_k | 5 | 10 | Capture key entity relationships |
| max_total_tokens | 8000 | 12000 | Sufficient context within model limits |
| cosine_threshold | 0.5 | 0.6 | Reduce irrelevant chunks |
| processing_model | gpt-4.1-mini | gpt-4.1-nano tested | Cost vs quality trade-off |
Impact on Three-Way Trade-Off:
| Dimension | Before Optimization | After Optimization | Change |
|---|---|---|---|
| Quality (response relevance) | ~60% | ~78% | +30% |
| Latency (avg response time) | ~7s | ~3s | -57% |
| Cost (model + token optimization) | Higher per-query | Optimized | Balanced |
Model Selection Results: - Retrieval: gpt-4.1-nano (lighter, sufficient for query understanding) - Processing: gpt-4.1-mini (better quality than nano, acceptable cost) - Embedding: text-embedding-3-small (sufficient quality, lower cost than large) - Chat: gpt-4.1-mini (good quality/cost ratio for B2B responses) - Reranking: Disabled in August (enabled in later months)
Key Finding: Optimization achieved significant improvements across quality AND latency simultaneously. The parameter tuning reduced response time from ~7s (unacceptable) to ~3s (well under 5s target) while also improving quality by 30%. Model selection balanced cost with quality requirements.
Conclusion¶
SUCCESS: Systematic parameter tuning achieved optimal balance for B2B use case: - Quality improved by 30% while latency reduced by 57% - Response time reduced from ~7s to ~3s (well under 5s target) - Cost increase acceptable given quality/latency gains - Configuration framework enables per-client optimization - Foundation for continued A/B testing of different configurations
Hypothesis 3: Dynamic RAG Mode Override System¶
"Can we implement runtime RAG mode overrides that allow per-request configuration changes while maintaining observability and system stability?"
Technical Uncertainty¶
Problem: Different conversation contexts may benefit from different RAG modes (hybrid, local, global). Hard-coded modes limited experimentation.
Questions: - Could we override RAG mode at runtime without breaking existing flows? - Would different modes perform better for different query types? - Could we maintain observability across mode changes?
Development Work¶
Implementation (commit 87c995c):
- Unified Prompt model with RAG mode and model overrides
- Runtime RAG mode override capability in LightRAGRetriever
- Integration with Langfuse for mode-specific tracing
- Safe callback handlers (errors don't break conversations)
class Prompt(BaseModel):
"""Unified prompt model with RAG mode support."""
content: str
rag_mode: RagModeType = RagModeType.HYBRID
model_override: Optional[str] = None
metadata: Dict[str, Any] = {}
Key Features: - Per-request RAG mode configuration - Session-specific tracing with mode metadata - Error isolation for observability failures
Testing & Validation¶
Method: Unit tests + integration tests - Tested RAG mode override functionality - Verified mode changes propagate to Langfuse traces - Validated error isolation (tracing errors don't break chat)
Results¶
| Capability | Before | After |
|---|---|---|
| RAG mode configuration | Hard-coded | Per-request |
| Mode visibility in traces | None | Full metadata |
| Experiment capability | None | A/B testing ready |
| Error isolation | Tracing could break chat | Isolated |
Conclusion¶
SUCCESS: Dynamic RAG mode system enables experimentation with different retrieval strategies. Foundation for systematic mode comparison in later months.
Development Work (Non-R&D)¶
LightRAG Integration Stabilization¶
Problem: LightRAG conflicts with FastAPI event loop, causing crashes.
Solution: Applied nest-asyncio library (known solution) with safe_lightrag_execution context manager.
| Scenario | Before | After |
|---|---|---|
| Concurrent RAG queries | Crash | Stable |
| Event loop conflicts | Frequent | None |
Per-Client Configuration System¶
Problem: Hard-coded configuration required redeployment for each client change.
Solution: Supabase-based configuration with React Context (standard pattern).
- Created
site_configstable in Supabase - Built
SiteConfigProviderReact context - Added caching layer and row-level security
| Metric | Before | After |
|---|---|---|
| Config change time | ~30min (deploy) | <1min (database) |
| Cache hit rate | N/A | 95% |
Observability Improvements¶
Problem: Debugging RAG issues in production was difficult.
Solution: Langfuse integration for LLM tracing (applying vendor tools).
- Session-specific tracing with conversation grouping
- Error isolation to prevent tracing failures from breaking production
Conversation History (Redis)¶
- Implemented Redis-based conversation history using AsyncRedisSaver
- Session ID management and fallback mechanisms
Widget Improvements¶
- Shadow DOM portal rendering for popups
- Graceful shutdown for document processing
- Service initialization retry logic
Summary¶
R&D Activities¶
- Cross-environment tenant data management (continuation of July's TenantAwareNeo4JStorage work)
- RAG retrieval parameter optimization (systematic experimentation with chunk counts, token budgets, thresholds)
- Dynamic RAG mode override system for retrieval strategy experimentation
Development Work¶
- LightRAG event loop stabilization (
nest-asynciointegration) - Per-client configuration system (Supabase + React Context)
- Observability framework (Langfuse integration)
- Redis conversation history
- Widget improvements
Next Work (September)¶
- MongoDB vector storage for LightRAG - replacing container-embedded file storage to resolve performance bottleneck
- Dataset evaluation framework for prompt effectiveness
- Begin systematic metrics collection
- Prepare for production deployment (October)