Chapter 1
Discovery — Uncovering 22 Issues
LangWatch experiments — 7 suites, 87 traces, 52 test cases — 2026-03-18
Overview
- Total Traces: 87
- Issues Found: 22
- Clean Runs: 65
- Pass Rate: 74.7%
- Worst Latency: 121s
[Chart: Issue Distribution]
[Chart: Latency Distribution (seconds)]
Issue Breakdown
P0
4 occurrences
Raw JSON Leak — node_ids as output
The LLM's {"node_ids": [...]} selection response from policy_graphrag.py:860 leaks through as the final answer. The response_planner echoes the retrieval agent's intermediate output instead of consuming it.
- "What about PCP visits?" (multi-turn)
- "And for Basic Option?" (multi-turn)
- "¿Cuál es el deducible...?" (Spanish)
- "Rx copays under Standard Option?"
P1
5 occurrences
Bad File Descriptor — pipeline crash
SQLite / PageIndex file handles are exhausted under parallel execution (threads=2). The pipeline returns [Errno 9] Bad file descriptor with no answer.
- "Escalate this claim to SIU"
- "Standard Option deductible"
- "Basic specialist visit costs"
- "Standard inpatient copay"
- "Basic OOP max"
P2
9 occurrences
Slow Queries — over 60 seconds
Multi-hop GraphRAG traversal with LLM node selection is slow on complex or multi-section queries. Worst case: 121s for Rx copays.
- Rx copays: 121s
- Acupuncture coverage: 87s
- OOP max Basic: 74s
- Spanish deductible: 58s (just under the 60-second threshold)
- 4 more at 60-69s
P3
2 failures
Spanish Language Retrieval
GraphRAG node titles are in English. Spanish queries get raw JSON or insufficient answers because the embedding/title matching fails cross-lingually.
P4
2 expected gaps
Missing Data Sources
Process questions and claim-specific queries return "unable to" because only the policy brochure is indexed. Not a bug — expected for the current policy-only index.
Affected Queries
| Query | Suite | Issue | Latency | Status |
|---|---|---|---|---|
| "What about PCP visits?" | Multi-Turn | Raw JSON leak | 60s | Fail |
| "And for Basic Option?" | Multi-Turn | Raw JSON leak | 43s | Fail |
| "¿Cuál es el deducible para la Opción Estándar?" | Router | Raw JSON + Spanish | 58s | Fail |
| "Rx copays under Standard Option?" | Accuracy | Raw JSON leak | 49s | Fail |
| "Escalate this claim to SIU" | Action | Bad file descriptor | 3s | Crash |
| "Standard Option deductible" | Accuracy | Bad file descriptor | 2s | Crash |
| "Basic specialist visit costs" | Accuracy | Bad file descriptor | 19s | Crash |
| "Standard inpatient copay" | Accuracy | Bad file descriptor | 7s | Crash |
| "Basic OOP max" | Hallucination | Bad file descriptor | 20s | Crash |
| "Rx copays Standard Option" | Accuracy | Slow (121s) | 121s | Slow |
| "Acupuncture under Basic Option" | Hallucination | Slow (87s) | 87s | Slow |
| "OOP max Basic Option" | Accuracy | Slow (74s) | 74s | Slow |
| "Is my claim covered?" | Router | Slow (69s) | 69s | Slow |
| "Fraud indicators for claim" | Router | Insufficient — no claim docs | 9s | Expected |
| "Documents needed for claim" | Router | Insufficient — no process docs | 10s | Expected |
Improvement Plan
Now — Quick Wins
P0: Fix Raw JSON Leak
- Add output guard between retrieval_agent and response_planner
- If output matches {"node_ids": [...]}, re-run retrieval with selected nodes
- Location: rag_team.py
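The output guard described above could look roughly like the following. This is a minimal sketch, not the code in rag_team.py: the names `extract_node_ids`, `guard_retrieval_output`, and the `fetch_content` callback are hypothetical, and it assumes the retrieval agent's output reaches the guard as a plain string.

```python
import json
from typing import Callable, Optional


def extract_node_ids(text: str) -> Optional[list]:
    """Return the id list if `text` is a bare {"node_ids": [...]} payload, else None."""
    try:
        payload = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        # Not JSON (a normal prose answer) or not a string at all.
        return None
    if (
        isinstance(payload, dict)
        and set(payload) == {"node_ids"}
        and isinstance(payload["node_ids"], list)
    ):
        return payload["node_ids"]
    return None


def guard_retrieval_output(output: str, fetch_content: Callable[[list], str]) -> str:
    """Pass real answers through; turn a leaked node selection into fetched content.

    `fetch_content` is a hypothetical stand-in for the content fetch step
    that should have consumed the node selection in the first place.
    """
    node_ids = extract_node_ids(output)
    if node_ids is None:
        return output
    return fetch_content(node_ids)
```

The check is deliberately strict (exactly one key, list value), so a legitimate answer that happens to contain JSON is not swallowed by the guard.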
P1: Fix Bad File Descriptor
- Add mutex / connection pool around DB access
- Or: reduce eval threads to 1 for shared-DB experiments
- Location: run_claims_knowledge_rag
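The mutex option can be sketched with the standard library alone. The `LockedDb` wrapper below is hypothetical and is not the project's SqliteDb class. It assumes a single shared `sqlite3` connection and shows only the locking pattern.

```python
import sqlite3
import threading


class LockedDb:
    """Serialize all access to one shared SQLite connection (illustrative only)."""

    def __init__(self, path: str) -> None:
        # check_same_thread=False lets worker threads reuse the connection;
        # the lock below is what makes that reuse safe.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()

    def execute(self, sql: str, params: tuple = ()) -> list:
        # Only one thread at a time may touch the connection.
        with self._lock:
            rows = self._conn.execute(sql, params).fetchall()
            self._conn.commit()
            return rows

    def close(self) -> None:
        with self._lock:
            self._conn.close()
```

A single lock trades throughput for safety. If the eval suites need real parallelism, per-thread connections are the alternative named in the plan.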
Next — Medium Effort
P2: Reduce Latency for Complex Queries
- Cache GraphRAG frontier results per document
- Pre-compute expert routing hints for common patterns
- Reduce max_nodes in _llm_pick_nodes
- Add query-level timeout with keyword search fallback
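The query-level timeout with a keyword fallback might be wired up as follows. This is a sketch under assumptions: `answer_with_timeout`, `graph_retrieve`, and `keyword_search` are hypothetical names, and the 30-second default is taken from the fix listed in the Issue Details table.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def answer_with_timeout(
    query: str,
    graph_retrieve: Callable[[str], str],
    keyword_search: Callable[[str], str],
    timeout_s: float = 30.0,
) -> str:
    """Run the slow GraphRAG path, falling back to keyword search on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(graph_retrieve, query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The traversal took too long; answer from the cheaper path instead.
        return keyword_search(query)
    finally:
        # Do not block on the abandoned traversal; it finishes in the background.
        pool.shutdown(wait=False)
```

Note that the timed-out traversal is not cancelled, only abandoned, so its LLM calls still run to completion. Truly stopping it would need cooperative cancellation inside the traversal itself.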
P3: Spanish Language Support
- Add query translation in router
- Detect non-English → translate to English for retrieval
- Translate answer back to source language
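The three steps above can be sketched as a thin wrapper around retrieval. Everything here is hypothetical: `detect_language`, `translate`, and `retrieve_answer` are injected callables standing in for whatever detection and translation services the router would actually use, and the language codes are assumed to be short strings such as "en" and "es".

```python
from typing import Callable


def answer_multilingual(
    query: str,
    detect_language: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    retrieve_answer: Callable[[str], str],
) -> str:
    """Retrieve in English, then answer in the language of the query.

    `translate(text, source_lang, target_lang)` is an assumed signature.
    """
    lang = detect_language(query)
    if lang == "en":
        # English queries skip both translation steps.
        return retrieve_answer(query)
    english_query = translate(query, lang, "en")
    english_answer = retrieve_answer(english_query)
    return translate(english_answer, "en", lang)
```

Keeping retrieval English-only means the English node titles and content in the index do not need to change.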
Later — Large Effort
P4: Expand Document Coverage
- Index process SOPs and claim documents
- Not bugs — expected gaps for policy-only index
- Required for process_question and claim_evidence eval suites
Issue Details
| Priority | Issue | Root Cause | Fix | Location | Effort |
|---|---|---|---|---|---|
| P0 | Raw JSON Leak | LLM node selection response ({"node_ids":[...]}) from policy_graphrag.py:860 passes through retrieval_agent as final output. response_planner echoes it verbatim. | Add output validation between retrieval_agent → response_planner. If output matches node_ids pattern, consume it as retrieval input and re-run the content fetch step. | rag_team.py | Small |
| P1 | Bad File Descriptor | SQLite DB and PageIndex file handles exhausted when multiple eval threads share the same SqliteDb instance and PageIndex connection concurrently. | Add threading.Lock around DB access in run_claims_knowledge_rag, or use per-thread DB connections. For evals: reduce threads to 1. | rag_team.py | Small |
| P2 | Slow Queries | Multi-hop GraphRAG traversal calls _llm_pick_nodes per level. Complex queries (Rx tiers, multi-section benefits) trigger 3-4 hops with LLM calls each. | Cache frontier per document; pre-compute routing hints for common patterns; reduce max_nodes; add 30s timeout with keyword fallback. | policy_graphrag.py | Medium |
| P3 | Spanish Retrieval | GraphRAG node titles and content are English-only. Spanish queries fail embedding/title match, causing the retrieval agent to return raw JSON or empty results. | Add language detection + translate-to-English step in query_router before retrieval. Translate final answer back to detected language. | rag_team.py | Medium |
| P4 | Data Gaps | Only BCBS-2026 policy brochure is indexed. Process SOPs and claim artifacts are not ingested. | Ingest process documentation and claim artifacts into PageIndex. Expected gap, not a pipeline bug. | data/ | Large |