Overview
Total Traces: 87
Issues Found: 22
Clean Runs: 65
Pass Rate: 74.7%
Worst Latency: 121s
[Charts: Issue Distribution; Latency Distribution (seconds)]
Issue Breakdown
P0 4 occurrences
Raw JSON Leak — node_ids as output
The LLM's {"node_ids": [...]} selection response from policy_graphrag.py:860 leaks through as the final answer. The response_planner echoes the retrieval agent's intermediate output instead of consuming it.
  • "What about PCP visits?" (multi-turn)
  • "And for Basic Option?" (multi-turn)
  • "¿Cuál es el deducible...?" (Spanish)
  • "Rx copays under Standard Option?"
P1 5 occurrences
Bad File Descriptor — pipeline crash
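SQLite / PageIndex file handles fail under parallel execution (threads=2). The pipeline returns [Errno 9] Bad file descriptor and produces no answer. Errno 9 normally signals a descriptor that was closed or invalidated while still in use, which points to handles shared across threads rather than to running out of descriptors.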
  • "Escalate this claim to SIU"
  • "Standard Option deductible"
  • "Basic specialist visit costs"
  • "Standard inpatient copay"
  • "Basic OOP max"
P2 9 occurrences
Slow Queries — over 60 seconds
Multi-hop GraphRAG traversal with LLM node selection is slow on complex or multi-section queries. Worst case: 121s for Rx copays.
  • Rx copays: 121s
  • Acupuncture coverage: 87s
  • OOP max Basic: 74s
  • Spanish deductible: 58s (just under the 60s threshold)
  • 4 more at 60-69s
P3 2 failures
Spanish Language Retrieval
GraphRAG node titles are in English. Spanish queries get raw JSON or insufficient answers because the embedding/title matching fails cross-lingually.
P4 2 expected gaps
Missing Data Sources
Process questions and claim-specific queries return "unable to" because only the policy brochure is indexed. Not a bug — expected for the current policy-only index.
Affected Queries
Query | Suite | Issue | Latency | Status
"What about PCP visits?" | Multi-Turn | Raw JSON leak | 60s | Fail
"And for Basic Option?" | Multi-Turn | Raw JSON leak | 43s | Fail
"¿Cuál es el deducible para la Opción Estándar?" | Router | Raw JSON + Spanish | 58s | Fail
"Rx copays under Standard Option?" | Accuracy | Raw JSON leak | 49s | Fail
"Escalate this claim to SIU" | Action | Bad file descriptor | 3s | Crash
"Standard Option deductible" | Accuracy | Bad file descriptor | 2s | Crash
"Basic specialist visit costs" | Accuracy | Bad file descriptor | 19s | Crash
"Standard inpatient copay" | Accuracy | Bad file descriptor | 7s | Crash
"Basic OOP max" | Hallucination | Bad file descriptor | 20s | Crash
"Rx copays Standard Option" | Accuracy | Slow (121s) | 121s | Slow
"Acupuncture under Basic Option" | Hallucination | Slow (87s) | 87s | Slow
"OOP max Basic Option" | Accuracy | Slow (74s) | 74s | Slow
"Is my claim covered?" | Router | Slow (69s) | 69s | Slow
"Fraud indicators for claim" | Router | Insufficient (no claim docs) | 9s | Expected
"Documents needed for claim" | Router | Insufficient (no process docs) | 10s | Expected
Improvement Plan
Now — Quick Wins
P0: Fix Raw JSON Leak
Effort: Small (4 failures)
  • Add output guard between retrieval_agent and response_planner
  • If output matches {"node_ids": [...]}, re-run retrieval with selected nodes
  • Location: rag_team.py
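The guard described above can be sketched as follows. This is a minimal illustration, not the code in rag_team.py: `extract_node_ids`, `guard_retrieval_output`, and the `fetch_content` callable are hypothetical names, and the sketch assumes the leaked payload is a bare JSON object with a single `node_ids` key.

```python
import json
from typing import Callable, Optional


def extract_node_ids(text: str) -> Optional[list]:
    """Return the id list if `text` is a bare {"node_ids": [...]} payload, else None."""
    try:
        payload = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        # Not JSON (or not a string at all): treat it as a normal answer.
        return None
    if (
        isinstance(payload, dict)
        and set(payload) == {"node_ids"}
        and isinstance(payload["node_ids"], list)
    ):
        return payload["node_ids"]
    return None


def guard_retrieval_output(text: str, fetch_content: Callable[[list], str]) -> str:
    """Pass normal text through; turn a leaked node selection into fetched content.

    `fetch_content` is a hypothetical stand-in for whatever step loads the
    selected nodes' text before the response planner runs.
    """
    node_ids = extract_node_ids(text)
    if node_ids is None:
        return text
    return fetch_content(node_ids)
```

A payload wrapped in extra prose or a Markdown fence would not match this strict check, so a real guard may need a looser pattern.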
P1: Fix Bad File Descriptor
Effort: Small (5 crashes)
  • Add mutex / connection pool around DB access
  • Or: reduce eval threads to 1 for shared-DB experiments
  • Location: run_claims_knowledge_rag
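Both options above can be sketched with the standard library alone. The helper names (`get_connection`, `run_serialized`) are hypothetical, and the sketch assumes the database is reached through Python's `sqlite3` module; the actual SqliteDb and PageIndex wrappers may need a different hook.

```python
import sqlite3
import threading

# Option A: one connection per thread, so no handle is shared across threads.
_local = threading.local()


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return a connection owned by the calling thread, opening it on first use."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    if db_path not in conns:
        conns[db_path] = sqlite3.connect(db_path)
    return conns[db_path]


# Option B: keep the shared handle but serialize every access to it.
_db_lock = threading.Lock()


def run_serialized(fn, *args, **kwargs):
    """Run `fn` while holding the global DB lock (simple, but removes parallelism)."""
    with _db_lock:
        return fn(*args, **kwargs)
```

Option A keeps the eval parallel; Option B is the smaller change and is roughly equivalent to running the DB-bound portion with threads=1.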
Next — Medium Effort
P2: Reduce Latency for Complex Queries
Effort: Medium (9 slow queries)
  • Cache GraphRAG frontier results per document
  • Pre-compute expert routing hints for common patterns
  • Reduce max_nodes in _llm_pick_nodes
  • Add query-level timeout with keyword search fallback
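The last bullet, a query-level timeout with keyword fallback, might look like the sketch below. `retrieve_with_timeout`, `graph_search`, and `keyword_search` are hypothetical names for illustration; the 30-second default follows the figure given in the Issue Details table.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def retrieve_with_timeout(
    query: str,
    graph_search: Callable[[str], str],
    keyword_search: Callable[[str], str],
    timeout_s: float = 30.0,
) -> str:
    """Try the slow graph traversal first; fall back to keyword search on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(graph_search, query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # A running thread cannot be interrupted; cancel() only helps if the
        # task has not started yet. The traversal keeps running in the
        # background and its result is discarded.
        future.cancel()
        return keyword_search(query)
    finally:
        # Do not block the caller waiting for the abandoned traversal.
        pool.shutdown(wait=False)
```

Because the abandoned traversal still consumes LLM calls until it finishes, this bounds user-facing latency but not cost; a cooperative cancellation flag inside the traversal would be needed for that.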
P3: Spanish Language Support
Effort: Medium (2 failures)
  • Add query translation in router
  • Detect non-English → translate to English for retrieval
  • Translate answer back to source language
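The three steps above form a translate, retrieve, translate-back wrapper. The sketch below shows only that control flow; `detect_language`, `translate`, and `answer_in_english` are hypothetical callables supplied by the caller, not functions from any specific library or from rag_team.py.

```python
from typing import Callable


def answer_in_source_language(
    query: str,
    detect_language: Callable[[str], str],      # returns a code such as "en" or "es"
    translate: Callable[[str, str, str], str],  # (text, source_lang, target_lang)
    answer_in_english: Callable[[str], str],    # the existing English-only pipeline
) -> str:
    """Route non-English queries through English retrieval and translate back."""
    lang = detect_language(query)
    if lang == "en":
        # English queries skip both translation steps.
        return answer_in_english(query)
    english_query = translate(query, lang, "en")
    english_answer = answer_in_english(english_query)
    return translate(english_answer, "en", lang)
```

Keeping detection and translation as injected callables lets the router swap providers, or stub them in evals, without touching the retrieval code.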
Later — Large Effort
P4: Expand Document Coverage
Effort: Large (2 expected gaps)
  • Index process SOPs and claim documents
  • Not bugs — expected gaps for policy-only index
  • Required for process_question and claim_evidence eval suites
Issue Details
Priority | Issue | Root Cause | Fix | Location | Effort
P0 | Raw JSON Leak | LLM node selection response ({"node_ids":[...]}) from policy_graphrag.py:860 passes through retrieval_agent as final output. response_planner echoes it verbatim. | Add output validation between retrieval_agent and response_planner. If output matches the node_ids pattern, consume it as retrieval input and re-run the content fetch step. | rag_team.py | Small
P1 | Bad File Descriptor | SQLite DB and PageIndex file handles become invalid when multiple eval threads share the same SqliteDb instance and PageIndex connection concurrently. | Add threading.Lock around DB access in run_claims_knowledge_rag, or use per-thread DB connections. For evals: reduce threads to 1. | rag_team.py | Small
P2 | Slow Queries | Multi-hop GraphRAG traversal calls _llm_pick_nodes per level. Complex queries (Rx tiers, multi-section benefits) trigger 3-4 hops with LLM calls each. | Cache frontier per document; pre-compute routing hints for common patterns; reduce max_nodes; add 30s timeout with keyword fallback. | policy_graphrag.py | Medium
P3 | Spanish Retrieval | GraphRAG node titles and content are English-only. Spanish queries fail embedding/title match, causing the retrieval agent to return raw JSON or empty results. | Add language detection + translate-to-English step in query_router before retrieval. Translate final answer back to detected language. | rag_team.py | Medium
P4 | Data Gaps | Only BCBS-2026 policy brochure is indexed. Process SOPs and claim artifacts are not ingested. | Ingest process documentation and claim artifacts into PageIndex. Expected gap, not a pipeline bug. | data/ | Large
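For the P2 fix, caching the frontier per document could be as simple as memoizing node expansion on a (document, node) key. The sketch below is an assumption about how that might be shaped: `FrontierCache` and the `expand` callable are hypothetical and do not reflect the actual structure of policy_graphrag.py.

```python
from typing import Callable, Dict, List, Tuple


class FrontierCache:
    """Memoize frontier expansion so repeated hops over one document are free."""

    def __init__(self, expand: Callable[[str, str], List[str]]):
        # `expand(doc_id, node_id)` is the expensive step being cached.
        self._expand = expand
        self._store: Dict[Tuple[str, str], List[str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, doc_id: str, node_id: str) -> List[str]:
        key = (doc_id, node_id)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._expand(doc_id, node_id)
        return self._store[key]
```

This only helps when the frontier depends on the document structure alone; if `_llm_pick_nodes` conditions on the query text, the cache key would also need to include the query or a normalized form of it.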