Discovery — Uncovering 22 Issues

Overview

Total Traces

Issues Found: 22

Clean Runs

Pass Rate: 74.7%

Worst Latency: 121s

[Charts: Issue Distribution; Latency Distribution (seconds)]

Issue Breakdown

Raw JSON Leak — node_ids as output

The LLM's {"node_ids": [...]} selection response from policy_graphrag.py:860 leaks through as the final answer. The response_planner echoes the retrieval agent's intermediate output instead of consuming it.

"What about PCP visits?" (multi-turn)
"And for Basic Option?" (multi-turn)
"¿Cuál es el deducible...?" (Spanish)
"Rx copays under Standard Option?"

Bad File Descriptor — pipeline crash

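SQLite / PageIndex file handles fail under parallel execution (threads=2). The pipeline returns [Errno 9] Bad file descriptor with no answer. Errno 9 (EBADF) means a descriptor was used after it was closed or invalidated, which points to threads sharing one handle rather than to running out of descriptors (that would surface as [Errno 24] Too many open files).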

"Escalate this claim to SIU"
"Standard Option deductible"
"Basic specialist visit costs"
"Standard inpatient copay"
"Basic OOP max"

Slow Queries — over 60 seconds

Multi-hop GraphRAG traversal with LLM node selection is slow on complex or multi-section queries. Worst case: 121s for Rx copays.

Rx copays: 121s
Acupuncture coverage: 87s
OOP max Basic: 74s
Spanish deductible: 58s (just under the 60-second threshold)
4 more at 60-69s

Spanish Language Retrieval

GraphRAG node titles are in English. Spanish queries get raw JSON or insufficient answers because the embedding/title matching fails cross-lingually.

Missing Data Sources

Process questions and claim-specific queries return "unable to" because only the policy brochure is indexed. Not a bug — expected for the current policy-only index.

Affected Queries

Query	Suite	Issue	Latency	Status
"What about PCP visits?"	Multi-Turn	Raw JSON leak	60s	Fail
"And for Basic Option?"	Multi-Turn	Raw JSON leak	43s	Fail
"¿Cuál es el deducible para la Opción Estándar?"	Router	Raw JSON + Spanish	58s	Fail
"Rx copays under Standard Option?"	Accuracy	Raw JSON leak	49s	Fail
"Escalate this claim to SIU"	Action	Bad file descriptor	3s	Crash
"Standard Option deductible"	Accuracy	Bad file descriptor	2s	Crash
"Basic specialist visit costs"	Accuracy	Bad file descriptor	19s	Crash
"Standard inpatient copay"	Accuracy	Bad file descriptor	7s	Crash
"Basic OOP max"	Hallucination	Bad file descriptor	20s	Crash
"Rx copays Standard Option"	Accuracy	Slow (121s)	121s	Slow
"Acupuncture under Basic Option"	Hallucination	Slow (87s)	87s	Slow
"OOP max Basic Option"	Accuracy	Slow (74s)	74s	Slow
"Is my claim covered?"	Router	Slow (69s)	69s	Slow
"Fraud indicators for claim"	Router	Insufficient — no claim docs	9s	Expected
"Documents needed for claim"	Router	Insufficient — no process docs	10s	Expected

Improvement Plan

Now — Quick Wins

P0: Fix Raw JSON Leak

Effort: Small | 4 failures

Add output guard between retrieval_agent and response_planner
If output matches {"node_ids": [...]}, re-run retrieval with selected nodes
Location: rag_team.py
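
A minimal sketch of such a guard, assuming the leaked output is a bare JSON object with a single node_ids key. The names is_node_selection, guard_retrieval_output, and fetch_content are hypothetical and do not come from rag_team.py; fetch_content stands in for whatever step turns selected node ids into content.

```python
import json


def is_node_selection(text) -> bool:
    """Return True if text is a bare JSON object of the form {"node_ids": [...]}."""
    try:
        payload = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or not a string: treat as a normal answer.
        return False
    return (
        isinstance(payload, dict)
        and set(payload) == {"node_ids"}
        and isinstance(payload["node_ids"], list)
    )


def guard_retrieval_output(text, fetch_content):
    """Pass normal answers through; route leaked node selections to fetch_content.

    fetch_content is a hypothetical callable: list of node ids -> content text.
    """
    if is_node_selection(text):
        node_ids = json.loads(text)["node_ids"]
        return fetch_content(node_ids)
    return text
```

A selection wrapped in Markdown code fences or surrounded by prose would slip past this check, so the match may need loosening depending on what the LLM actually emits.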

P1: Fix Bad File Descriptor

Effort: Small | 5 crashes

Add mutex / connection pool around DB access
Or: reduce eval threads to 1 for shared-DB experiments
Location: run_claims_knowledge_rag in rag_team.py
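
A sketch of one option, per-thread connections, using only the standard library. ThreadLocalDb is a hypothetical helper, not the project's SqliteDb class, and it assumes the underlying store is a plain sqlite3 database file.

```python
import sqlite3
import threading


class ThreadLocalDb:
    """Hand each thread its own sqlite3 connection to the same database file."""

    def __init__(self, path: str):
        self._path = path
        self._local = threading.local()  # attributes are private to each thread

    def connection(self) -> sqlite3.Connection:
        """Return the calling thread's connection, opening it on first use."""
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._path)
            self._local.conn = conn
        return conn

    def close(self) -> None:
        """Close only the calling thread's connection, never another thread's."""
        conn = getattr(self._local, "conn", None)
        if conn is not None:
            conn.close()
            self._local.conn = None
```

Because no thread can close a handle that another thread is using, this removes the shared-handle race without serializing queries. A single threading.Lock around every DB call is simpler still, at the cost of running DB access one thread at a time.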

Next — Medium Effort

P2: Reduce Latency for Complex Queries

Effort: Medium | 9 slow queries

Cache GraphRAG frontier results per document
Pre-compute expert routing hints for common patterns
Reduce max_nodes in _llm_pick_nodes
Add query-level timeout with keyword search fallback
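
The timeout-with-fallback item could look roughly like the sketch below. slow_retrieve and keyword_search are hypothetical callables standing in for the GraphRAG traversal and a keyword search; the 30-second default follows the figure in the Issue Details table.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so each query does not pay thread start-up cost.
_executor = ThreadPoolExecutor(max_workers=4)


def answer_with_timeout(query, slow_retrieve, keyword_search, timeout_s=30.0):
    """Run slow_retrieve(query); if it exceeds timeout_s, use keyword_search(query)."""
    future = _executor.submit(slow_retrieve, query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running
        # thread cannot be interrupted and will finish in the background.
        future.cancel()
        return keyword_search(query)
```

Since the abandoned traversal keeps running until it completes, the timeout bounds user-facing latency but not LLM cost. Bounding cost as well would require the traversal itself to check a deadline between hops.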

P3: Spanish Language Support

Effort: Medium | 2 failures

Add query translation in router
Detect non-English → translate to English for retrieval
Translate answer back to source language
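
The three steps above can be sketched as a thin wrapper around the existing English-only pipeline. All three callables (detect_language, translate, answer_in_english) are hypothetical placeholders; the report does not say which detection or translation service would be used.

```python
from typing import Callable


def answer_multilingual(
    query: str,
    detect_language: Callable[[str], str],      # returns a code such as "en" or "es"
    translate: Callable[[str, str, str], str],  # (text, source_lang, target_lang) -> text
    answer_in_english: Callable[[str], str],    # the existing English-only pipeline
) -> str:
    """Translate non-English queries to English, answer, then translate back."""
    lang = detect_language(query)
    if lang == "en":
        # English queries skip both translation steps.
        return answer_in_english(query)
    english_query = translate(query, lang, "en")
    english_answer = answer_in_english(english_query)
    return translate(english_answer, "en", lang)
```

Keeping retrieval in English means the GraphRAG index and node titles stay unchanged; only the router gains two translation calls per non-English query.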

Later — Large Effort

P4: Expand Document Coverage

Effort: Large | 2 expected gaps

Index process SOPs and claim documents
Not bugs — expected gaps for policy-only index
Required for process_question and claim_evidence eval suites

Issue Details

Priority	Issue	Root Cause	Fix	Location	Effort
P0	Raw JSON Leak	LLM node selection response (`{"node_ids":[...]}`) from `policy_graphrag.py:860` passes through retrieval_agent as final output. response_planner echoes it verbatim.	Add output validation between retrieval_agent → response_planner. If output matches node_ids pattern, consume it as retrieval input and re-run the content fetch step.	`rag_team.py`	Small
P1	Bad File Descriptor	SQLite DB and PageIndex file handles become invalid (Errno 9 is EBADF, a closed or invalid descriptor, not descriptor exhaustion) when multiple eval threads share the same `SqliteDb` instance and PageIndex connection concurrently.	Add `threading.Lock` around DB access in `run_claims_knowledge_rag`, or use per-thread DB connections. For evals: reduce threads to 1.	`rag_team.py`	Small
P2	Slow Queries	Multi-hop GraphRAG traversal calls `_llm_pick_nodes` per level. Complex queries (Rx tiers, multi-section benefits) trigger 3-4 hops with LLM calls each.	Cache frontier per document; pre-compute routing hints for common patterns; reduce `max_nodes`; add 30s timeout with keyword fallback.	`policy_graphrag.py`	Medium
P3	Spanish Retrieval	GraphRAG node titles and content are English-only. Spanish queries fail embedding/title match, causing the retrieval agent to return raw JSON or empty results.	Add language detection + translate-to-English step in query_router before retrieval. Translate final answer back to detected language.	`rag_team.py`	Medium
P4	Data Gaps	Only BCBS-2026 policy brochure is indexed. Process SOPs and claim artifacts are not ingested.	Ingest process documentation and claim artifacts into PageIndex. Expected gap — not a pipeline bug.	`data/`	Large