Root Cause Research
Synthesized findings from PageIndex, Microsoft GraphRAG, and modern RAG techniques
Current Eval Results vs Projected
Current Architecture: What Works vs What's Missing
Already Built
- PageIndex hierarchical tree (GPT-powered structural index)
- Neo4j graph with PolicyDocument + PageIndexNode
- PolicyFact data model with full provenance
- PolicyEntity + PolicyRelationship extraction schemas
- Lazy enrichment queue (SQLite-backed, resilient)
- Hybrid retrieval: graph traversal + ChromaDB vector + RRF merge
- Community summarization
- Numeric signal detection (regex: $, %, copay, deductible)
- Layout-aware text resolver (PyMuPDF blocks)
- PageIndex tree search planner + node scoring
Missing / Broken
- Table extraction — PyMuPDF linearizes tables, destroying column associations
- PolicyFact generation — requires LangExtract (3+ hours, skipped)
- Cell-to-fact linking — "$250" not associated with "Standard deductible"
- Text2Cypher — no direct structured queries for dollar amounts
- Full-text index in Neo4j — no exact term matching for queries like "$250 copay"
- Table-aware chunking — tables split across chunks
- VLM extraction for Summary of Benefits pages only
Improvement Strategies (Ranked by ROI)
1. PyMuPDF Table Extraction + Table-to-Triplet Conversion
PyMuPDF ≥ 1.23 includes page.find_tables(), which detects tables on a page and exposes each one's cells as structured rows and columns. Your project already pins pymupdf ≥ 1.26.4. Extract tables during _read_pdf_page_layouts(), convert each cell to a (row_header, column_header, cell_value) triplet, and store the triplets as PolicyFact records.
Example: Table cell → ("Office Visit", "Basic Option Copay", "$30") → PolicyFact with plan_option=basic, linked to the PageIndex dental benefits node.
Why it's #1: Zero LLM cost, runs in seconds, your pymupdf version already supports it, and the PolicyFact + Neo4j infrastructure is already built. This alone could fix 60-70% of the accuracy gap.
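A minimal sketch of the triplet conversion. The core logic is pure Python; the PolicyFact construction and the exact _read_pdf_page_layouts() integration point are assumptions, and the PyMuPDF call that feeds it is shown only in comments:

```python
# Flatten an extracted table into (row_header, column_header, value) triplets.
# PyMuPDF would feed this via:
#   for table in page.find_tables().tables:
#       triplets = rows_to_triplets(table.extract())
def rows_to_triplets(rows):
    """rows: list of rows, each a list of cell strings (header row first)."""
    if len(rows) < 2:
        return []  # nothing to extract without a header plus data rows
    header = rows[0]
    triplets = []
    for row in rows[1:]:
        row_header = row[0]
        for col_header, cell in zip(header[1:], row[1:]):
            if cell:  # skip empty cells
                triplets.append((row_header, col_header, cell))
    return triplets
```

Each triplet then maps onto a PolicyFact (e.g. plan_option parsed from the column header) and links back to the enclosing PageIndex node.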
2. Targeted VLM Extraction on Summary of Benefits Pages Only
Instead of running LangExtract on the entire document (3+ hours), identify the 5-10 Summary of Benefits pages via PageIndex node titles, render as images with page.get_pixmap(), and send to Gemini Flash / Claude with a structured extraction prompt. Target only the high-value pages.
Why it works: Your PageIndex tree already tags these sections. The extraction prompt can use structured output (BenefitRow schema). 5-10 pages × Gemini Flash = ~30 seconds total. Feeds directly into the existing PolicyFact pipeline.
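The page-selection step needs no LLM at all. A sketch, where the (title, page_number) shape of the PageIndex node list is an assumption and the VLM call itself appears only as comments (vlm_extract and BenefitRow are hypothetical names):

```python
def summary_of_benefits_pages(nodes, keywords=("summary of benefits",)):
    """Pick page numbers whose PageIndex node title matches a keyword.
    nodes: iterable of (title, page_number) pairs."""
    return sorted({page for title, page in nodes
                   if any(k in title.lower() for k in keywords)})

# For each selected page (helper and schema names are assumptions):
#   pix = doc[page_no].get_pixmap(dpi=200)      # render page to an image
#   png = pix.tobytes("png")
#   rows = vlm_extract(png, schema=BenefitRow)  # structured-output prompt
```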
3. Text2Cypher for Structured Lookups
Add a retrieval mode where the LLM translates natural language into Cypher queries against your graph schema. "What is the deductible for Standard?" becomes MATCH (f:PolicyFact) WHERE f.fact_class='cost_share_rule' AND f.plan_option='standard' AND f.title CONTAINS 'deductible' RETURN f.content.
Requires: Strategy 1 or 2 first (structured data must exist in the graph). Then expose the graph schema to the router agent as a tool. Your query router already classifies intent — route policy_question queries with numeric keywords to Text2Cypher.
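For the common numeric lookups, the LLM need not generate free-form Cypher at all; a parameterized template (hypothetical, mirroring the query shape above) keeps the frequent path safe and predictable:

```python
def cypher_for_cost_lookup(plan_option, term):
    """Build a parameterized Cypher query for a simple cost-share lookup.
    Query parameters (not string interpolation) avoid Cypher injection."""
    query = (
        "MATCH (f:PolicyFact) "
        "WHERE f.fact_class = 'cost_share_rule' "
        "AND f.plan_option = $plan_option "
        "AND toLower(f.title) CONTAINS toLower($term) "
        "RETURN f.content"
    )
    return query, {"plan_option": plan_option, "term": term}
```

"What is the deductible for Standard?" becomes cypher_for_cost_lookup("standard", "deductible"), executed through the Neo4j driver; harder questions fall through to full LLM-generated Cypher.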
4. Neo4j Full-Text Index for Exact Term Matching
Create a Neo4j full-text index on PolicyFact content and normalized_text. Catches exact matches like "$250" or "20% coinsurance" that vector/semantic search may rank lower. Add as a third retrieval signal in your existing RRF merge.
One Cypher command: CREATE FULLTEXT INDEX policyFactContent FOR (f:PolicyFact) ON EACH [f.content, f.normalized_text, f.title]
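Folding the full-text hits into the existing merge is mechanical, because Reciprocal Rank Fusion accepts any number of ranked lists. A minimal RRF sketch (k=60 is the commonly used constant from the original RRF paper):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc.
    rankings: iterable of ranked document-ID lists, best hit first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

At the call site, rrf_merge([graph_hits, vector_hits, fulltext_hits]) is the whole change.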
5. Contextual Retrieval Enrichment (Anthropic Pattern)
Before embedding each PolicyFact, prepend a one-sentence context prefix such as: "This fact describes dental benefits under the Standard Option plan." This disambiguates facts that look nearly identical in embedding space.
Implementation: one LLM call per fact during ingestion; store the prefixed text as the vector-store document. Your PageIndex optimization plan already references this pattern in section 4.
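A template-only version of the prefixing step. In production the context sentence would come from the per-fact LLM call; the field names here are assumptions:

```python
def contextualize(fact_text, section_title, plan_option):
    """Prepend a one-sentence context prefix so near-duplicate facts
    separate in embedding space (Anthropic's contextual-retrieval pattern)."""
    prefix = (f"This fact describes {section_title} "
              f"under the {plan_option} Option plan.")
    return f"{prefix} {fact_text}"
```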
6. Microsoft GraphRAG DRIFT-Style Iterative Retrieval
DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) starts with community summaries for initial context, then iteratively refines by following entity links. Your optimization plan already describes this as "drift reasoning" mode.
Best for: Complex multi-hop questions like "Am I covered for an MRI if referred by an out-of-network provider?" that require combining benefit rules + provider rules + authorization requirements.
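The control flow reduces to a small loop. In this sketch, community_search, follow_links, and refine are stand-in callables (names are assumptions) for the global community-summary pass, the entity-link traversal, and the context-update step:

```python
def drift_search(question, community_search, follow_links, refine, max_hops=3):
    """DRIFT-style retrieval: start broad with community summaries,
    then iteratively follow entity links until nothing new turns up."""
    context = community_search(question)       # global pass
    for _ in range(max_hops):
        next_nodes = follow_links(context)     # local pass over linked entities
        if not next_nodes:
            break                              # converged: no new evidence
        context = refine(context, next_nodes)  # fold new evidence in
    return context
```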
7. Graph-of-Tables Schema
Represent tables as first-class graph structures: Table → Row → Cell nodes with Column Header schema. Enables Cypher like: MATCH (t:Table)-[:HAS_ROW]->(r)-[:HAS_CELL]->(c) WHERE c.column='Basic Copay' AND r.service='Office Visit' RETURN c.value
Maximum power but maximum effort. Consider this for Phase 2 if simpler approaches don't reach 9+/10 accuracy.
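A sketch of what the ingestion side would emit for this schema. The node and relationship shapes below are illustrative, not the project's actual schema:

```python
def table_to_graph(table_id, rows):
    """Expand an extracted table into Table/Row/Cell node dicts plus
    HAS_ROW / HAS_CELL relationship tuples for batch insertion."""
    header = rows[0]
    nodes = [{"label": "Table", "id": table_id}]
    rels = []
    for i, row in enumerate(rows[1:]):
        row_id = f"{table_id}:r{i}"
        nodes.append({"label": "Row", "id": row_id, "service": row[0]})
        rels.append((table_id, "HAS_ROW", row_id))
        for column, value in zip(header[1:], row[1:]):
            cell_id = f"{row_id}:{column}"
            nodes.append({"label": "Cell", "id": cell_id,
                          "column": column, "value": value})
            rels.append((row_id, "HAS_CELL", cell_id))
    return nodes, rels
```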
Implementation Priority Matrix
| Priority | Strategy | Impact | Complexity | LLM Cost | Time | Depends On |
|---|---|---|---|---|---|---|
| 1 | PyMuPDF table extraction + triplets | HIGH | Low | $0 | 2-4 hrs | None |
| 2 | VLM extraction on SoB pages only | HIGH | Low-Med | ~$0.02/doc | 3-5 hrs | PageIndex titles |
| 3 | Neo4j full-text index | MED | Low | $0 | 1 hr | #1 or #2 |
| 4 | Text2Cypher structured lookups | HIGH | Med | Per query | 4-6 hrs | #1 or #2 |
| 5 | Contextual retrieval enrichment | MED | Low | Per fact | 2 hrs | #1 or #2 |
| 6 | DRIFT iterative retrieval | MED | Med | Per query | 6-8 hrs | Community summaries |
| 7 | Graph-of-Tables schema | HIGH | High | $0 | 2-3 days | #1 |