Root Cause: The Extraction Gap

The accuracy problem (3/10 evals) is not a retrieval problem — it's an extraction problem.

PageIndex is structural only — it builds a hierarchical table of contents. It does NOT extract text, tables, or dollar amounts. The text layer uses PyMuPDF get_text("blocks"), which destroys table structure, turning multi-column benefit tables into interleaved garbage text.

The LangExtract path (Gemini vision) does extract structured PolicyFacts with dollar amounts, but takes 3+ hours. Without it, the graph has structure but no granular data.

PDF Input
|-- PageIndex Tree --> Neo4j Structure
|-- Text Extraction (tables destroyed)
\-- No PolicyFacts (LangExtract skipped) --> Missing $ amounts

Current Eval Results vs Projected

Metric                   Current  Projected
Answer Accuracy          3/10     8-9/10
Hallucination Detection  1/7      6-7/7
Cross-Claim Safety       8/8      no change needed
Prompt Injection         6/7      7/7

Current Architecture: What Works vs What's Missing

Already Built

  • PageIndex hierarchical tree (GPT-powered structural index)
  • Neo4j graph with PolicyDocument + PageIndexNode
  • PolicyFact data model with full provenance
  • PolicyEntity + PolicyRelationship extraction schemas
  • Lazy enrichment queue (SQLite-backed, resilient)
  • Hybrid retrieval: graph traversal + ChromaDB vector + RRF merge
  • Community summarization
  • Numeric signal detection (regex: $, %, copay, deductible)
  • Layout-aware text resolver (PyMuPDF blocks)
  • PageIndex tree search planner + node scoring

Missing / Broken

  • Table extraction — PyMuPDF linearizes tables, destroying column associations
  • PolicyFact generation — requires LangExtract (3+ hours, skipped)
  • Cell-to-fact linking — "$250" not associated with "Standard deductible"
  • Text2Cypher — no direct structured queries for dollar amounts
  • Full-text index in Neo4j — exact term matching for "$250 copay"
  • Table-aware chunking — tables split across chunks
  • VLM extraction for Summary of Benefits pages only

Improvement Strategies (Ranked by ROI)

3. Text2Cypher for Structured Lookups

Impact: HIGH Complexity: MEDIUM

Add a retrieval mode where the LLM translates natural language into Cypher queries against your graph schema. "What is the deductible for Standard?" becomes MATCH (f:PolicyFact) WHERE f.fact_class='cost_share_rule' AND f.plan_option='standard' AND f.title CONTAINS 'deductible' RETURN f.content.

Requires: Strategy 1 or 2 first (there must be structured data in the graph). Then expose the schema to the router agent as a tool. Your query router already classifies intent — route policy_question queries with numeric keywords to Text2Cypher.
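One way to keep this mode safe is slot-filling rather than free-form generation: the LLM extracts slots (fact class, plan option, search term) and the Cypher itself stays a fixed, parameterized template. A minimal sketch — `build_fact_query` and the property names follow the example query above but are otherwise hypothetical:

```python
# Parameterized template: LLM output goes into $params, never into the
# query string itself, so generated lookups stay injection-safe.
FACT_QUERY = (
    "MATCH (f:PolicyFact) "
    "WHERE f.fact_class = $fact_class "
    "AND f.plan_option = $plan_option "
    "AND toLower(f.title) CONTAINS toLower($term) "
    "RETURN f.title AS title, f.content AS content"
)

def build_fact_query(fact_class: str, plan_option: str, term: str):
    """Return the fixed Cypher template plus a parameter map built from
    LLM-extracted slots (a sketch, not the project's actual router API)."""
    params = {"fact_class": fact_class, "plan_option": plan_option, "term": term}
    return FACT_QUERY, params

# "What is the deductible for Standard?" -> slots -> query + params
query, params = build_fact_query("cost_share_rule", "standard", "deductible")
```

The tuple would then be passed to the Neo4j driver as `session.run(query, **params)`.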

4. Neo4j Full-Text Index for Exact Term Matching

Impact: MEDIUM Complexity: LOW

Create a Neo4j full-text index on PolicyFact content and normalized_text. Catches exact matches like "$250" or "20% coinsurance" that vector/semantic search may rank lower. Add as a third retrieval signal in your existing RRF merge.

One Cypher command: CREATE FULLTEXT INDEX policyFactContent FOR (f:PolicyFact) ON EACH [f.content, f.normalized_text, f.title]
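At query time the index is read via `CALL db.index.fulltext.queryNodes("policyFactContent", ...)`, and its hits become a third ranked list in the existing RRF merge. A self-contained sketch of that fusion step (document IDs are illustrative; the real merge presumably lives in the retrieval layer):

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1/(k + rank).
    k=60 is the conventional constant from the original RRF paper."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits   = ["fact_12", "fact_07", "fact_03"]   # ChromaDB semantic
graph_hits    = ["fact_07", "fact_12", "fact_09"]   # Cypher traversal
fulltext_hits = ["fact_03", "fact_07"]              # exact "$250" matches
merged = rrf_merge([vector_hits, graph_hits, fulltext_hits])
# fact_07 wins: it ranks highly in all three signals
```

Because RRF only consumes rank positions, the full-text signal drops in without any score normalization against the vector or graph scores.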

5. Contextual Retrieval Enrichment (Anthropic Pattern)

Impact: MEDIUM Complexity: LOW

Before embedding each PolicyFact, prepend a context sentence: "This fact describes dental benefits under the Standard Option plan." Disambiguates facts that look similar in embedding space. Already referenced in your PageIndex optimization plan.

Implementation: one LLM call per fact during ingestion; add the context prefix to the vector store document before embedding.
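The enrichment itself is a string prepend before embedding. A templated sketch — in the real pipeline the context sentence would come from the per-fact LLM call, and the field names here are assumptions:

```python
def contextualize_fact(fact_text: str, section_title: str, plan_option: str) -> str:
    """Prepend a one-sentence context prefix (Anthropic 'contextual
    retrieval' pattern) so similar-looking facts separate in embedding
    space. Templated here; an LLM-written sentence would replace it."""
    prefix = (f"This fact describes {section_title} under the "
              f"{plan_option} Option plan. ")
    return prefix + fact_text

doc = contextualize_fact("$250 per person per year.", "the deductible", "Standard")
```

The prefixed string is what gets embedded and stored in ChromaDB; the raw fact text is kept separately for display and citation.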

6. Microsoft GraphRAG DRIFT-Style Iterative Retrieval

Impact: MEDIUM Complexity: MEDIUM

DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) starts with community summaries for initial context, then iteratively refines by following entity links. Your optimization plan already describes this as "drift reasoning" mode.

Best for: Complex multi-hop questions like "Am I covered for an MRI if referred by an out-of-network provider?" that require combining benefit rules + provider rules + authorization requirements.
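The control flow is a small loop: seed from community summaries, then repeatedly expand entity links until nothing new turns up. A sketch under stated assumptions — the two helper callables stand in for the real graph and summary layers, and all names are illustrative:

```python
def drift_retrieve(question, get_community_context, expand_entities, max_hops=3):
    """DRIFT-style iterative retrieval: broad community context first,
    then follow entity links hop by hop until the frontier is exhausted
    or max_hops is reached. Helpers are injected stand-ins."""
    context, frontier = [], get_community_context(question)
    for _ in range(max_hops):
        new = [c for c in frontier if c not in context]
        if not new:
            break
        context.extend(new)
        frontier = expand_entities(new)
    return context

# Toy link structure for the MRI referral example: benefit rules lead to
# provider rules, which lead to authorization requirements.
links = {"benefit_rules": ["provider_rules"],
         "provider_rules": ["auth_requirements"],
         "auth_requirements": []}
context = drift_retrieve(
    "Am I covered for an MRI if referred out-of-network?",
    get_community_context=lambda q: ["benefit_rules"],
    expand_entities=lambda items: [t for i in items for t in links[i]])
```

Each hop would, in practice, run a Cypher expansion plus a relevance filter before admitting nodes to the context.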

7. Graph-of-Tables Schema

Impact: HIGH Complexity: HIGH

Represent tables as first-class graph structures: Table → Row → Cell nodes with Column Header schema. Enables Cypher like: MATCH (t:Table)-[:HAS_ROW]->(r)-[:HAS_CELL]->(c) WHERE c.column='Basic Copay' AND r.service='Office Visit' RETURN c.value

Maximum power but maximum effort. Consider this for Phase 2 if simpler approaches don't reach 9+/10 accuracy.
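On the ingestion side, each extracted cell becomes one parameterized MERGE. A minimal sketch of that write path, assuming id/service/column/value properties (all names here are illustrative, not an existing schema):

```python
# Hypothetical upsert for the Table -> Row -> Cell shape. MERGE keeps the
# write idempotent if the same table is re-ingested.
CREATE_CELL = """
MERGE (t:Table {id: $table_id})
MERGE (t)-[:HAS_ROW]->(r:Row {service: $service})
MERGE (r)-[:HAS_CELL]->(c:Cell {column: $column})
SET c.value = $value
"""

def cell_params(table_id: str, service: str, column: str, value: str) -> dict:
    """Parameter map for one table cell; run as session.run(CREATE_CELL, **params)."""
    return {"table_id": table_id, "service": service,
            "column": column, "value": value}

params = cell_params("sob-2024", "Office Visit", "Basic Copay", "$30")
```

With cells stored this way, the lookup query in the paragraph above resolves "$30" directly instead of hoping a chunk boundary kept row and column together.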


Implementation Priority Matrix

Priority  Strategy                             Impact  Complexity  LLM Cost    Time      Depends On
1         PyMuPDF table extraction + triplets  HIGH    Low         $0          2-4 hrs   None
2         VLM extraction on SoB pages only     HIGH    Low-Med     ~$0.02/doc  3-5 hrs   PageIndex titles
3         Neo4j full-text index                MED     Low         $0          1 hr      #1 or #2
4         Text2Cypher structured lookups       HIGH    Med         Per query   4-6 hrs   #1 or #2
5         Contextual retrieval enrichment      MED     Low         Per fact    2 hrs     #1 or #2
6         DRIFT iterative retrieval            MED     Med         Per query   6-8 hrs   Community summaries
7         Graph-of-Tables schema               HIGH    High        $0          2-3 days  #1

Proposed Ingestion Architecture (with table extraction)

PDF Input
|
|-- [PageIndex (GPT)] Build hierarchical tree (existing)
|   \-- Neo4j: PolicyDocument + PageIndexNode
|
|-- [PyMuPDF find_tables()] Extract structured tables (NEW - Strategy 1)
|   |-- Convert to triplets: (service, plan, value)
|   |-- Link to parent PageIndexNode by page range
|   \-- Store as PolicyFact records in Neo4j
|
|-- [VLM Targeted Extraction] SoB pages only (~5-10pp) (NEW - Strategy 2)
|   |-- Identify SoB pages via PageIndex node titles
|   |-- Render as images -- Gemini Flash / Claude
|   |-- Structured output: BenefitRow[]
|   \-- Store as PolicyFact records (higher confidence)
|
\-- [Layout-Aware Text Resolver] Block-level text (existing)
    \-- Numeric signal detection, deterministic tagging

Retrieval (enhanced)
|
|-- [PageIndex Tree Planner] Structural narrowing (existing)
|-- [Graph Fact Retrieval] PolicyFact via Cypher (existing - now has table data)
|-- [Neo4j Full-Text Index] Exact term matching (NEW - Strategy 4)
|-- [ChromaDB Vector Search] Semantic similarity (existing)
\-- [RRF Merge] Reciprocal Rank Fusion (existing)

Key Insight from Research

The Fastest Path: PyMuPDF Tables + Targeted VLM

No amount of retrieval improvement will find dollar amounts that were never extracted.

Strategy 1 (PyMuPDF find_tables()) costs zero LLM tokens and runs in seconds. It handles well-formatted tables with ruled lines — which is how most insurance SoB tables are formatted.
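A sketch of that path: `page.find_tables()` (PyMuPDF >= 1.23) returns detected tables whose `extract()` yields rows of cell text, and a small pure function turns each table into (service, plan, value) triplets. The triplet conversion is mine, not the project's `policy_fact_builder.py` API:

```python
def rows_to_triplets(rows):
    """Convert an extracted table (first row assumed to be the header)
    into (service, plan_column, value) triplets, skipping empty cells.
    PyMuPDF emits None for empty cells, which the truthiness check drops."""
    header, triplets = rows[0], []
    for row in rows[1:]:
        service = row[0]
        for col_name, value in zip(header[1:], row[1:]):
            if value:
                triplets.append((service, col_name, value))
    return triplets

def extract_pdf_tables(path):
    """Zero-LLM table extraction via PyMuPDF's built-in table finder.
    Imported lazily so the triplet logic stays testable without the library."""
    import pymupdf  # pip install pymupdf
    triplets = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                triplets.extend(rows_to_triplets(table.extract()))
    return triplets

rows = [
    ["Service", "Standard Option", "Basic Option"],
    ["Office Visit", "$25", "$30"],
    ["MRI", "$100", None],
]
triplets = rows_to_triplets(rows)
```

Each triplet then maps onto a PolicyFact record, with the source page number available from the enclosing page loop for provenance.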

Strategy 2 (targeted VLM) handles the edge cases: decorative layouts, image-based tables, merged cells. Running on just 5-10 SoB pages takes ~30 seconds with Gemini Flash vs 3+ hours for full LangExtract.
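Page targeting can be deterministic before any VLM call: match PageIndex node titles against a small pattern and collect their page ranges. A sketch assuming nodes arrive as dicts with 'title' and 'pages' keys (an illustrative shape, not the actual PageIndexNode schema):

```python
import re

SOB_PATTERN = re.compile(r"summary of benefits|benefit summary", re.IGNORECASE)

def select_sob_pages(nodes):
    """Return the sorted, de-duplicated page numbers of PageIndex nodes
    whose titles look like a Summary of Benefits section. Only these
    pages get rendered (e.g. page.get_pixmap(dpi=200)) and sent to the VLM."""
    pages = set()
    for node in nodes:
        if SOB_PATTERN.search(node["title"]):
            pages.update(node["pages"])
    return sorted(pages)

nodes = [
    {"title": "Section 5. Medical Benefits", "pages": [20, 21]},
    {"title": "Summary of Benefits", "pages": [5, 6, 4]},
]
pages = select_sob_pages(nodes)
```

Restricting the render-and-prompt loop to this page list is what keeps the VLM pass at seconds and cents instead of the 3+ hour full-document run.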

Combined, these two strategies address the root cause directly. Everything else (Text2Cypher, full-text index, DRIFT) improves retrieval of data that's already in the graph — important, but secondary.

Your existing PolicyFact model, enrichment queue, and Neo4j graph are the right storage layer. The extracted table triplets plug directly into policy_fact_builder.py.

Research sources: VectifyAI/PageIndex repo (vendored), Microsoft GraphRAG (github.com/microsoft/graphrag), Neo4j GraphRAG patterns, Anthropic contextual retrieval, project docs (pageindex-policy-search-optimization-plan.md, langextract-graphrag-plan.md). Codebase analysis: aegisclaim-extract, aegisclaim-graph, policy_graphrag.py.