Root Cause: The Extraction Gap

The accuracy problem (3/10 evals) is not a retrieval problem — it's an extraction problem.

PageIndex is structural only — it builds a hierarchical table of contents. It does NOT extract text, tables, or dollar amounts. The text layer uses PyMuPDF get_text("blocks"), which destroys table structure, turning multi-column benefit tables into interleaved garbage text.

The LangExtract path (Gemini vision) does extract structured PolicyFacts with dollar amounts, but takes 3+ hours. Without it, the graph has structure but no granular data.

PDF Input
|-- PageIndex Tree --> Neo4j Structure
|-- Text Extraction (tables destroyed)
\-- No PolicyFacts (LangExtract skipped) --> Missing $ amounts

Current Eval Results vs Projected

Metric                   Current  Projected
Answer Accuracy          3/10     8-9/10
Hallucination Detection  1/7      6-7/7
Cross-Claim Safety       8/8      no change needed
Prompt Injection         6/7      7/7

Current Architecture: What Works vs What's Missing

Already Built

  • PageIndex hierarchical tree (GPT-powered structural index)
  • Neo4j graph with PolicyDocument + PageIndexNode
  • PolicyFact data model with full provenance
  • PolicyEntity + PolicyRelationship extraction schemas
  • Lazy enrichment queue (SQLite-backed, resilient)
  • Hybrid retrieval: graph traversal + ChromaDB vector + RRF merge
  • Community summarization
  • Numeric signal detection (regex: $, %, copay, deductible)
  • Layout-aware text resolver (PyMuPDF blocks)
  • PageIndex tree search planner + node scoring

Missing / Broken

  • Table extraction — PyMuPDF linearizes tables, destroying column associations
  • PolicyFact generation — requires LangExtract (3+ hours, skipped)
  • Cell-to-fact linking — "$250" not associated with "Standard deductible"
  • Text2Cypher — no direct structured queries for dollar amounts
  • Full-text index in Neo4j — exact term matching for "$250 copay"
  • Table-aware chunking — tables split across chunks
  • VLM extraction for Summary of Benefits pages only

Improvement Strategies (Ranked by ROI)

3. Text2Cypher for Structured Lookups

Impact: HIGH Complexity: MEDIUM

Add a retrieval mode where the LLM translates natural language into Cypher queries against your graph schema. "What is the deductible for Standard?" becomes MATCH (f:PolicyFact) WHERE f.fact_class='cost_share_rule' AND f.plan_option='standard' AND f.title CONTAINS 'deductible' RETURN f.content.

Requires: Strategy 1 or 2 first (there must be structured data in the graph). Then expose the schema to the router agent as a tool. Your query router already classifies intent — route policy_question queries with numeric keywords to Text2Cypher.
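One way to keep this mode safe is slot-filling rather than free-form generation: the LLM extracts slots (fact class, plan option, search term) and the Cypher itself stays a fixed, parameterized template. A minimal sketch — `build_fact_query` and the property names follow the example query above but are otherwise hypothetical:

```python
# Parameterized template: LLM output goes into $params, never into the
# query string itself, so generated lookups stay injection-safe.
FACT_QUERY = (
    "MATCH (f:PolicyFact) "
    "WHERE f.fact_class = $fact_class "
    "AND f.plan_option = $plan_option "
    "AND toLower(f.title) CONTAINS toLower($term) "
    "RETURN f.title AS title, f.content AS content"
)

def build_fact_query(fact_class: str, plan_option: str, term: str):
    """Return the fixed Cypher template plus a parameter map built from
    LLM-extracted slots (a sketch, not the project's actual router API)."""
    params = {"fact_class": fact_class, "plan_option": plan_option, "term": term}
    return FACT_QUERY, params

# "What is the deductible for Standard?" -> slots -> query + params
query, params = build_fact_query("cost_share_rule", "standard", "deductible")
```

The tuple would then be passed to the Neo4j driver as `session.run(query, **params)`.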

4. Neo4j Full-Text Index for Exact Term Matching

Impact: MEDIUM Complexity: LOW

Create a Neo4j full-text index on PolicyFact content and normalized_text. Catches exact matches like "$250" or "20% coinsurance" that vector/semantic search may rank lower. Add as a third retrieval signal in your existing RRF merge.

One Cypher command: CREATE FULLTEXT INDEX policyFactContent FOR (f:PolicyFact) ON EACH [f.content, f.normalized_text, f.title]
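At query time the index is read via `CALL db.index.fulltext.queryNodes("policyFactContent", ...)`, and its hits become a third ranked list in the existing RRF merge. A self-contained sketch of that fusion step (document IDs are illustrative; the real merge presumably lives in the retrieval layer):

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1/(k + rank).
    k=60 is the conventional constant from the original RRF paper."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits   = ["fact_12", "fact_07", "fact_03"]   # ChromaDB semantic
graph_hits    = ["fact_07", "fact_12", "fact_09"]   # Cypher traversal
fulltext_hits = ["fact_03", "fact_07"]              # exact "$250" matches
merged = rrf_merge([vector_hits, graph_hits, fulltext_hits])
# fact_07 wins: it ranks highly in all three signals
```

Because RRF only consumes rank positions, the full-text signal drops in without any score normalization against the vector or graph scores.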

5. Contextual Retrieval Enrichment (Anthropic Pattern)

Impact: MEDIUM Complexity: LOW

Before embedding each PolicyFact, prepend a context sentence: "This fact describes dental benefits under the Standard Option plan." Disambiguates facts that look similar in embedding space. Already referenced in your PageIndex optimization plan.

Implementation: one LLM call per fact during ingestion; add the context prefix to the vector store document before embedding.
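The enrichment itself is a string prepend before embedding. A templated sketch — in the real pipeline the context sentence would come from the per-fact LLM call, and the field names here are assumptions:

```python
def contextualize_fact(fact_text: str, section_title: str, plan_option: str) -> str:
    """Prepend a one-sentence context prefix (Anthropic 'contextual
    retrieval' pattern) so similar-looking facts separate in embedding
    space. Templated here; an LLM-written sentence would replace it."""
    prefix = (f"This fact describes {section_title} under the "
              f"{plan_option} Option plan. ")
    return prefix + fact_text

doc = contextualize_fact("$250 per person per year.", "the deductible", "Standard")
```

The prefixed string is what gets embedded and stored in ChromaDB; the raw fact text is kept separately for display and citation.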

6. Microsoft GraphRAG DRIFT-Style Iterative Retrieval

Impact: MEDIUM Complexity: MEDIUM

DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) starts with community summaries for initial context, then iteratively refines by following entity links. Your optimization plan already describes this as "drift reasoning" mode.

Best for: Complex multi-hop questions like "Am I covered for an MRI if referred by an out-of-network provider?" that require combining benefit rules + provider rules + authorization requirements.
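The control flow is a small loop: seed from community summaries, then repeatedly expand entity links until nothing new turns up. A sketch under stated assumptions — the two helper callables stand in for the real graph and summary layers, and all names are illustrative:

```python
def drift_retrieve(question, get_community_context, expand_entities, max_hops=3):
    """DRIFT-style iterative retrieval: broad community context first,
    then follow entity links hop by hop until the frontier is exhausted
    or max_hops is reached. Helpers are injected stand-ins."""
    context, frontier = [], get_community_context(question)
    for _ in range(max_hops):
        new = [c for c in frontier if c not in context]
        if not new:
            break
        context.extend(new)
        frontier = expand_entities(new)
    return context

# Toy link structure for the MRI referral example: benefit rules lead to
# provider rules, which lead to authorization requirements.
links = {"benefit_rules": ["provider_rules"],
         "provider_rules": ["auth_requirements"],
         "auth_requirements": []}
context = drift_retrieve(
    "Am I covered for an MRI if referred out-of-network?",
    get_community_context=lambda q: ["benefit_rules"],
    expand_entities=lambda items: [t for i in items for t in links[i]])
```

Each hop would, in practice, run a Cypher expansion plus a relevance filter before admitting nodes to the context.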

7. Graph-of-Tables Schema

Impact: HIGH Complexity: HIGH

Represent tables as first-class graph structures: Table → Row → Cell nodes with Column Header schema. Enables Cypher like: MATCH (t:Table)-[:HAS_ROW]->(r)-[:HAS_CELL]->(c) WHERE c.column='Basic Copay' AND r.service='Office Visit' RETURN c.value

Maximum power but maximum effort. Consider this for Phase 2 if simpler approaches don't reach 9+/10 accuracy.
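On the ingestion side, each extracted cell becomes one parameterized MERGE. A minimal sketch of that write path, assuming id/service/column/value properties (all names here are illustrative, not an existing schema):

```python
# Hypothetical upsert for the Table -> Row -> Cell shape. MERGE keeps the
# write idempotent if the same table is re-ingested.
CREATE_CELL = """
MERGE (t:Table {id: $table_id})
MERGE (t)-[:HAS_ROW]->(r:Row {service: $service})
MERGE (r)-[:HAS_CELL]->(c:Cell {column: $column})
SET c.value = $value
"""

def cell_params(table_id: str, service: str, column: str, value: str) -> dict:
    """Parameter map for one table cell; run as session.run(CREATE_CELL, **params)."""
    return {"table_id": table_id, "service": service,
            "column": column, "value": value}

params = cell_params("sob-2024", "Office Visit", "Basic Copay", "$30")
```

With cells stored this way, the lookup query in the paragraph above resolves "$30" directly instead of hoping a chunk boundary kept row and column together.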


Implementation Priority Matrix

Priority  Strategy                             Impact  Complexity  LLM Cost    Time      Depends On
1         PyMuPDF table extraction + triplets  HIGH    Low         $0          2-4 hrs   None
2         VLM extraction on SoB pages only     HIGH    Low-Med     ~$0.02/doc  3-5 hrs   PageIndex titles
3         Neo4j full-text index                MED     Low         $0          1 hr      #1 or #2
4         Text2Cypher structured lookups       HIGH    Med         Per query   4-6 hrs   #1 or #2
5         Contextual retrieval enrichment      MED     Low         Per fact    2 hrs     #1 or #2
6         DRIFT iterative retrieval            MED     Med         Per query   6-8 hrs   Community summaries
7         Graph-of-Tables schema               HIGH    High        $0          2-3 days  #1

Proposed Ingestion Architecture (with table extraction)

PDF Input
|
|-- [PageIndex (GPT)] Build hierarchical tree (existing)
|   \-- Neo4j: PolicyDocument + PageIndexNode
|
|-- [PyMuPDF find_tables()] Extract structured tables (NEW - Strategy 1)
|   |-- Convert to triplets: (service, plan, value)
|   |-- Link to parent PageIndexNode by page range
|   \-- Store as PolicyFact records in Neo4j
|
|-- [VLM Targeted Extraction] SoB pages only (~5-10pp) (NEW - Strategy 2)
|   |-- Identify SoB pages via PageIndex node titles
|   |-- Render as images -- Gemini Flash / Claude
|   |-- Structured output: BenefitRow[]
|   \-- Store as PolicyFact records (higher confidence)
|
\-- [Layout-Aware Text Resolver] Block-level text (existing)
    \-- Numeric signal detection, deterministic tagging

Retrieval (enhanced)
|
|-- [PageIndex Tree Planner] Structural narrowing (existing)
|-- [Graph Fact Retrieval] PolicyFact via Cypher (existing - now has table data)
|-- [Neo4j Full-Text Index] Exact term matching (NEW - Strategy 4)
|-- [ChromaDB Vector Search] Semantic similarity (existing)
\-- [RRF Merge] Reciprocal Rank Fusion (existing)

Key Insight from Research

The Fastest Path: PyMuPDF Tables + Targeted VLM

No amount of retrieval improvement will find dollar amounts that were never extracted.

Strategy 1 (PyMuPDF find_tables()) costs zero LLM tokens and runs in seconds. It handles well-formatted tables with ruled lines — which is how most insurance SoB tables are formatted.
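A sketch of that path: `page.find_tables()` (PyMuPDF >= 1.23) returns detected tables whose `extract()` yields rows of cell text, and a small pure function turns each table into (service, plan, value) triplets. The triplet conversion is mine, not the project's `policy_fact_builder.py` API:

```python
def rows_to_triplets(rows):
    """Convert an extracted table (first row assumed to be the header)
    into (service, plan_column, value) triplets, skipping empty cells.
    PyMuPDF emits None for empty cells, which the truthiness check drops."""
    header, triplets = rows[0], []
    for row in rows[1:]:
        service = row[0]
        for col_name, value in zip(header[1:], row[1:]):
            if value:
                triplets.append((service, col_name, value))
    return triplets

def extract_pdf_tables(path):
    """Zero-LLM table extraction via PyMuPDF's built-in table finder.
    Imported lazily so the triplet logic stays testable without the library."""
    import pymupdf  # pip install pymupdf
    triplets = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                triplets.extend(rows_to_triplets(table.extract()))
    return triplets

rows = [
    ["Service", "Standard Option", "Basic Option"],
    ["Office Visit", "$25", "$30"],
    ["MRI", "$100", None],
]
triplets = rows_to_triplets(rows)
```

Each triplet then maps onto a PolicyFact record, with the source page number available from the enclosing page loop for provenance.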

Strategy 2 (targeted VLM) handles the edge cases: decorative layouts, image-based tables, merged cells. Running on just 5-10 SoB pages takes ~30 seconds with Gemini Flash vs 3+ hours for full LangExtract.
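Page targeting can be deterministic before any VLM call: match PageIndex node titles against a small pattern and collect their page ranges. A sketch assuming nodes arrive as dicts with 'title' and 'pages' keys (an illustrative shape, not the actual PageIndexNode schema):

```python
import re

SOB_PATTERN = re.compile(r"summary of benefits|benefit summary", re.IGNORECASE)

def select_sob_pages(nodes):
    """Return the sorted, de-duplicated page numbers of PageIndex nodes
    whose titles look like a Summary of Benefits section. Only these
    pages get rendered (e.g. page.get_pixmap(dpi=200)) and sent to the VLM."""
    pages = set()
    for node in nodes:
        if SOB_PATTERN.search(node["title"]):
            pages.update(node["pages"])
    return sorted(pages)

nodes = [
    {"title": "Section 5. Medical Benefits", "pages": [20, 21]},
    {"title": "Summary of Benefits", "pages": [5, 6, 4]},
]
pages = select_sob_pages(nodes)
```

Restricting the render-and-prompt loop to this page list is what keeps the VLM pass at seconds and cents instead of the 3+ hour full-document run.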

Combined, these two strategies address the root cause directly. Everything else (Text2Cypher, full-text index, DRIFT) improves retrieval of data that's already in the graph — important, but secondary.

Your existing PolicyFact model, enrichment queue, and Neo4j graph are the right storage layer. The extracted table triplets plug directly into policy_fact_builder.py.

Research sources: VectifyAI/PageIndex repo (vendored), Microsoft GraphRAG (github.com/microsoft/graphrag), Neo4j GraphRAG patterns, Anthropic contextual retrieval, project docs (pageindex-policy-search-optimization-plan.md, langextract-graphrag-plan.md). Codebase analysis: aegisclaim-extract, aegisclaim-graph, policy_graphrag.py.