Case Study

Taking an AI Insurance Agent from MVP to Production

An AI insurance claims assistant was ready to ship — or so the team thought. No evals existed. No one knew what was silently failing. We went in, built a rigorous evaluation framework from scratch, and exposed 22 critical issues hiding beneath the surface. Then we fixed every one of them.

22

Issues Found

Traces Analyzed

+21.2pp

Pass Rate Improvement

<30s

Worst Latency (from 121s)

The Challenge

Every company is building AI. Very few know if it actually works.

This team had an AI-powered insurance claims assistant backed by a GraphRAG pipeline, Neo4j knowledge graph, and multi-agent orchestration. It worked in demos. Leadership wanted to ship it. But no one could answer the basic question: is this thing production-ready? There were no evals, no benchmarks, no systematic way to know what was working and what was quietly failing. That’s where we came in.

From Uncertainty to Confidence

Chapter 1

Uncovering the Truth

The MVP looked good in demos, but nobody had stress-tested it. We ran 52 test cases across 7 experiment suites and surfaced 22 hidden issues — raw JSON leaking to users, silent pipeline crashes, and queries taking over two minutes.


Chapter 2

Building the Eval Stack

You can’t fix what you can’t measure. We designed a production-grade evaluation framework with deterministic checks, custom LLM judges, and RAGAS faithfulness scoring — giving the team a repeatable way to know exactly where their system stands.
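To make "deterministic checks" concrete, here is a minimal sketch of the idea, not the framework built for this project. The `Trace` shape, the function names, and the 30-second budget (borrowed from the latency figure above) are assumptions for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass
class Trace:
    """One recorded interaction (hypothetical shape, for illustration)."""
    response: str
    latency_s: float


def leaks_raw_json(text: str) -> bool:
    """Return True if the text contains a parseable, non-empty JSON object."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and obj:
            return True
    return False


def run_checks(trace: Trace, max_latency_s: float = 30.0) -> dict[str, bool]:
    """Run rule-based checks that need no LLM judge; True means the check passed."""
    return {
        "no_json_leak": not leaks_raw_json(trace.response),
        "latency_ok": trace.latency_s < max_latency_s,  # assumed budget
    }


def pass_rate(traces: list[Trace]) -> float:
    """Fraction of traces that pass every deterministic check."""
    if not traces:
        return 0.0
    passed = sum(all(run_checks(t).values()) for t in traces)
    return passed / len(traces)
```

Checks like these are cheap and repeatable, so they can run on every trace before any LLM judge or faithfulness score is computed.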


Chapter 3

Deep Research

Evals showed the symptoms — but we needed to find the root cause. Deep research into the GraphRAG pipeline revealed the real problem: insurance benefit tables were being destroyed during PDF extraction, so dollar amounts never made it into the knowledge graph.
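A failure of this kind can be caught with a simple regression check that compares dollar amounts in the source text against what survives extraction. The sketch below is illustrative only: the regex, the function names, and the sample strings are assumptions, not the pipeline's actual code.

```python
import re

# Matches amounts such as $25, $1,500 or $250.00 (assumed US-style formatting).
_AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def dollar_amounts(text: str) -> set[str]:
    """Collect dollar amounts from text, normalized by dropping commas."""
    return {m.replace(",", "") for m in _AMOUNT.findall(text)}


def missing_amounts(source_text: str, extracted_text: str) -> set[str]:
    """Amounts present in the source but absent after extraction."""
    return dollar_amounts(source_text) - dollar_amounts(extracted_text)


if __name__ == "__main__":
    # Hypothetical benefit-table text and a damaged extraction of it.
    source = "Annual deductible: $1,500. Copay: $25.00 per visit."
    damaged = "Annual deductible Copay per visit"
    print(sorted(missing_amounts(source, damaged)))
```

A non-empty result flags a document whose benefit amounts were lost before they could reach the knowledge graph.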


Chapter 4

Production-Ready

With every critical issue identified and fixed — JSON leak guards, thread-safe infrastructure, latency tuning, multilingual support — the system went from an uncertain MVP to a production-ready AI agent at 95.9% reliability.


Built With

LangWatch, Neo4j, GraphRAG, Agno Agents, ChromaDB