Case Study

Taking an AI Insurance Agent from MVP to Production

An AI insurance claims assistant was ready to ship — or so the team thought. No evals existed. No one knew what was silently failing. We went in, built a rigorous evaluation framework from scratch, and exposed 22 critical issues hiding beneath the surface. Then we fixed every one of them.

22 Issues Found
87 Traces Analyzed
+21.2pp Pass Rate Improvement
<30s Worst Latency (from 121s)

The Challenge

Every company is building AI. Very few know if it actually works.

This team had an AI-powered insurance claims assistant backed by a GraphRAG pipeline, a Neo4j knowledge graph, and multi-agent orchestration. It worked in demos. Leadership wanted to ship it. But no one could answer the basic question: is this thing production-ready? There were no evals, no benchmarks, and no systematic way to know what was working and what was quietly failing. That’s where we came in.

From Uncertainty to Confidence

Built With

LangWatch, Neo4j, GraphRAG, Agno Agents, ChromaDB