Case Study
An AI insurance claims assistant was ready to ship, or so the team thought. No evals existed. No one knew what was silently failing. We went in, built a rigorous evaluation framework from scratch, and exposed 22 critical issues hiding beneath the surface. Then we fixed every one of them.
The Challenge
Every company is building AI. Very few know if it actually works.
This team had an AI-powered insurance claims assistant backed by a GraphRAG pipeline, a Neo4j knowledge graph, and multi-agent orchestration. It worked in demos. Leadership wanted to ship it. But no one could answer the basic question: is it production-ready? There were no evals, no benchmarks, and no systematic way to tell what was working from what was quietly failing. That’s where we came in.
Built With