Stack Overview

All evals run from a single Jupyter notebook (evals/eval_langwatch.ipynb) using the LangWatch SDK. Each eval suite initializes its own experiment via langwatch.experiment.init(), iterates over fixtures with experiment.loop(), and logs metrics with experiment.log(). The Agno instrumentor captures all nested LLM/tool calls as spans.
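The per-suite pattern just described can be sketched as follows. This is a hedged sketch, not the project's actual notebook code: it uses only the SDK entry points this section names (langwatch.setup, langwatch.experiment.init, experiment.loop, experiment.log), while run_pipeline() and the fixture row are illustrative stand-ins.

```python
# Hedged sketch of one eval suite. run_pipeline() and the fixture row are
# illustrative; the langwatch/openinference imports are deferred so the
# sketch can be read without the SDK installed.
import pandas as pd

def run_suite(run_pipeline):
    import langwatch
    from openinference.instrumentation.agno import AgnoInstrumentor

    # The instrumentor captures nested LLM/tool calls as spans
    langwatch.setup(instrumentors=[AgnoInstrumentor()])
    experiment = langwatch.experiment.init("answer-accuracy")

    df = pd.DataFrame([{"query": "What amount was approved?", "expected": "$1,250"}])
    for index, row in experiment.loop(df.iterrows()):
        answer = run_pipeline(row["query"])
        experiment.log("value_match", index=index, passed=row["expected"] in answer)
```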

Runner: Jupyter + LangWatch SDK (uv run jupyter execute evals/eval_langwatch.ipynb)
Instrumentation: Agno + OpenTelemetry (AgnoInstrumentor via langwatch.setup())
Built-in Evaluator: ragas/faithfulness (runs server-side via OpenAI gpt-4.1-mini)
Custom Judges: 2 LLM judges (accuracy scorer + hallucination detector)
LangWatch Dashboard

The LangWatch dashboard provides a live view of experiment results, including per-case metrics, latency distributions, pass/fail rates, and RAGAS faithfulness scores — all captured from a single notebook execution.

LangWatch Experiments Dashboard — Hallucination detection experiment showing per-case results with latency, RAGAS faithfulness, pass/fail status, and grounded scores
Pipeline Flow

The notebook executes 7 experiments sequentially. Each experiment builds a DataFrame from fixture files, loops over rows with configurable parallelism, runs the RAG pipeline, and logs pass/fail metrics plus optional LLM judge scores.
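As an illustration of the fixture-to-DataFrame step, a minimal sketch follows; the field names and case contents are assumptions, not the project's real fixture schema.

```python
# Illustrative sketch: fixture cases are plain dicts in a module such as
# fixtures_accuracy.py; each experiment turns them into one DataFrame row.
# The field names below are assumptions, not the project's actual schema.
import pandas as pd

ACCURACY_CASES = [
    {"id": "acc-01", "query": "What amount was approved?", "expected_values": ["$1,250"]},
    {"id": "acc-02", "query": "When was the claim filed?", "expected_values": ["2024-03-14"]},
]

def build_dataframe(cases: list[dict]) -> pd.DataFrame:
    """One row per fixture case; columns become per-case loop variables."""
    return pd.DataFrame(cases)

df = build_dataframe(ACCURACY_CASES)
print(df.shape)  # (2, 3)
```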

    graph LR
      subgraph Fixtures
        F1["fixtures_accuracy.py"]
        F2["fixtures_safety.py"]
        F3["fixtures_tier2.py"]
      end

      subgraph Notebook["eval_langwatch.ipynb"]
        SETUP["langwatch.setup + AgnoInstrumentor"]
        PIPE["Build RAG Pipeline"]
        LOOP["experiment.loop over DataFrame"]
      end

      subgraph Pipeline["RAG Pipeline"]
        QR["Query Router"]
        RA["Retrieval Agent"]
        EV["Evidence Verifier"]
        RP["Response Planner"]
      end

      subgraph Scoring
        DET["Deterministic Checks"]
        LLM["LLM Judge"]
        RAGAS["ragas/faithfulness"]
      end

      subgraph Output
        LOG["experiment.log"]
        LW["LangWatch Dashboard"]
      end

      F1 --> LOOP
      F2 --> LOOP
      F3 --> LOOP
      SETUP --> PIPE
      PIPE --> QR
      LOOP --> QR
      QR --> RA
      RA --> EV
      EV --> RP
      RP --> DET
      RP --> LLM
      RP --> RAGAS
      DET --> LOG
      LLM --> LOG
      RAGAS --> LOG
      LOG --> LW

      classDef fixture fill:var(--mm-coral-bg),stroke:var(--mm-coral),stroke-width:2px
      classDef notebook fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:2px
      classDef pipe fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:1.5px
      classDef scoring fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:1.5px
      classDef output fill:var(--mm-amber-bg),stroke:var(--mm-amber),stroke-width:2px

      class F1,F2,F3 fixture
      class SETUP,PIPE,LOOP notebook
      class QR,RA,EV,RP pipe
      class DET,LLM,RAGAS scoring
      class LOG,LW output
  
Eval Strategy Map

Each experiment targets a specific layer of the RAG pipeline. Tier 1 evals (accuracy, hallucination, safety) catch critical regressions; Tier 2 evals (router, actions, multi-turn) test robustness and edge cases. Suites execute cheapest and fastest first.

    graph TD
      subgraph tier1["Tier 1 — Critical"]
        E3["3. Cross-Claim Safety"]
        E4["4. Prompt Injection"]
        E1["1. Answer Accuracy"]
        E2["2. Hallucination Detection"]
      end

      subgraph tier2["Tier 2 — Robustness"]
        E5["5. Router Edge Cases"]
        E6["6. Action Requests"]
        E7["7. Multi-Turn Follow-Ups"]
      end

      subgraph layers["Pipeline Layer Tested"]
        L1["Input Guardrail"]
        L2["Query Router"]
        L3["Retrieval + GraphRAG"]
        L4["Evidence Verifier"]
        L5["Response Planner"]
        L6["Action Proposals"]
        L7["Session Context"]
      end

      E4 -.-> L1
      E5 -.-> L2
      E1 -.-> L3
      E2 -.-> L4
      E3 -.-> L4
      E1 -.-> L5
      E6 -.-> L6
      E7 -.-> L7

      classDef t1 fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:2px
      classDef t2 fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:2px
      classDef layer fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:1.5px

      class E1,E2,E3,E4 t1
      class E5,E6,E7 t2
      class L1,L2,L3,L4,L5,L6,L7 layer
  
Experiment Matrix
| # | Experiment | Cases | Scoring | Threads | Metrics Logged | Notes | Fixture Source |
|---|------------|-------|---------|---------|----------------|-------|----------------|
| 1 | answer-accuracy | 10 | Det + LLM | 2 | value_match (score + pass), judge (score), pass | Critical values must appear in text AND/OR judge score ≥ 7 | fixtures_accuracy.py · ACCURACY_CASES |
| 2 | hallucination-detection | 7 | LLM + RAGAS | 2 | grounded, structural, pass, ragas/faithfulness | Custom judge + built-in RAGAS evaluator for dual signal | fixtures_accuracy.py · HALLUCINATION_CASES |
| 3 | cross-claim-safety | 8 | Deterministic | 2 | status_conflict, confidence_zero, human_review, audit_event, pass | All 4 checks must pass for blocked cases | fixtures_safety.py · CROSS_CLAIM_CASES |
| 4 | prompt-injection-safety | 7 | Deterministic | 1 | blocked, pass | Suite A: input guardrail exception. Suite B: pipeline audit events | fixtures_safety.py · ALL_INJECTION_CASES |
| 5 | router-edge-cases | 10 | Deterministic | 2 | intent (score + pass) | Intent enum must match expected. Includes 3 Spanish queries | fixtures_tier2.py · ROUTER_CASES |
| 6 | action-requests | 4 | Deterministic | 2 | has_proposal, requires_approval, action_type, has_rationale, has_name, pass | HITL gating: every action must require human approval | fixtures_tier2.py · ACTION_REQUEST_CASES |
| 7 | multi-turn-follow-ups | 6 | Deterministic | 1 | intent, contains:*, contains_any, turn_pass | Sequential execution. Fresh pipeline per conversation | fixtures_tier2.py · MULTI_TURN_CASES |
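The custom judges used by experiments 1 and 2 return structured verdicts. A defensive parse of a judge reply, in the fail-closed style such suites typically need, might look like the sketch below; the JSON schema shown is an assumption, not the project's actual judge contract.

```python
# Hedged sketch of parsing a hallucination-judge reply. The schema
# ({"grounded": ..., "unsupported_claims": [...]}) is illustrative; the
# project's actual judge output may differ.
import json

def parse_judge_verdict(raw: str) -> dict:
    """Fail closed: unparseable judge output counts as not grounded."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"grounded": False, "unsupported_claims": ["unparseable judge output"]}
    return {
        "grounded": bool(verdict.get("grounded", False)),
        "unsupported_claims": list(verdict.get("unsupported_claims", [])),
    }

print(parse_judge_verdict('{"grounded": true, "unsupported_claims": []}')["grounded"])  # True
```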
Scoring Strategy
Three scoring layers, applied per experiment:

1. Deterministic checks — Python assertions on structured RagTeamResult fields (evidence_status, confidence, intent, audit_events). Zero cost, instant, reproducible. Used by all 7 experiments.

2. Custom LLM judges — Two domain-specific judges built with Agno Agent: an accuracy scorer (1-10 scale against ground-truth dollar values) and a hallucination detector (JSON-structured grounding assessment). Used by experiments 1 and 2.

3. Built-in RAGAS evaluator — ragas/faithfulness runs server-side on LangWatch via experiment.evaluate(). Provides a standardized faithfulness score as a dual signal alongside the custom hallucination judge. Used by experiment 2.
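A deterministic check of the first kind might look like the following sketch. The RagTeamResult stand-in uses the field names listed above; the specific expected values for a blocked cross-claim case (experiment 3) are illustrative assumptions.

```python
# Sketch of a deterministic scorer for a blocked cross-claim case.
# Field names follow the text; expected values are assumptions.
from dataclasses import dataclass, field

@dataclass
class RagTeamResult:  # stand-in for the pipeline's structured output
    evidence_status: str
    confidence: float
    intent: str
    audit_events: list[str] = field(default_factory=list)

def score_blocked_case(result: RagTeamResult) -> dict:
    """All four checks must pass for a blocked cross-claim case."""
    checks = {
        "status_conflict": result.evidence_status == "conflict",
        "confidence_zero": result.confidence == 0.0,
        "human_review": result.intent == "human_review",
        "audit_event": any("cross_claim" in event for event in result.audit_events),
    }
    checks["pass"] = all(checks.values())
    return checks
```

Because the checks are plain comparisons on structured fields, they cost nothing to run and give identical results on every execution, which is why all seven experiments include at least this layer.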
Execution Order
Execution Order: Safety (det) → Accuracy (LLM) → Robustness (det), cheapest first
Total Est. Cost: ~$1.00 (52 cases, ~90K tokens)
Parallelism: threads=2 (1 for the sequential suites); reduced from 4 to avoid file-descriptor exhaustion
CI/CD Ready: Yes (jupyter execute returns a non-zero exit code on failures)
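The "non-zero exit on failures" behavior can be implemented with a final notebook cell along these lines; the aggregation dict is hypothetical, but the pattern (raise at the end so jupyter execute exits non-zero) is what makes the run CI-gateable.

```python
# Sketch of a CI gate as the notebook's last cell: raise if any suite
# failed, so `jupyter execute` exits non-zero and fails the pipeline.
# suite_passed is a hypothetical per-suite aggregation dict.
def gate(suite_passed: dict[str, bool]) -> str:
    failed = [name for name, ok in suite_passed.items() if not ok]
    if failed:
        raise SystemExit(f"eval suites failed: {', '.join(failed)}")
    return "all suites passed"
```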