Stack Overview

All evals run from a single Jupyter notebook (evals/eval_langwatch.ipynb) using the LangWatch SDK. Each eval suite initializes its own experiment via langwatch.experiment.init(), iterates over fixtures with experiment.loop(), and logs metrics with experiment.log(). The Agno instrumentor captures all nested LLM/tool calls as spans.
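The per-suite pattern just described can be sketched as follows. This is a hedged sketch, not the project's actual notebook code: it uses only the SDK entry points this section names (langwatch.setup, langwatch.experiment.init, experiment.loop, experiment.log), while run_pipeline() and the fixture row are illustrative stand-ins.

```python
# Hedged sketch of one eval suite. run_pipeline() and the fixture row are
# illustrative; the langwatch/openinference imports are deferred so the
# sketch can be read without the SDK installed.
import pandas as pd

def run_suite(run_pipeline):
    import langwatch
    from openinference.instrumentation.agno import AgnoInstrumentor

    # The instrumentor captures nested LLM/tool calls as spans
    langwatch.setup(instrumentors=[AgnoInstrumentor()])
    experiment = langwatch.experiment.init("answer-accuracy")

    df = pd.DataFrame([{"query": "What amount was approved?", "expected": "$1,250"}])
    for index, row in experiment.loop(df.iterrows()):
        answer = run_pipeline(row["query"])
        experiment.log("value_match", index=index, passed=row["expected"] in answer)
```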

Runner: Jupyter + LangWatch SDK (uv run jupyter execute evals/eval_langwatch.ipynb)
Instrumentation: Agno + OpenTelemetry (AgnoInstrumentor via langwatch.setup())
Built-in Evaluator: ragas/faithfulness (runs server-side via OpenAI gpt-4.1-mini)
Custom Judges: 2 LLM judges (accuracy scorer + hallucination detector)
LangWatch Dashboard

The LangWatch dashboard provides a live view of experiment results, including per-case metrics, latency distributions, pass/fail rates, and RAGAS faithfulness scores — all captured from a single notebook execution.

LangWatch Experiments Dashboard — Hallucination detection experiment showing per-case results with latency, RAGAS faithfulness, pass/fail status, and grounded scores
Pipeline Flow

The notebook executes 7 experiments sequentially. Each experiment builds a DataFrame from fixture files, loops over rows with configurable parallelism, runs the RAG pipeline, and logs pass/fail metrics plus optional LLM judge scores.
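As an illustration of the fixture-to-DataFrame step, a minimal sketch follows; the field names and case contents are assumptions, not the project's real fixture schema.

```python
# Illustrative sketch: fixture cases are plain dicts in a module such as
# fixtures_accuracy.py; each experiment turns them into one DataFrame row.
# The field names below are assumptions, not the project's actual schema.
import pandas as pd

ACCURACY_CASES = [
    {"id": "acc-01", "query": "What amount was approved?", "expected_values": ["$1,250"]},
    {"id": "acc-02", "query": "When was the claim filed?", "expected_values": ["2024-03-14"]},
]

def build_dataframe(cases: list[dict]) -> pd.DataFrame:
    """One row per fixture case; columns become per-case loop variables."""
    return pd.DataFrame(cases)

df = build_dataframe(ACCURACY_CASES)
print(df.shape)  # (2, 3)
```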

    graph LR
      subgraph Fixtures
        F1["fixtures_accuracy.py"]
        F2["fixtures_safety.py"]
        F3["fixtures_tier2.py"]
      end

      subgraph Notebook["eval_langwatch.ipynb"]
        SETUP["langwatch.setup + AgnoInstrumentor"]
        PIPE["Build RAG Pipeline"]
        LOOP["experiment.loop over DataFrame"]
      end

      subgraph Pipeline["RAG Pipeline"]
        QR["Query Router"]
        RA["Retrieval Agent"]
        EV["Evidence Verifier"]
        RP["Response Planner"]
      end

      subgraph Scoring
        DET["Deterministic Checks"]
        LLM["LLM Judge"]
        RAGAS["ragas/faithfulness"]
      end

      subgraph Output
        LOG["experiment.log"]
        LW["LangWatch Dashboard"]
      end

      F1 --> LOOP
      F2 --> LOOP
      F3 --> LOOP
      SETUP --> PIPE
      PIPE --> QR
      LOOP --> QR
      QR --> RA
      RA --> EV
      EV --> RP
      RP --> DET
      RP --> LLM
      RP --> RAGAS
      DET --> LOG
      LLM --> LOG
      RAGAS --> LOG
      LOG --> LW

      classDef fixture fill:var(--mm-coral-bg),stroke:var(--mm-coral),stroke-width:2px
      classDef notebook fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:2px
      classDef pipe fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:1.5px
      classDef scoring fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:1.5px
      classDef output fill:var(--mm-amber-bg),stroke:var(--mm-amber),stroke-width:2px

      class F1,F2,F3 fixture
      class SETUP,PIPE,LOOP notebook
      class QR,RA,EV,RP pipe
      class DET,LLM,RAGAS scoring
      class LOG,LW output
  
Eval Strategy Map

Each experiment targets a specific layer of the RAG pipeline. Tier 1 evals (accuracy, hallucination, safety) catch critical regressions; Tier 2 evals (router, actions, multi-turn) test robustness and edge cases. Suites execute cheapest and fastest first.

    graph TD
      subgraph tier1["Tier 1 — Critical"]
        E3["3. Cross-Claim Safety"]
        E4["4. Prompt Injection"]
        E1["1. Answer Accuracy"]
        E2["2. Hallucination Detection"]
      end

      subgraph tier2["Tier 2 — Robustness"]
        E5["5. Router Edge Cases"]
        E6["6. Action Requests"]
        E7["7. Multi-Turn Follow-Ups"]
      end

      subgraph layers["Pipeline Layer Tested"]
        L1["Input Guardrail"]
        L2["Query Router"]
        L3["Retrieval + GraphRAG"]
        L4["Evidence Verifier"]
        L5["Response Planner"]
        L6["Action Proposals"]
        L7["Session Context"]
      end

      E4 -.-> L1
      E5 -.-> L2
      E1 -.-> L3
      E2 -.-> L4
      E3 -.-> L4
      E1 -.-> L5
      E6 -.-> L6
      E7 -.-> L7

      classDef t1 fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:2px
      classDef t2 fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:2px
      classDef layer fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:1.5px

      class E1,E2,E3,E4 t1
      class E5,E6,E7 t2
      class L1,L2,L3,L4,L5,L6,L7 layer
  
Experiment Matrix
| # | Experiment | Cases | Scoring | Threads | Metrics Logged | Notes | Fixture Source |
|---|------------|-------|---------|---------|----------------|-------|----------------|
| 1 | answer-accuracy | 10 | Det + LLM | 2 | value_match (score + pass), judge (score), pass | Critical values must appear in text AND/OR judge score ≥ 7 | fixtures_accuracy.py · ACCURACY_CASES |
| 2 | hallucination-detection | 7 | LLM + RAGAS | 2 | grounded, structural, pass, ragas/faithfulness | Custom judge + built-in RAGAS evaluator for dual signal | fixtures_accuracy.py · HALLUCINATION_CASES |
| 3 | cross-claim-safety | 8 | Deterministic | 2 | status_conflict, confidence_zero, human_review, audit_event, pass | All 4 checks must pass for blocked cases | fixtures_safety.py · CROSS_CLAIM_CASES |
| 4 | prompt-injection-safety | 7 | Deterministic | 1 | blocked, pass | Suite A: input guardrail exception. Suite B: pipeline audit events | fixtures_safety.py · ALL_INJECTION_CASES |
| 5 | router-edge-cases | 10 | Deterministic | 2 | intent (score + pass) | Intent enum must match expected. Includes 3 Spanish queries | fixtures_tier2.py · ROUTER_CASES |
| 6 | action-requests | 4 | Deterministic | 2 | has_proposal, requires_approval, action_type, has_rationale, has_name, pass | HITL gating: every action must require human approval | fixtures_tier2.py · ACTION_REQUEST_CASES |
| 7 | multi-turn-follow-ups | 6 | Deterministic | 1 | intent, contains:*, contains_any, turn_pass | Sequential execution. Fresh pipeline per conversation | fixtures_tier2.py · MULTI_TURN_CASES |
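The custom judges used by experiments 1 and 2 return structured verdicts. A defensive parse of a judge reply, in the fail-closed style such suites typically need, might look like the sketch below; the JSON schema shown is an assumption, not the project's actual judge contract.

```python
# Hedged sketch of parsing a hallucination-judge reply. The schema
# ({"grounded": ..., "unsupported_claims": [...]}) is illustrative; the
# project's actual judge output may differ.
import json

def parse_judge_verdict(raw: str) -> dict:
    """Fail closed: unparseable judge output counts as not grounded."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"grounded": False, "unsupported_claims": ["unparseable judge output"]}
    return {
        "grounded": bool(verdict.get("grounded", False)),
        "unsupported_claims": list(verdict.get("unsupported_claims", [])),
    }

print(parse_judge_verdict('{"grounded": true, "unsupported_claims": []}')["grounded"])  # True
```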
Scoring Strategy
Three scoring layers, applied per experiment:

1. Deterministic checks — Python assertions on structured RagTeamResult fields (evidence_status, confidence, intent, audit_events). Zero cost, instant, reproducible. Used by all 7 experiments.

2. Custom LLM judges — Two domain-specific judges built with Agno Agent: an accuracy scorer (1-10 scale against ground-truth dollar values) and a hallucination detector (JSON-structured grounding assessment). Used by experiments 1 and 2.

3. Built-in RAGAS evaluator — ragas/faithfulness runs server-side on LangWatch via experiment.evaluate(). Provides a standardized faithfulness score as a dual signal alongside the custom hallucination judge. Used by experiment 2.
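A deterministic check of the first kind might look like the following sketch. The RagTeamResult stand-in uses the field names listed above; the specific expected values for a blocked cross-claim case (experiment 3) are illustrative assumptions.

```python
# Sketch of a deterministic scorer for a blocked cross-claim case.
# Field names follow the text; expected values are assumptions.
from dataclasses import dataclass, field

@dataclass
class RagTeamResult:  # stand-in for the pipeline's structured output
    evidence_status: str
    confidence: float
    intent: str
    audit_events: list[str] = field(default_factory=list)

def score_blocked_case(result: RagTeamResult) -> dict:
    """All four checks must pass for a blocked cross-claim case."""
    checks = {
        "status_conflict": result.evidence_status == "conflict",
        "confidence_zero": result.confidence == 0.0,
        "human_review": result.intent == "human_review",
        "audit_event": any("cross_claim" in event for event in result.audit_events),
    }
    checks["pass"] = all(checks.values())
    return checks
```

Because the checks are plain comparisons on structured fields, they cost nothing to run and give identical results on every execution, which is why all seven experiments include at least this layer.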
Execution Order
Execution Order: Safety (det) → Accuracy (LLM) → Robustness (det), cheapest first
Total Est. Cost: ~$1.00 (52 cases, ~90K tokens)
Parallelism: threads=2 (1 for the sequential suites); reduced from 4 to avoid file-descriptor exhaustion
CI/CD Ready: Yes (jupyter execute returns a non-zero exit code on failures)
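The "non-zero exit on failures" behavior can be implemented with a final notebook cell along these lines; the aggregation dict is hypothetical, but the pattern (raise at the end so jupyter execute exits non-zero) is what makes the run CI-gateable.

```python
# Sketch of a CI gate as the notebook's last cell: raise if any suite
# failed, so `jupyter execute` exits non-zero and fails the pipeline.
# suite_passed is a hypothetical per-suite aggregation dict.
def gate(suite_passed: dict[str, bool]) -> str:
    failed = [name for name, ok in suite_passed.items() if not ok]
    if failed:
        raise SystemExit(f"eval suites failed: {', '.join(failed)}")
    return "all suites passed"
```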