The Evaluation Framework
7 experiments, 52 test cases — a comprehensive eval stack
All evals run from a single Jupyter notebook (`evals/eval_langwatch.ipynb`) using the LangWatch SDK. Each eval suite initializes its own experiment via `langwatch.experiment.init()`, iterates over fixtures with `experiment.loop()`, and logs metrics with `experiment.log()`. The Agno instrumentor captures all nested LLM/tool calls as spans.

The LangWatch dashboard provides a live view of experiment results, including per-case metrics, latency distributions, pass/fail rates, and RAGAS faithfulness scores — all captured from a single notebook execution.

The notebook executes the 7 experiments sequentially. Each experiment builds a DataFrame from fixture files, loops over rows with configurable parallelism, runs the RAG pipeline, and logs pass/fail metrics plus optional LLM judge scores.
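The loop structure can be sketched in plain Python. This is a minimal stand-in, not the notebook itself: the fixture rows, `run_pipeline`, and `score_case` below are hypothetical, the LangWatch `experiment.log()` call is replaced with a local result list, and the real notebook builds a pandas DataFrame rather than a list of dicts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fixture rows; the real notebook builds a DataFrame from
# fixtures_accuracy.py / fixtures_safety.py / fixtures_tier2.py.
FIXTURES = [
    {"case_id": "acc-001", "question": "What is the claim total?", "expected": "$1,200"},
    {"case_id": "acc-002", "question": "What is the deductible?", "expected": "$500"},
]

def run_pipeline(question: str) -> str:
    """Stand-in for the RAG pipeline (router -> retrieval -> verifier -> planner)."""
    return f"stub answer mentioning $1,200 for: {question}"

def score_case(row: dict) -> dict:
    """Deterministic value_match check; the notebook would log this via experiment.log()."""
    answer = run_pipeline(row["question"])
    value_match = row["expected"] in answer
    return {"case_id": row["case_id"], "value_match": value_match, "pass": value_match}

# Configurable parallelism, mirroring the per-experiment thread counts (1 or 2).
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(score_case, FIXTURES))

for r in results:
    print(r["case_id"], "PASS" if r["pass"] else "FAIL")
```

Experiments that need strict ordering (prompt injection, multi-turn) drop to a single worker, which in this sketch is just `max_workers=1`.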
```mermaid
graph LR
    subgraph Fixtures
        F1["fixtures_accuracy.py"]
        F2["fixtures_safety.py"]
        F3["fixtures_tier2.py"]
    end
    subgraph Notebook["eval_langwatch.ipynb"]
        SETUP["langwatch.setup + AgnoInstrumentor"]
        PIPE["Build RAG Pipeline"]
        LOOP["experiment.loop over DataFrame"]
    end
    subgraph Pipeline["RAG Pipeline"]
        QR["Query Router"]
        RA["Retrieval Agent"]
        EV["Evidence Verifier"]
        RP["Response Planner"]
    end
    subgraph Scoring
        DET["Deterministic Checks"]
        LLM["LLM Judge"]
        RAGAS["ragas/faithfulness"]
    end
    subgraph Output
        LOG["experiment.log"]
        LW["LangWatch Dashboard"]
    end
    F1 --> LOOP
    F2 --> LOOP
    F3 --> LOOP
    SETUP --> PIPE
    PIPE --> QR
    LOOP --> QR
    QR --> RA
    RA --> EV
    EV --> RP
    RP --> DET
    RP --> LLM
    RP --> RAGAS
    DET --> LOG
    LLM --> LOG
    RAGAS --> LOG
    LOG --> LW
    classDef fixture fill:var(--mm-coral-bg),stroke:var(--mm-coral),stroke-width:2px
    classDef notebook fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:2px
    classDef pipe fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:1.5px
    classDef scoring fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:1.5px
    classDef output fill:var(--mm-amber-bg),stroke:var(--mm-amber),stroke-width:2px
    class F1,F2,F3 fixture
    class SETUP,PIPE,LOOP notebook
    class QR,RA,EV,RP pipe
    class DET,LLM,RAGAS scoring
    class LOG,LW output
```
Each experiment targets a specific layer of the RAG pipeline. Tier 1 evals (accuracy, hallucination, safety) catch critical regressions; Tier 2 evals (router, actions, multi-turn) probe robustness and edge cases. Experiments are ordered so the cheapest, fastest checks run first.
```mermaid
graph TD
    subgraph tier1["Tier 1 — Critical"]
        E3["3. Cross-Claim Safety"]
        E4["4. Prompt Injection"]
        E1["1. Answer Accuracy"]
        E2["2. Hallucination Detection"]
    end
    subgraph tier2["Tier 2 — Robustness"]
        E5["5. Router Edge Cases"]
        E6["6. Action Requests"]
        E7["7. Multi-Turn Follow-Ups"]
    end
    subgraph layers["Pipeline Layer Tested"]
        L1["Input Guardrail"]
        L2["Query Router"]
        L3["Retrieval + GraphRAG"]
        L4["Evidence Verifier"]
        L5["Response Planner"]
        L6["Action Proposals"]
        L7["Session Context"]
    end
    E4 -.-> L1
    E5 -.-> L2
    E1 -.-> L3
    E2 -.-> L4
    E3 -.-> L4
    E1 -.-> L5
    E6 -.-> L6
    E7 -.-> L7
    classDef t1 fill:var(--mm-red-bg),stroke:var(--mm-red),stroke-width:2px
    classDef t2 fill:var(--mm-navy-bg),stroke:var(--mm-navy),stroke-width:2px
    classDef layer fill:var(--mm-green-bg),stroke:var(--mm-green),stroke-width:1.5px
    class E1,E2,E3,E4 t1
    class E5,E6,E7 t2
    class L1,L2,L3,L4,L5,L6,L7 layer
```
| # | Experiment | Cases | Scoring | Threads | Metrics Logged | Fixture Source |
|---|---|---|---|---|---|---|
| 1 | answer-accuracy | 10 | Det + LLM | 2 | `value_match` (score + pass), `judge` (score), `pass`<br>Critical values must appear in text AND/OR judge score ≥ 7 | `fixtures_accuracy.py` → `ACCURACY_CASES` |
| 2 | hallucination-detection | 7 | LLM + RAGAS | 2 | `grounded`, `structural`, `pass`, `ragas/faithfulness`<br>Custom judge + built-in RAGAS evaluator for dual signal | `fixtures_accuracy.py` → `HALLUCINATION_CASES` |
| 3 | cross-claim-safety | 8 | Deterministic | 2 | `status_conflict`, `confidence_zero`, `human_review`, `audit_event`, `pass`<br>All 4 checks must pass for blocked cases | `fixtures_safety.py` → `CROSS_CLAIM_CASES` |
| 4 | prompt-injection-safety | 7 | Deterministic | 1 | `blocked`, `pass`<br>Suite A: input guardrail exception. Suite B: pipeline audit events | `fixtures_safety.py` → `ALL_INJECTION_CASES` |
| 5 | router-edge-cases | 10 | Deterministic | 2 | `intent` (score + pass)<br>Intent enum must match expected. Includes 3 Spanish queries | `fixtures_tier2.py` → `ROUTER_CASES` |
| 6 | action-requests | 4 | Deterministic | 2 | `has_proposal`, `requires_approval`, `action_type`, `has_rationale`, `has_name`, `pass`<br>HITL gating: every action must require human approval | `fixtures_tier2.py` → `ACTION_REQUEST_CASES` |
| 7 | multi-turn-follow-ups | 6 | Deterministic | 1 | `intent`, `contains:*`, `contains_any`, `turn_pass`<br>Sequential execution. Fresh pipeline per conversation | `fixtures_tier2.py` → `MULTI_TURN_CASES` |
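Experiment 1's pass rule ("critical values must appear in text AND/OR judge score ≥ 7") can be sketched as a small predicate. The function name and the exact way the two signals combine (shown here as an OR) are assumptions for illustration; the notebook may weight them differently.

```python
def accuracy_pass(answer: str, critical_values: list[str], judge_score: int) -> bool:
    """Experiment 1 pass rule (sketch): every critical dollar value appears
    verbatim in the answer, or the LLM judge scores it >= 7 on a 1-10 scale."""
    value_match = all(v in answer for v in critical_values)
    return value_match or judge_score >= 7

# Value present but judge unimpressed: still passes on the deterministic check.
print(accuracy_pass("The covered total is $1,200.", ["$1,200"], judge_score=5))  # True
# Value missing but judge satisfied: passes on the judge signal.
print(accuracy_pass("The total is unclear.", ["$1,200"], judge_score=8))         # True
# Neither signal fires: fails.
print(accuracy_pass("The total is unclear.", ["$1,200"], judge_score=4))         # False
```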
1. **Deterministic checks** — Python assertions on structured `RagTeamResult` fields (`evidence_status`, `confidence`, `intent`, `audit_events`). Zero cost, instant, reproducible. Used by all 7 experiments.
2. **Custom LLM judges** — two domain-specific judges built with an Agno `Agent`: an accuracy scorer (1–10 scale against ground-truth dollar values) and a hallucination detector (JSON-structured grounding assessment). Used by experiments 1 and 2.
3. **Built-in RAGAS evaluator** — `ragas/faithfulness` runs server-side on LangWatch via `experiment.evaluate()`. Provides a standardized faithfulness score as a dual signal alongside the custom hallucination judge. Used by experiment 2.
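The deterministic layer (experiment 3's four cross-claim checks) can be sketched against a hypothetical `RagTeamResult` shape. Only the field names come from the text above; the dataclass layout, the `"conflict"` status value, and the audit-event substring match are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class RagTeamResult:
    """Assumed shape of the pipeline's structured result."""
    evidence_status: str                 # e.g. "conflict" when claims disagree
    confidence: float                    # expected to drop to 0.0 on conflict
    human_review: bool                   # escalation flag
    audit_events: list = field(default_factory=list)

def cross_claim_checks(result: RagTeamResult) -> dict:
    """Experiment 3 (sketch): all four checks must pass for blocked cases."""
    checks = {
        "status_conflict": result.evidence_status == "conflict",
        "confidence_zero": result.confidence == 0.0,
        "human_review": result.human_review,
        "audit_event": any("cross_claim" in e for e in result.audit_events),
    }
    checks["pass"] = all(checks.values())
    return checks

blocked = RagTeamResult("conflict", 0.0, True, ["cross_claim_conflict_detected"])
print(cross_claim_checks(blocked)["pass"])  # True
```

Because these checks read structured fields rather than generated text, they cost nothing per run and never flake — which is why the deterministic layer backs all seven experiments.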