Testing AI Agents: Automated Evaluation for Chatbots & RAG Systems

About This Session

AI agents, chatbots, and RAG systems are easy to prototype but difficult to test reliably. Small changes to prompts, models, retrieval sources, or system instructions can silently change behavior, and classic assertions (string matching, snapshots) often fail to capture what actually matters: correctness, relevance, grounded answers, and consistent multi-turn dialogue. In this talk, we’ll start with the common testing problems in real projects: “it worked yesterday”, hidden regressions, evaluation noise, and the challenge of aligning developers and stakeholders on what “good” means. Then we’ll explore practical testing approaches with evaluation frameworks like DeepEval: how to validate responses beyond keyword matching, how to structure test cases for both chatbots and retrieval-based assistants, how to define pragmatic quality gates, and how to run these checks continuously in CI. As a side topic, we’ll show how BDD/Gherkin can wrap these evaluations into human-readable scenarios (Given–When–Then), making expectations reviewable by non-developers while keeping the actual validation powered by automated evaluation metrics. You’ll leave with a reusable blueprint for introducing automated AI evaluation into your development workflow, from local runs to CI pipelines with actionable reports.