The Art and Science Behind Evaluating AI Agents at Scale

About This Session

Evaluating AI agents is a mix of science and art. Working with subject matter experts (SMEs) is more important than ever, and new methods and best practices are emerging to evaluate these systems at scale. In this talk we will discuss a case study of a production agent used across an entire company. We will cover live evals, how to build a golden dataset, how to collaborate with SMEs, and what worked and what didn't over a six-month project. We will share some of the best practices we found to work well in production after investing hundreds of hours analyzing evals, building reports, and iterating. This talk is not just about theory: we will use a real case study and share the information you need to iterate fast and build evals that matter for your use case!