Workshop

The Art and Science behind evaluating AI Agents at scale

with Alfonso Graziano

AI Coding Assistants
AI Models
Retrieval-Augmented Generation (RAG)

Free for All Attendees · Seats Limited

Workshops are included with your event ticket at no extra cost. Seats fill up fast — registration opens through the official event app approximately one week before the event. Follow app notifications to know the moment sign-ups go live.


Starts

Wed 8 Jul, 16:00

Ends

Wed 8 Jul, 18:00

About This Workshop

Evaluating AI agents is a mix of science and art, and working with subject matter experts (SMEs) is more important than ever. New methods and best practices are emerging for evaluating these systems at scale. In this talk we will discuss a case study of a production agent used across an entire company: live evals, how to build a golden dataset, how to collaborate with SMEs, and what worked and what didn't over a six-month project. We will share the best practices we found work well in production, after investing hundreds of hours analyzing evals, building reports, and iterating. This talk is not just about theory: we will walk through a real case study and share everything you need to iterate fast and build evals that matter for your use case!

More to Explore

More Workshops

More hands-on sessions waiting — find the one that fits your stack.


Prompt Engineering Hands-on

Teaching Kubernetes Security in your Cluster

The Decisions Developers Make Without Noticing – And How to Make Them Better

Rust on Robots: Hands-on Embedded Rust on STM32

From COBOL to Java: How Developers Transition 60 Years of Legacy into Modern Java Services

Agents of Football: Build AI Agents That Compete in Live Football Matches

What do we need to deliver high quality products?

Deep Dive: Mastering Agents, MCP and other hypes

From messy addresses to production-ready data: Build a location enrichment pipeline on AWS with HERE

Beyond the Thor’s Hammer: Pragmatic Agentic AI with Caching, Reuse, and Cost Guardrails

Engineering Customer Journey Analytics at Scale: Lessons from Germany's Largest Banking Platform

Trust code you didn't write: From code review to confidence

Getting started with Hexagonal Architecture

Build an agentic full-stack tabletop game master application

MCP is all you need to make an AI-agent consume your RESTful API

Build a Production-Ready AI Agent in 90 Minutes

How to Pitch Innovation to Your CEO

How does a Java agent work? Building a Java agent from scratch.

Adopting GitOps for microservices delivery via Argo CD

Build a data-intensive dashboard (that actually works)

Ideate & Strategize: Defining Your Football for Good Prototype

Zero to Binary: Building a Production-Ready AI Agent in Go

From Vector Search to Better Understanding: How Hybrid RAG Improves Answers, Not Just Matches

Build a Multi-Agent Marathon Planner with ADK and A2A

Teaching AI to Code in Every Language with NVIDIA NeMo

Defending the Modern Supply Chain: Hands-On Vulnerability Remediation

Hands-on AMQP with LavinMQ: Decoupling Services with Message Queues

You shall (not) pass!? An Introduction to Testing authentication

Building Modern Distributed Systems using Less AI Tokens

Give your agent a wallet: Build an AI agent with stablecoin payment capabilities

Building a Better Tomorrow: Tips and Tricks for Docker Builds

HR Workshop - Vibe:athon: Where HR Goes to Build 1/4

GitHub Copilot: From Zero to Hero

Building AI apps with the Google ecosystem

Accelerating AI Inference at Scale: A Deep Dive Into NVIDIA Dynamo on Kubernetes

The Bright Data Build-Off

Building & scaling custom serverless AI

The best SDLC is the one you build yourself: Why orchestration changes everything

HR Workshop - Vibe:athon: Where HR Goes to Build 2/4

Level Up Your Automation with GitHub Agentic Workflows

Developing Crash-Proof Java Applications

Faster Together: Train and Deploy a Speculative Decoding Model for Low-Latency LLM Inference

Managing Sovereign AI Infrastructure: MLOps and LLMOps in Highly Regulated Banking IT

Spec-Driven Development with Agentic Skills

Ducks, Sensors & Agents: Hands-On Edge AI with Arduino UNO Q

Shall we play a Game? LLM Security in Practice

HR Workshop - The Human Advantage: Storytelling in the Age of AI

AI That Acts: Orchestrating Agents in Modern Developer Workflows (securely)

Agents at Scale: Multi-Agent Architecture with A2A Protocol on Agent Runtime and ADK Integration

Always-On Autonomous AI Agents: Exploring the OpenClaw Abstraction

Vibe³ Cross-Platform Apps with Lynx: A TikTok Hackathon

Exploring Server Side Rendering

From Hallucination to Justification: Hands-On Explainability for LLMs

Bridging LLMs and Systems: Practical Automation with MCP Tools and Function Calls

HR Workshop - Vibe:athon: Where HR Goes to Build 3/4

Teaching GitHub Copilot COBOL: A Practical Guide to Agentic AI Legacy Modernization

Vibe Coding with Postgres: From Zero to Prod in Your IDE

Compress, Cut, and Distill: The Latest Gen AI Model Compression Techniques in Practice

How to mess up JWTs - a practitioner's guide

Create Your Own Role-Playing Game with Agentic AI using Strands Agents

Build Agents That Can Pay with x402: From Your Laptop to a Live Network

Let agents buy your API: Build a payment-gated service with x402

GenAI in Testing: Using GitHub Copilot to Accelerate Quality Without Losing Trust

HR Workshop - The Human Advantage: Storytelling in the Age of AI

Building Multi-Agent AI Systems with MAF: From Copilot to Orchestrated Agents

Agents That Own Their Inference: Building Production AI Agents on Dedicated GPUs

Generate Synthetic Data for Physical AI with NVIDIA Cosmos World Foundation Models

Hack Me, Bro: An Antifragile AI Battle Arena

Build a Multi-Channel AI Agent

Never say refactoring is impossible

HR Workshop - Vibe:athon: Where HR Goes to Build 4/4

Context > Models: How to make your agents truly intelligent