About This Session
Last year, while preparing a talk, I was looking for a research paper I had read in the past but could no longer find, so I asked an AI for help. Instead of the right paper, it generated an impressive list of academic citations. They looked convincing, but when I checked, most of them didn’t exist. Although I already knew about hallucinations, the mathematician in me immediately wanted to understand why this happens systematically rather than just occasionally, which led me into an intensive investigation of how these models operate.
AI models like LLMs don’t “think” like humans. They generate outputs based on probabilities, producing statistically likely sequences of words, which means they can sound confident while being completely wrong. These moments, known as hallucinations, are inherent to how generative AI works, and if left undetected they can result in false information being delivered with absolute confidence. Unlike traditional software, where a defect is a deviation from expected results, hallucinations are an expected outcome of the model’s design. That makes me ask: how do we as testers detect, measure, and manage risks that are built into the system itself?
In this session, we’ll explore hallucinations from an intuitive, mathematical perspective, without heavy formulas, so anyone can understand why they occur. Then we’ll look at practical methods for evaluating AI outputs, since conventional testing approaches don’t apply here. You’ll learn how to test for AI hallucinations, calculate the confidence and risk of AI outputs, and explain your findings effectively. Practical takeaways include testing on ground-truth data, using adversarial prompts, and verifying outputs through cross-validation with external sources. Although it’s mathematically not possible to avoid hallucinations completely, these methods allow you to estimate their rate of occurrence and reduce their impact.
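To make "estimate the rate of occurrence" concrete, here is a minimal sketch of one possible approach, not necessarily the method presented in the session. It assumes you already have model answers and a small ground-truth set keyed by question, and it treats any answer that fails a normalized exact match as a hallucination, which is a deliberately crude criterion. The function names and the sample data are hypothetical; the Wilson score interval is a standard way to put an uncertainty range around a proportion measured on a small sample.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimate_hallucination_rate(model_answers, ground_truth):
    """Return (observed rate, lower bound, upper bound).

    Assumption: an answer counts as a hallucination when it does not match
    the ground truth after trimming whitespace and lowercasing. Real
    evaluations usually need a more tolerant comparison.
    """
    wrong = 0
    for question, expected in ground_truth.items():
        answer = model_answers.get(question, "")
        if answer.strip().lower() != expected.strip().lower():
            wrong += 1
    total = len(ground_truth)
    low, high = wilson_interval(wrong, total)
    return wrong / total, low, high


# Hypothetical sample data, for illustration only.
ground_truth = {
    "capital of France": "Paris",
    "2 + 2": "4",
    "chemical symbol for gold": "Au",
    "largest planet": "Jupiter",
}
model_answers = {
    "capital of France": "Paris",
    "2 + 2": "4",
    "chemical symbol for gold": "Ag",  # wrong on purpose
    "largest planet": "jupiter",
}

rate, low, high = estimate_hallucination_rate(model_answers, ground_truth)
print(f"observed rate: {rate:.2f}, 95% interval: [{low:.2f}, {high:.2f}]")
```

With only four questions the interval is very wide, which is the point of reporting it: the observed rate on a small ground-truth set says little until the sample grows.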
Topics
- AI Models
- Generative AI (GenAI)
- Reliability
- Testing