
Fuzzing in the GenAI Era - AI System Evaluation and Quality Assurance

AI Engineer via YouTube

Overview

Explore a comprehensive 19-minute conference talk that redefines AI evaluation through the lens of fuzzing methodology, moving beyond traditional static dataset approaches to dynamic stress testing of AI systems. Learn how to identify and address the "last mile problem" in AI applications, where systems appear to work well in controlled environments but fail when exposed to real-world user interactions and edge cases. Discover the brittleness inherent in GenAI applications through concrete examples of chatbot failures and understand why standard evaluation methods fall short in capturing these vulnerabilities.

Master the concept of "Haizing" - a systematic approach to simulating unexpected user inputs at scale to uncover corner cases before deployment. Examine two critical components of robust AI evaluation: developing quality metrics that accurately capture human judgment criteria and generating diverse, representative stimuli that can expose potential system failures. Understand how to scale evaluation processes using AI agents as judges, balancing accuracy with latency considerations, and explore reinforcement learning techniques for tuning evaluation systems.

Differentiate between fuzzing and adversarial testing approaches in AI contexts, and see how simulation can be framed as prompt optimization. Analyze real-world case studies, including implementations at major European and Fortune 500 banks, demonstrating practical applications of these evaluation methodologies for voice agents and AI applications in financial services.
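The core loop the talk describes - mutate realistic inputs at scale, run them through the system, and use a judge to flag failures - can be sketched in plain Python. This is a hedged illustration, not the speaker's implementation: `mutate`, `haize`, the toy `bot`, and the toy `judge` are all hypothetical names invented here, and a real setup would call an LLM system and an LLM-based judge instead of these stubs.

```python
import random

def mutate(prompt, rng):
    """Apply a simple perturbation to simulate unexpected user input."""
    ops = [
        lambda s: s.upper(),                # shouting / odd casing
        lambda s: s + " " + s.split()[-1],  # word repetition
        lambda s: s.replace(" ", "  "),     # irregular whitespace
        lambda s: "pls " + s,               # informal register
    ]
    return rng.choice(ops)(prompt)

def haize(system, judge, seeds, n_cases=20, seed=0):
    """Fuzz `system` with mutated seed prompts; return cases the judge flags."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_cases):
        stimulus = mutate(rng.choice(seeds), rng)
        response = system(stimulus)
        if not judge(stimulus, response):
            failures.append((stimulus, response))
    return failures

# Toy system under test: a bot that only matches lowercase "refund".
bot = lambda q: "refund issued" if "refund" in q else "sorry?"

# Toy judge encoding one quality criterion: if the user asks about a
# refund (in any casing), the response must acknowledge the refund.
judge = lambda q, r: not ("refund" in q.lower()) or ("refund" in r)

fails = haize(bot, judge, ["i want a refund", "cancel my order"])
```

The uppercase mutation exposes the bot's case-sensitivity bug - exactly the kind of corner case a static test set with well-formed inputs would miss.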

Syllabus

00:00 Introduction to Haizing
01:16 The "Last Mile Problem" in AI
02:47 The Brittleness of GenAI Applications
03:54 Examples of Brittle Chatbots
04:29 Inadequacy of Standard Evaluation Methods
06:09 Haizing: Simulating the Last Mile
08:43 Scaling Evaluation with Agents as Judges
09:29 Verdict: Accuracy vs. Latency
11:47 Scaling Evaluation with RL-Tuned Judges
14:06 Fuzzing vs. Adversarial Testing in AI
14:37 Simulation as Prompt Optimization
16:23 Case Study: Haizing a Major European Bank's AI App
17:05 Case Study: Haizing a F500 Bank's Voice Agents
17:46 Case Study: Scaling Voice Agent Evals with Verdict

Taught by

AI Engineer

