Overview
Learn practical strategies for testing and evaluating LLM-driven applications in this 20-minute conference talk from Code BEAM Europe 2025. Explore the challenges developers face when integrating large language models into products, particularly the problem of "confident nonsense" where AI systems provide fluent but incorrect or potentially harmful responses. Discover evaluation techniques ranging from basic BLEU and ROUGE metrics to more sophisticated aspect-based evaluation and retrieval scoring methods. Understand what metrics to measure, when to trust different evaluation approaches, and how to implement testing strategies that can catch problematic AI responses before they reach production users. Gain insights into building robust validation systems for applications that generate human language, moving beyond traditional unit testing to address the unique challenges of LLM integration.
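As a rough illustration of the kind of baseline metric the talk covers, the sketch below computes ROUGE-1 recall (the fraction of reference unigrams that appear in a model's answer). This is not code from the talk, just a minimal, dependency-free example of how such an overlap score can flag responses that diverge heavily from a trusted reference; the function name and threshold are illustrative assumptions.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams found in the candidate.

    Illustrative sketch only -- production use would tokenize properly and
    combine this with precision (F1), or use a maintained metrics library.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return overlap / len(ref_tokens)


# Hypothetical guardrail: flag answers whose overlap with a known-good
# reference falls below an arbitrary threshold chosen for this example.
reference = "the refund policy allows returns within 30 days"
answer = "returns are accepted within 30 days under the refund policy"
score = rouge1_recall(reference, answer)
print(f"ROUGE-1 recall: {score:.2f}, flagged: {score < 0.5}")
```

Note that high lexical overlap does not guarantee correctness (a fluent wrong answer can still score well), which is why the talk contrasts such surface metrics with aspect-based evaluation and retrieval scoring.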
Syllabus
Detecting Confident Nonsense: Testing LLM-Driven Apps - Hernan Rivas Acosta | Code BEAM Europe 2025
Taught by
Code Sync