Overview
Learn comprehensive approaches to testing, evaluating, and debugging generative AI and agentic applications in this 31-minute conference talk from the Linux Foundation. Explore the unique challenges that arise when working with generative AI technologies, including LLMs, RAG systems, vector databases, and knowledge graphs, particularly when enhanced with orchestration frameworks like LangChain and LangGraph. Discover how natural language interfaces, non-deterministic outputs, and complex hierarchies of autonomous agents create new testing paradigms that require specialized approaches. Examine open source tools including Langfuse, TruLens, Ragas, and Opik for comprehensive testing and evaluation workflows. Understand the underlying architecture of these testing tools and learn about key metrics for measuring correctness and relevance in AI applications. Gain insights from real-world experience implementing these testing strategies at scale, including challenges overcome, lessons learned, and best practices for integrating with existing testing and observability infrastructure. Master practical techniques for troubleshooting complex agentic systems and ensuring reliable performance in production environments.
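To make the idea of correctness and relevance metrics concrete, here is a minimal, self-contained sketch in the spirit of the evaluation workflows the talk covers. It uses simple token-overlap heuristics; the names `relevance`, `correctness`, and `evaluate` are illustrative inventions, not APIs from Langfuse, TruLens, Ragas, or Opik, which provide far richer (often model-graded) metrics.

```python
# Toy evaluation harness for non-deterministic LLM outputs.
# All functions here are illustrative, not from any evaluation library.

def relevance(answer: str, question: str) -> float:
    """Fraction of question terms that appear in the answer (0.0-1.0)."""
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    return len(q_terms & a_terms) / len(q_terms) if q_terms else 0.0

def correctness(answer: str, reference: str) -> float:
    """Jaccard overlap between the answer and a reference answer."""
    a = set(answer.lower().split())
    r = set(reference.lower().split())
    return len(a & r) / len(a | r) if a | r else 0.0

def evaluate(samples):
    """Score a batch of (question, answer, reference) triples."""
    return [
        {"relevance": relevance(ans, q), "correctness": correctness(ans, ref)}
        for q, ans, ref in samples
    ]

scores = evaluate([
    ("what is a vector database",
     "a vector database stores embeddings",
     "a vector database stores and searches embeddings"),
])
print(scores)
```

Real-world tools replace these overlap heuristics with semantic similarity or LLM-as-judge scoring, but the overall shape — batch a dataset of question/answer/reference triples through named metrics and aggregate the scores — is the same.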
Syllabus
Testing, Evaluating and Debugging Generative AI and Agentic Ap... Kalyan Kolachala & Vaishali Shetty
Taught by
Linux Foundation