Benchmarking GenAI Like a Pro - Scaling Experiments, Predicting Performance, and Keeping Your Sanity

Learn how to build a scalable benchmarking system for generative AI models in this conference talk from the Linux Foundation's Open Source Summit. Discover how to move from scattered, ad hoc experiments to a structured system that runs tens of thousands of GenAI experiments across different models, hardware configurations, and tuning techniques with speed and repeatability. Explore the technical architecture: Ray for scaling, Pydantic for schema validation, MySQL for data persistence, and a kubectl-like CLI. Learn techniques for exploring and optimizing massive configuration spaces, visualizing results with Apache Superset, and using predictive models to reach insights faster without brute-force experimentation. Examine key deployment questions: what it costs to serve a model, which hardware a model is compatible with, and how to optimize fine-tuning performance across different GPU configurations, including H100s.
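
As a rough illustration of what "exploring a configuration space" can look like, here is a minimal sketch that validates experiment configurations and expands a small grid of options into individual experiments. It is not the speakers' code. To stay dependency-free it uses a standard-library dataclass as a stand-in for a Pydantic model and builds a plain list where a real system might dispatch each configuration to Ray. The class, the function, every field name, and every value other than "H100" are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExperimentConfig:
    """One benchmark run. Hypothetical schema; a Pydantic model could play this role."""
    model: str
    gpu: str
    num_gpus: int
    tuning: str
    batch_size: int

    def __post_init__(self):
        # Reject obviously invalid configurations before any compute is spent.
        if self.num_gpus < 1:
            raise ValueError("num_gpus must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")


def expand_grid(space: dict[str, list]) -> list[ExperimentConfig]:
    """Return one validated config per combination of the listed options."""
    keys = list(space)
    return [
        ExperimentConfig(**dict(zip(keys, values)))
        for values in product(*(space[k] for k in keys))
    ]


# Hypothetical search space: placeholder names, not values from the talk.
space = {
    "model": ["model-a", "model-b"],
    "gpu": ["H100", "other-gpu"],
    "num_gpus": [1, 8],
    "tuning": ["technique-a", "technique-b"],
    "batch_size": [4, 16],
}

configs = expand_grid(space)
print(len(configs))  # 2 * 2 * 2 * 2 * 2 = 32 experiments
```

Five parameters with two options each already yield 32 runs, and every added parameter or option multiplies that count. That growth is why a full grid quickly becomes impractical and why predicting results for untested configurations is attractive.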