A Practical Guide to Benchmarking AI and GPU Workloads in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This conference talk provides a practical guide to benchmarking AI and GPU workloads in Kubernetes environments. Learn how to improve GPU resource efficiency and AI workload performance through effective benchmarking. The talk covers how to set up, configure, and run a range of GPU and AI benchmarks in Kubernetes across use cases including model serving, model training, and GPU stress testing. It introduces tools such as NVIDIA Triton Inference Server, fmperf for benchmarking LLM serving performance, MLPerf for comparing machine learning systems, and utilities like GPUStressTest, gpu-burn, and CUDA benchmarks. Step-by-step demonstrations cover GPU monitoring and load-generation tools, building practical skills for running benchmarks on GPUs in Kubernetes and for using existing tools to fine-tune GPU resource and workload management for better performance and efficiency.
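To illustrate the kind of setup the talk walks through, a GPU stress test such as gpu-burn can be run as a one-off Kubernetes Pod that requests a GPU via the NVIDIA device plugin. The manifest below is a minimal sketch, not the talk's exact configuration; the container image name is a placeholder, and the burn duration is an arbitrary example:

```yaml
# Minimal sketch: run gpu-burn as a one-off benchmark Pod on a GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burn-benchmark
spec:
  restartPolicy: Never
  containers:
  - name: gpu-burn
    # Placeholder image; substitute an image that has gpu-burn built in.
    image: example.com/gpu-burn:latest
    args: ["60"]          # run the stress test for 60 seconds
    resources:
      limits:
        nvidia.com/gpu: 1 # request one GPU from the NVIDIA device plugin
```

After `kubectl apply -f` on a cluster with GPU nodes, the Pod's logs (`kubectl logs gpu-burn-benchmark`) report the measured throughput, which can be compared across node types or driver versions.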
Syllabus
A Practical Guide To Benchmarking AI and GPU Workloads in Kubernetes - Yuan Chen & Chen Wang
Taught by
CNCF [Cloud Native Computing Foundation]