Empowering ML Workloads With Kubeflow: JAX Distributed Training and LLM Hyperparameter Optimization
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
This 20-minute conference talk from CNCF explores how Kubeflow can enhance machine learning workloads through JAX distributed training and large language model (LLM) hyperparameter optimization. Presented by Hezhi Xie and Sandipan Panda, the session addresses the growing need for scalable ML solutions in distributed environments. Learn about recent Kubeflow innovations that improve distributed training on Kubernetes with JAX and automate hyperparameter optimization for LLMs. Discover how JAX's high-performance capabilities can be integrated with Kubernetes for efficient scaling, and how the speakers extended Kubeflow to support distributed JAX workloads. The presentation also covers a high-level API that automates the previously manual and time-intensive process of LLM hyperparameter optimization. Together, these advancements make complex, resource-intensive training more efficient and position Kubeflow as a powerful platform for modern AI development workflows.
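To make the "distributed JAX on Kubernetes" idea concrete, the sketch below shows what a Kubeflow Training Operator manifest for a JAX job could look like. This is a hedged illustration, not taken from the talk: the `JAXJob` kind and `jaxReplicaSpecs` field follow the Training Operator's convention for other `*Job` resources, but the exact schema may differ from what the speakers built, and the job name, image, and script are placeholders.

```yaml
# Hypothetical JAXJob manifest (schema assumed from the Kubeflow
# Training Operator's PyTorchJob/TFJob conventions; verify against
# the operator version you run).
apiVersion: kubeflow.org/v1
kind: JAXJob
metadata:
  name: jax-distributed-example   # placeholder name
spec:
  jaxReplicaSpecs:
    Worker:
      replicas: 2                 # two coordinated JAX processes
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: jax
              image: example.io/jax-train:latest   # placeholder image
              command: ["python", "train.py"]      # placeholder entrypoint
```

Inside such a `train.py`, each worker would typically call `jax.distributed.initialize()` early on; that JAX API can pick up the coordinator address, process count, and process index from environment variables, which is the kind of wiring a training operator would inject into each replica.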
Syllabus
Empowering ML Workloads With Kubeflow: JAX Distributed Training and LLM Hyperparameter Optimization - Hezhi Xie & Sandipan Panda
Taught by
CNCF [Cloud Native Computing Foundation]