Why Is ML on Kubernetes Hard? Defining How ML and Software Diverge
MLOps World: Machine Learning in Production via YouTube
Overview
Explore the fundamental challenges of deploying machine learning workloads on Kubernetes in this 29-minute conference talk from MLOps World. Donny Greenberg and Paul Yang from Runhouse examine why ML engineers continue to struggle with infrastructure friction that software engineers have largely solved, tracing the evolution of ML platforms from Facebook's FBLearner to modern orchestration tools and analyzing how those early implementations created today's infrastructure pain points.

Learn about the key differences between ML and traditional software engineering: GPU dependencies, the absence of effective local testing, heterogeneity across distributed frameworks such as Ray, Spark, PyTorch, TensorFlow, and Dask, and the unique observability challenges that emerge at scale.

Understand how historical ML platform decisions continue to shape current workflows, and get introduced to Kubetorch, a Kubernetes-native compute platform designed to bridge the gap between iterative Python development and scalable Kubernetes execution. Gain insight into why ML teams require specialized platform engineering approaches rather than standard DevOps practices, and discover practical ways to create more ergonomic ML development workflows that preserve the scalability benefits of Kubernetes while reducing operational complexity.
Syllabus
Why Is ML on Kubernetes Hard? Defining How ML and Software Diverge | Runhouse
Taught by
MLOps World: Machine Learning in Production