Overview
Learn how to overcome the operational challenges of scaling Ray on Kubernetes through this 15-minute conference talk from Ray Summit 2025. Discover Google's solutions for eliminating operational toil and improving observability when running distributed Ray workloads in production environments.
Explore the common pain points platform teams face with manually managing Ray operators, including the fragile and time-consuming update processes that create significant operational overhead. Understand how the KubeRay GKE Addon provides a fully managed, auto-updating solution that removes the burden of constant operator maintenance, enabling teams to scale Ray workloads without manual intervention.
Address the critical observability challenges that arise when Ray jobs fail, where debugging becomes guesswork across multiple layers including Ray applications, Kubernetes infrastructure, and underlying systems. Examine the new RayJob observability dashboard in Google Cloud Logging & Monitoring, which unifies Ray logs, metrics, pod events, and cluster signals into a comprehensive single-pane-of-glass view for accelerated root-cause analysis.
Gain insights from Google engineers Sunny Hwang and Raja Jadeja on building high-performance infrastructure purpose-built for large distributed workloads and implementing effective monitoring strategies for production Ray deployments.
Syllabus
Running Ray in Production: Google’s Guide to Operators & Observability | Ray Summit 2025
Taught by
Anyscale