From Scaling to Observability - Solving Key Challenges for Distributed ML with Ray
Data Council via YouTube
Overview
This 28-minute conference talk from Data Council explores the observability challenges that arise when scaling distributed machine learning training across thousands of nodes with Ray. Nikita Vemuri, Software Engineer at Anyscale, shares practical experience tracking large volumes of system data in multi-node environments. The talk covers strategies for correlating information across clusters and for designing observability stacks that surface relevant insights while preserving data privacy. It is aimed at professionals running large-scale ML workloads or building monitoring systems for distributed training environments.
Syllabus
From Scaling to Observability - Solving Key Challenges for Distributed ML with Ray
Taught by
Data Council