From Scaling to Observability - Solving Key Challenges for Distributed ML with Ray
Data Council via YouTube
Overview
This 28-minute conference talk from Data Council explores the observability challenges that arise when scaling distributed machine learning training across thousands of nodes with Ray. Nikita Vemuri, Software Engineer at Anyscale, shares practical experience in tracking vast amounts of system data in multi-node environments. The talk covers strategies for correlating information across clusters and for designing observability stacks that balance relevant insights against data privacy. It is aimed at professionals who run large-scale ML workloads or build monitoring systems for distributed training environments.
Syllabus
From Scaling to Observability - Solving Key Challenges for Distributed ML with Ray
Taught by
Data Council