YouTube

Taming Distributed AI Training with Ray and Datadog Observability

Anyscale via YouTube

Overview

Learn to monitor, debug, and optimize large-scale distributed AI training workloads in this conference talk from Ray Summit 2025. Discover how Datadog tackles the challenges of running thousands of tasks across heterogeneous GPU clusters, addressing common issues such as job stalling, unexpected GPU idling, and hard-to-diagnose slowdowns. Explore the real-world observability techniques Datadog's engineering team developed to turn opaque multi-GPU distributed jobs into transparent, debuggable systems. Learn to identify critical failure modes, including task backpressure, resource fragmentation, object store contention and spilling, slow nodes, and scheduler bottlenecks. Understand which specific metrics, traces, and logs provide the most valuable insight when diagnosing bottlenecks and failures in Ray deployments. Gain practical strategies for correlating observability signals to make distributed training workloads more reliable and performant. Learn proven techniques for spotting GPU underutilization, tracking straggler tasks, analyzing scheduling delays, and detecting systemic issues before they escalate into major problems. Apply these observability practices to build confidence in your Ray clusters, whether you're training your first multi-node model or operating production-scale LLM jobs.
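
While the talk itself isn't reproduced here, the straggler-tracking idea it covers can be illustrated with a minimal sketch. The shard task, cluster address, and the ray.training.* metric names below are hypothetical placeholders rather than anything shown in the talk; the sketch uses ray.wait with a timeout to flag slow tasks and forwards the counts to Datadog via DogStatsD:

    import ray
    from datadog import initialize, statsd  # DogStatsD client from the `datadog` package

    ray.init(address="auto")  # connect to an existing Ray cluster
    initialize(statsd_host="localhost", statsd_port=8125)

    @ray.remote(num_gpus=1)
    def train_shard(shard_id: int) -> int:
        # Hypothetical stand-in for one shard of a distributed training step.
        return shard_id

    refs = [train_shard.remote(i) for i in range(64)]

    # Tasks still pending after the deadline are straggler candidates.
    done, pending = ray.wait(refs, num_returns=len(refs), timeout=600.0)

    # Emit gauges so Datadog can alert before the whole job stalls.
    statsd.gauge("ray.training.stragglers", len(pending))
    statsd.gauge("ray.training.completed", len(done))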

Syllabus

Taming Distributed AI Training with Ray + Datadog Observability | Ray Summit 2025

Taught by

Anyscale

