Overview
Learn to monitor, debug, and optimize large-scale distributed AI training workloads through this conference talk from Ray Summit 2025. Discover how Datadog tackles the complex challenges of running thousands of tasks across heterogeneous GPU clusters, addressing common issues like job stalling, unexpected GPU idling, and difficult-to-diagnose slowdowns. Explore the real-world observability techniques developed by Datadog's engineering team to transform opaque multi-GPU distributed jobs into transparent, debuggable systems.

Master the identification of critical failure modes including task backpressure, resource fragmentation, object store contention, spillover, slow nodes, and scheduler bottlenecks. Understand which specific metrics, traces, and logs provide the most valuable insights for diagnosing bottlenecks and failures in Ray deployments. Gain practical strategies for correlating observability signals to make distributed training workloads more reliable and performant.

Learn proven techniques for identifying GPU underutilization, tracking straggler tasks, analyzing scheduling delays, and detecting systemic issues before they escalate into major problems. Apply these observability practices to build confidence in your Ray clusters, whether you're training your first multi-node model or operating production-scale LLM jobs.
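The talk covers straggler tracking among other failure modes; the snippet below is an illustrative sketch (not from the talk) of one common approach: flagging tasks whose durations are outliers relative to the median, using a robust median-absolute-deviation scale so a few slow tasks don't skew the baseline. The function name and threshold are assumptions for illustration.

```python
import statistics

def flag_stragglers(durations, threshold=2.0):
    """Return indices of tasks whose duration is an outlier.

    A task is a straggler if its duration exceeds the median by more
    than `threshold` times the median absolute deviation (MAD).
    Illustrative only; in practice these durations would come from
    per-task timing metrics exported to an observability backend.
    """
    median = statistics.median(durations)
    mad = statistics.median(abs(d - median) for d in durations)
    scale = mad if mad > 0 else 1e-9  # avoid division by zero
    return [i for i, d in enumerate(durations)
            if (d - median) / scale > threshold]

# Example: four ~10s tasks and one 31.5s task; the slow one is flagged.
print(flag_stragglers([10.1, 9.8, 10.3, 10.0, 31.5]))  # → [4]
```

A robust statistic like the MAD is preferable to the mean and standard deviation here, since the stragglers themselves would otherwise inflate the baseline they are measured against.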
Syllabus
Taming Distributed AI Training with Ray + Datadog Observability | Ray Summit 2025
Taught by
Anyscale