New Approaches to Network Telemetry for AI Performance Optimization
Open Compute Project via YouTube
Google AI Professional Certificate - Learn AI Skills That Get You Hired
The Investment Banker Certification
Overview
Google, IBM & Meta Certificates — All 10,000+ Courses at 40% Off
One annual plan covers every course and certificate on Coursera. 40% off for a limited time.
Get Full Access
Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk from Nvidia's Principal Software Research Architect. Explore why traditional data center telemetry approaches fall short for massive ML models and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create unique patterns of similarity and synchronicity across adaptive-routed, rail-optimized, fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.
Syllabus
New approaches to network telemetry Essential for AI performance
Taught by
Open Compute Project