New Approaches to Network Telemetry for AI Performance Optimization
Open Compute Project via YouTube
Overview
Learn how to optimize large GPU clusters for machine learning workloads in this 11-minute conference talk from NVIDIA's Principal Software Research Architect. Explore why traditional data center telemetry approaches fall short for massive ML models and discover new methods for extracting meaningful metrics from large-scale clusters. Examine how ML workloads create unique patterns of similarity and synchronicity across adaptive-routed, rail-optimized fat-tree topologies, and understand the specialized abstractions developed to identify performance optimization opportunities in ML-focused infrastructure.
Syllabus
New approaches to network telemetry essential for AI performance
Taught by
Open Compute Project