Overview
Explore GPU cloud infrastructure optimization in this technical conference talk, which examines hardware-level considerations for AI systems. Learn how to fine-tune machine learning models on an H100 cluster, with detailed analysis of critical components such as the pod scheduler, device plugin, GPU/NUMA topology, and the RoCE/NCCL stack. See first-hand experimental results showing how device operator configuration on nodes relates to model performance, focusing on CNN, RNN, and Transformer models from MLPerf. The talk covers often-overlooked hardware aspects of AI infrastructure that can significantly affect the performance and efficiency of distributed machine learning.
Syllabus
Optimize Your AI Cloud Infrastructure: A Hardware Perspective - Liang Yan, CoreWeave
Taught by
Linux Foundation