
Pushing Limits of Supercomputing Innovation on Azure AI Infrastructure

Microsoft via YouTube

Overview

Explore Azure's supercomputing validation process in this 36-minute conference talk from Microsoft Ignite 2025, where engineers Hugo Affaticati and Nitin Nagarkatte demonstrate how precision-focused training efficiency drives AI infrastructure innovation. Discover a comprehensive validation methodology that spans from GPU kernels to Llama pretraining and large-scale model training, designed to detect bottlenecks early, reduce costs, and boost performance for multi-billion-parameter models. Learn about the fundamental AI infrastructure stack of compute, network, storage, and managed services, and hear about the newly announced general availability of the GB300 GPU on Azure.

Examine the evolution of AI models from 2019 to the present and how Azure's core infrastructure pillars support massive data ingestion at cloud scale. Analyze performance growth trajectories and Azure's production-scale GPU supercomputers, including detailed validation results from the GRAC 314B model run. Compare GPU generations, including GB200/GB300 versus H100 workloads, to understand their performance differentials and optimization strategies. Gain practical insight into achieving predictable throughput, faster training cycles, and confidence in Azure's readiness for enterprise-scale AI deployments, directly from the engineering teams driving these supercomputing innovations.

Syllabus

00:00:00 - History of model evolution from 2019 to present
00:08:07 - Fundamental AI infrastructure stack: compute, network, storage, and managed services
00:10:37 - Announcement of GB300 GPU general availability on Azure
00:16:37 - Core pillars of cloud infrastructure: compute, network, and storage
00:17:21 - Introduction to data ingestion and its scale in the Azure cloud
00:25:32 - Performance growth over time and Azure production scale with GPU supercomputers
00:28:25 - Large-scale validation with the GRAC 314B model
00:31:11 - GPU generations: GB200/GB300 vs H100 workloads
00:35:00 - Summary of Azure's AI infrastructure and year-over-year improvements

Taught by

Microsoft Ignite
