Alibaba HPN - Data Center Network Architecture for Large Language Model Training
Open Compute Project via YouTube
Overview
Learn about Alibaba's High Performance Network (HPN) architecture in this technical presentation that explores innovative solutions for Large Language Model (LLM) training infrastructure. Discover how traditional data center networks fall short for LLM training workloads, which generate fewer but much larger data flows compared to general cloud computing. Explore the unique 2-tier, dual-plane architecture that can connect 15,000 GPUs in a single Pod, improving upon conventional 3-tier Clos designs. Examine how the dual-ToR implementation enhances reliability by eliminating single points of failure, while the architecture's design prevents hash polarization and optimizes path selection for managing elephant flows. Gain valuable insights from real-world deployment experiences and operational lessons learned from implementing HPN in production environments.
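To make the point about fewer, larger flows concrete, here is a minimal Python sketch of generic hash-based (ECMP-style) path selection. It is not Alibaba's implementation. The hash function, the link count, the traffic volume, and the synthetic 5-tuples are all assumptions chosen for illustration. It spreads the same total traffic across 8 equal-cost links, first as 16,000 small flows and then as 16 large ones, and compares how unevenly the links end up loaded.

```python
import hashlib


def ecmp_link(flow, num_links):
    """Pick a link by hashing the flow's 5-tuple (generic ECMP-style choice)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_links


def link_loads(num_flows, total_bytes, num_links):
    """Split total_bytes evenly over num_flows synthetic flows; sum bytes per link."""
    per_flow = total_bytes / num_flows
    loads = [0.0] * num_links
    for i in range(num_flows):
        # Hypothetical 5-tuple: only the source port differs between flows.
        flow = ("10.0.0.1", "10.0.1.1", 10000 + i, 4791, "UDP")
        loads[ecmp_link(flow, num_links)] += per_flow
    return loads


def imbalance(loads):
    """Ratio of the busiest link's load to the mean load (1.0 is perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


NUM_LINKS = 8          # assumed number of equal-cost uplinks
TOTAL_BYTES = 1.6e12   # assumed total traffic, identical in both scenarios

many_small = link_loads(16000, TOTAL_BYTES, NUM_LINKS)
few_large = link_loads(16, TOTAL_BYTES, NUM_LINKS)

print(f"many small flows: max/mean load = {imbalance(many_small):.2f}")
print(f"few large flows:  max/mean load = {imbalance(few_large):.2f}")
```

With many small flows the hash averages out and the ratio stays close to 1. With only a handful of elephant flows, a few collisions on one link are enough to overload it while other links sit idle, which is the imbalance the presentation attributes to LLM training traffic.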
Syllabus
Alibaba HPN: A Data Center Network for Large Language Model Training
Taught by
Open Compute Project