Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation

USENIX via YouTube

Overview

Learn about accelerating design space exploration for large language model (LLM) training systems in this 16-minute conference presentation from NSDI '25. Discover how researchers from Tsinghua University, Zhongguancun Laboratory, and the University of Pennsylvania address the challenge of efficiently exploring the vast design space of LLM training clusters with tens to hundreds of thousands of GPUs. Explore the limitations of current simulation methods, which can require up to 10,000 experiments to identify optimal configurations, and understand how inadequate exploration leads to significant degradation in training performance. Examine the Single-process Multi-experiment (SPME) approach, which reduces scheduling overhead and optimizes resource utilization but has limitations at current AI cluster scales. Delve into Multiverse, a novel GPU-based AI training simulator that enhances SPME's efficacy through optimizations including pull-based synchronization, high-fidelity intra-server communication, and kernel fusion. Analyze extensive experimental results showing that Multiverse deviates by less than 3.0% from real-world LLM training on clusters of up to 54,000 GPUs and achieves a 43.1x to 73.2x speedup over state-of-the-art CPU-based simulators across various use cases.

Syllabus

NSDI '25 - Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment...

Taught by

USENIX
