Overview
Learn about DShuffle, an innovative framework that leverages Data Processing Units (DPUs) to optimize shuffle operations in large-scale distributed data processing systems, through this 19-minute conference presentation from USENIX ATC '25. Discover how researchers from Wuhan National Laboratory for Optoelectronics and Huawei Cloud address the critical performance bottleneck of shuffle operations, which transfer intermediate data between nodes in distributed computing environments.

Explore the technical architecture of DShuffle, which divides the shuffle process into three pipelined stages: serialization, preprocessing, and I/O operations, specifically designed to harness DPU capabilities effectively. Understand how the framework uses high-concurrency memory access units to accelerate the serialization phase and enables DPUs to write intermediate data directly to disk, eliminating unnecessary data copies and reducing CPU overhead.

Examine experimental results from testing on a real DPU platform with industrial-grade Apache Spark, which demonstrate significant improvements in both host CPU efficiency and I/O performance, leading to reduced task completion times in data analysis workloads involving large datasets.
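The three-stage pipeline described above can be sketched in plain Python. This is an illustrative toy only, under stated assumptions: all names (`run_pipelined_shuffle`, the in-memory `spill` buffers) are hypothetical, hash partitioning stands in for Spark's partitioner, and an in-process thread per stage stands in for DPU offload; real DShuffle runs these stages on DPU hardware and writes directly to disk.

```python
# Hypothetical sketch of a three-stage pipelined shuffle writer
# (serialize -> preprocess -> I/O), mirroring the stage split the talk
# describes. This is NOT DShuffle's actual code: threads model the
# pipeline, and bytearrays stand in for on-disk spill files.
import pickle
import queue
import threading

def run_pipelined_shuffle(records, num_partitions=4):
    """Push (key, value) records through the pipeline concurrently."""
    ser_q, io_q = queue.Queue(), queue.Ueue() if False else queue.Queue()
    spill = {p: bytearray() for p in range(num_partitions)}

    def serialize():
        # Stage 1: turn each record into bytes (DShuffle accelerates
        # this phase with high-concurrency memory access units).
        for key, value in records:
            ser_q.put((key, pickle.dumps((key, value))))
        ser_q.put(None)  # end-of-stream marker

    def preprocess():
        # Stage 2: assign each serialized record to a shuffle partition.
        while (item := ser_q.get()) is not None:
            key, blob = item
            io_q.put((hash(key) % num_partitions, blob))
        io_q.put(None)

    def write():
        # Stage 3: append bytes to per-partition spill buffers
        # (DShuffle lets the DPU write these directly to disk,
        # skipping extra host-memory copies).
        while (item := io_q.get()) is not None:
            part, blob = item
            spill[part] += blob

    stages = [threading.Thread(target=f) for f in (serialize, preprocess, write)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return spill

spill = run_pipelined_shuffle([(i, i * i) for i in range(100)])
```

Because the stages are decoupled by queues, serialization of later records overlaps with partitioning and writing of earlier ones, which is the source of the pipeline's CPU and I/O overlap.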
Syllabus
USENIX ATC '25 - DShuffle: DPU-Optimized Shuffle Framework for Large-scale Data Processing
Taught by
USENIX