Fast and Furious - Practice in Horizon Robotics on Large-scale End-to-end Model Training
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how Horizon Robotics handles large-scale end-to-end model training for autonomous driving in this 30-minute conference talk from CNCF. The talk covers the company's approach to efficiently training and deploying advanced perception models such as Sparse4D, combining cloud-native technologies and deep learning algorithms with chip design expertise. It examines the challenges of managing massive video datasets and large numbers of small files while sustaining high-performance training across more than 2000 GPUs on RDMA infrastructure, and shows how to quickly identify failure types and diagnose issues in large-scale training environments. Topics include strategies for managing large-scale training on Kubernetes, including distributed data caching, network topology awareness, and job affinity scheduling to optimize 2000-GPU training jobs, as well as restoring interrupted training jobs through backup machine replacement to improve task resilience. The speakers also share practical experience with CNCF projects in production autonomous driving model training environments: Volcano for job scheduling, Fluid for data orchestration, and NPD (Node Problem Detector) for cluster health monitoring.
Syllabus
Fast and Furious: Practice in Horizon Robotics on Large-scale End-to-e... Chen Yangxue & Zhihao Xu
Taught by
CNCF [Cloud Native Computing Foundation]