Telemetry-Based Load Balancing of AI/ML Workloads in Self-Healing Networks
Open Compute Project via YouTube
Overview
Learn how Tencent implemented a self-healing network for AI/ML workloads in this 19-minute technical presentation from Broadcom experts. Explore what sets AI/ML network traffic apart from traditional workloads: it consists of fewer flows, each consuming significant bandwidth and quickly saturating links, and it requires a lossless, low-latency fabric. Discover how Ethernet-based technologies and the SAI/SONiC ecosystem are used alongside Broadcom's networking solutions to maintain optimal performance. Gain insight into the implementation of in-band telemetry and packet drop monitoring, and understand how applications use granular network telemetry data to dynamically optimize load balancing for AI/ML workload flows.
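The listing does not describe the algorithm used in the talk, so the following is only a minimal Python sketch of the general idea: an application reads per-link telemetry and moves large flows away from congested or lossy links. The `LinkTelemetry` fields, the cost function, the 0.8 threshold, and the link and flow names are all assumptions made for illustration, not part of SAI, SONiC, or any Broadcom API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LinkTelemetry:
    """Hypothetical per-link sample; field names are illustrative."""
    utilization: float  # fraction of link capacity in use, 0.0 to 1.0
    drops: int          # packets dropped in the last sampling interval


def link_cost(sample: LinkTelemetry, drop_penalty: float = 1.0) -> float:
    # A link that dropped packets costs more than a saturated lossless one.
    return sample.utilization + (drop_penalty if sample.drops > 0 else 0.0)


def pick_link(telemetry: dict[str, LinkTelemetry]) -> str:
    # Lowest cost wins; iterating sorted names makes ties deterministic.
    return min(sorted(telemetry), key=lambda name: link_cost(telemetry[name]))


def rebalance(
    flows: dict[str, str],
    demand: dict[str, float],
    telemetry: dict[str, LinkTelemetry],
    threshold: float = 0.8,
) -> dict[str, str]:
    """Return a new flow-to-link placement; inputs are not modified."""
    # Work on copies so projected utilization can be updated as flows move.
    state = {n: LinkTelemetry(s.utilization, s.drops) for n, s in telemetry.items()}
    placement = dict(flows)
    # Consider the largest flows first, since they relieve the most load.
    for flow in sorted(flows, key=lambda f: -demand[f]):
        current = placement[flow]
        if link_cost(state[current]) <= threshold:
            continue  # the current link is healthy enough
        target = pick_link(state)
        if target == current:
            continue  # nothing better is available
        if state[target].utilization + demand[flow] > threshold:
            continue  # moving would only shift the congestion
        state[current].utilization -= demand[flow]
        state[target].utilization += demand[flow]
        placement[flow] = target
    return placement
```

In this sketch a flow is moved only when its current link is over the threshold or dropping packets and the best alternative has headroom for it, which avoids bouncing a large flow between two links that are both near saturation.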
Syllabus
Telemetry-based load balancing of AI/ML workloads
Taught by
Open Compute Project