Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer

USENIX via YouTube

Overview

This 18-minute conference talk from NSDI '24 covers how Google manages its TPUv4 machine learning supercomputers. It describes the design and operation of the software infrastructure that lets TPUv4 supercomputers run at scale, with a focus on automatic fault resiliency and hardware recovery. The high-bandwidth inter-chip interconnect (ICI) fabric is managed through a software-defined networking (SDN) approach, and optical circuit switching lets the system reconfigure routes dynamically to work around machine, chip, and link failures. The talk explains how the infrastructure detects failures, triggers automatic reconfigurations to minimize workload disruption, and initiates remediation and repair workflows for the affected components. It also shows how similar techniques interface with maintenance and upgrade workflows for both hardware and software. This dynamic reconfiguration approach allows TPUv4 supercomputers to achieve 99.98% system availability while handling the hardware outages experienced by approximately 1% of training jobs.
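To put the 99.98% availability figure in perspective, the short Python sketch below converts an availability fraction into implied downtime. It assumes the figure can be read as a time-based fraction over a 365-day year, which the listing does not state; the function name `downtime_hours` and the constant `HOURS_PER_YEAR` are illustrative and do not come from the talk.

```python
# Illustrative arithmetic only: assumes availability is a time-based fraction.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of unavailability implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    # 99.98% availability over one year
    print(f"{downtime_hours(0.9998):.2f} hours per year")
```

Under that reading, 99.98% availability corresponds to a little under two hours of unavailability per year.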

Syllabus

NSDI '24 - Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Taught by

USENIX
