Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer

Explore how Google manages its TPUv4 machine learning supercomputers in this 18-minute conference talk from NSDI '24. The talk covers the design and operation of the software infrastructure that lets TPUv4 supercomputers run at scale, with a focus on automatic fault resiliency and hardware recovery. Learn how a software-defined networking (SDN) approach manages the high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to reconfigure routes dynamically around machine, chip, and link failures. See how the infrastructure detects failures, triggers automatic reconfiguration to minimize workload disruption, and initiates remediation and repair workflows for the affected components, and how similar techniques interface with maintenance and upgrade workflows for both hardware and software. With this dynamic reconfiguration approach, TPUv4 supercomputers achieve 99.98% system availability while handling the hardware outages experienced by approximately 1% of training jobs.
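To put the 99.98% availability figure in perspective, the short calculation below converts an availability percentage into implied downtime. It is a minimal sketch that assumes availability means the fraction of a calendar period during which the system is usable; the listing does not say how the talk measures it, and the function name `downtime_hours` is hypothetical.

```python
def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of unavailability implied by an availability percentage.

    Assumption: availability is the usable fraction of the period.
    """
    return (1.0 - availability_pct / 100.0) * period_hours


HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

# Downtime per year implied by 99.98% availability under the assumption above.
print(round(downtime_hours(99.98, HOURS_PER_YEAR), 2))
```

Under that assumption, 99.98% availability corresponds to a little under two hours of unavailability per year.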