INSPECT - Data Analysis Tool for Proactive Link Failure Detection on Meta's AI Infrastructure
Open Compute Project via YouTube
Overview
Learn about INSPECT, a proactive data analysis tool that Meta developed to detect link failures in its AI infrastructure, in this 25-minute conference presentation. Discover how this software daemon addresses the critical challenge of hardware reliability as Meta scales to 100K+ clusters for generative AI applications: it collects and analyzes SerDes parameters to predict potential failures before they cause unplanned downtime. Explore the machine learning algorithms behind INSPECT's anomaly detection and its pass-fail criteria, and understand how the tool standardizes collection and analysis across SerDes vendors to support vendor heterogeneity. Examine an architecture that maps common SerDes building blocks to Meta's predefined schema, so the tool can be deployed across heterogeneous AI systems for new system provisioning, training job deployment, and continuous monitoring of thousands of high-speed links and cables.
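As a rough illustration of the idea described above, the following Python sketch maps vendor-specific SerDes readings onto one shared schema and then applies a pass-fail check. It is not INSPECT's code. The vendor names, field names, schema, and threshold are all hypothetical, and a simple median-based outlier test stands in for the machine learning algorithms covered in the talk, whose details this listing does not give.

```python
from dataclasses import dataclass
from statistics import median

# Hypothetical mapping from each vendor's native field names to a
# hypothetical common schema. Real vendors and fields will differ.
VENDOR_FIELD_MAP = {
    "vendor_a": {"snr_db": "snr_db", "eye_h_mv": "eye_height_mv"},
    "vendor_b": {"SNR": "snr_db", "EyeHeight": "eye_height_mv"},
}


@dataclass(frozen=True)
class LinkSample:
    """One reading for one link, expressed in the common schema."""
    link_id: str
    snr_db: float
    eye_height_mv: float


def normalize(vendor: str, link_id: str, raw: dict) -> LinkSample:
    """Translate a vendor-specific reading into the common schema."""
    mapping = VENDOR_FIELD_MAP[vendor]
    fields = {common: float(raw[native]) for native, common in mapping.items()}
    return LinkSample(link_id=link_id, **fields)


def flag_outliers(samples: list, field: str, k: float = 3.0) -> dict:
    """Mark a link 'fail' when its value sits more than k median absolute
    deviations below the fleet median, otherwise 'pass'.

    This is a stand-in heuristic, not the method presented in the talk.
    """
    values = [getattr(s, field) for s in samples]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread in the data, so nothing can be called an outlier.
        return {s.link_id: "pass" for s in samples}
    return {
        s.link_id: "fail" if (med - getattr(s, field)) / mad > k else "pass"
        for s in samples
    }


if __name__ == "__main__":
    readings = [
        ("vendor_a", "link0", {"snr_db": 20, "eye_h_mv": 110}),
        ("vendor_a", "link1", {"snr_db": 21, "eye_h_mv": 112}),
        ("vendor_b", "link2", {"SNR": 19, "EyeHeight": 108}),
        ("vendor_b", "link3", {"SNR": 20, "EyeHeight": 111}),
        ("vendor_b", "link4", {"SNR": 10, "EyeHeight": 109}),
    ]
    samples = [normalize(v, link, raw) for v, link, raw in readings]
    print(flag_outliers(samples, "snr_db"))
```

Because every reading is normalized before analysis, the same check runs unchanged regardless of which vendor produced the data, which is the point of mapping to a shared schema.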
Syllabus
INSPECT - Data Analysis Tool for Proactive Link Failure Detection on Meta's AI Infrastructure
Taught by
Open Compute Project