Meta AI ML System Error Handling Improvements - PCIe Completion Timeout Error Handling Using Root Port PIO

Open Compute Project via YouTube

Overview

Learn how Meta's hardware and RAS (reliability, availability, and serviceability) engineers improved error handling in large-scale AI/ML infrastructure by changing how PCIe completion timeouts are handled. The talk covers the challenges of Meta's Grand Teton Training (GTT) platform, which has a complex hierarchy of PCIe components (host CPUs, PCIe switches, and end devices) replicated across large machine clusters. It explains how frequent PCIe completion timeout (CTO) events caused significant job interruptions, and how the team addressed them using Root Port PIO (RP_PIO) error reporting. It also describes the diagnostic improvements and failure remediation strategies that reduced subsequent job interruptions in production AI/ML workloads. The presentation is aimed at engineers working on large-scale computing infrastructure, PCIe error handling, and system reliability in AI/ML environments.

Syllabus

Meta AI ML System Error Handling Improvements PCIe Completion Timeout error handling using

Taught by

Open Compute Project

