Hardware Fault Management - Progress Towards Reliability at Scale

Open Compute Project via YouTube

Overview

Explore cutting-edge approaches to server reliability and fault management in this 32-minute panel discussion from the OCP Global Summit featuring industry experts from Meta, Intel, AMD, and Microsoft. Learn about the latest developments in Hardware Fault Management (HWFM), Fleet Memory Fault Management (FMFM), and the RAS API as panelists outline unified strategies for error handling and reporting in AI-ready data centers. Gain insights into standardizing DDR5 error diagnosis, implementing cross-vendor tools for DRAM fault analytics, and adopting consistent RAS API action queues to enhance both in-band and out-of-band fault response capabilities. Discover actionable best practices for developing scalable, interoperable reliability solutions for high-performance computing environments while understanding opportunities to contribute to OCP's collaborative workgroups that are shaping the future of silicon and data center resilience.

Syllabus

Panel: Hardware Fault Management - Progress Towards Reliability at Scale

Taught by

Open Compute Project
