Hardware Fault Management - Progress Towards Reliability at Scale

Explore current approaches to server reliability and fault management in this 32-minute panel discussion from the OCP Global Summit, featuring industry experts from Meta, Intel, AMD, and Microsoft. Learn about the latest developments in Hardware Fault Management (HWFM), Fleet Memory Fault Management (FMFM), and the RAS API as panelists outline unified strategies for error handling and reporting in AI-ready data centers. Gain insights into standardizing DDR5 error diagnosis, implementing cross-vendor tools for DRAM fault analytics, and adopting consistent RAS API action queues to improve both in-band and out-of-band fault response. Discover best practices for developing scalable, interoperable reliability solutions for high-performance computing environments, and learn how to contribute to the OCP collaborative workgroups that are shaping silicon and data center resilience.