Overview
This 32-minute panel discussion from the OCP Global Summit brings together experts from Meta, Intel, AMD, and Microsoft to discuss server reliability and fault management. The panelists cover recent developments in Hardware Fault Management (HWFM), Fleet Memory Fault Management (FMFM), and the RAS API, and outline unified strategies for error handling and reporting in AI-ready data centers. Topics include standardizing DDR5 error diagnosis, implementing cross-vendor tools for DRAM fault analytics, and adopting consistent RAS API action queues to improve both in-band and out-of-band fault response. The session also covers best practices for building scalable, interoperable reliability solutions for high-performance computing environments, and points to opportunities to contribute to the OCP workgroups working on silicon and data center resilience.
Syllabus
Panel: Hardware Fault Management Progress Towards Reliability at Scale
Taught by
Open Compute Project