Mitigating Silent Data Corruption: Industry-Academia Collaboration and Progress
Open Compute Project via YouTube
Overview
This 20-minute talk from the Open Compute Project features Emel Goksu (Ecosystem and Partnerships Lead at Meta) discussing the challenge of Silent Data Corruption (SDC) in computing systems. Learn how major tech companies including AMD, ARM, Google, Intel, Meta, Microsoft, and NVIDIA have collaborated since 2022 on the Server Compute Resiliency Specification. Discover how this industry initiative partners with multiple universities through the Open Compute Project to advance research on detecting and mitigating these errors, which are rare but become increasingly significant at scale, especially with growing AI workloads. The presentation covers the release of Specification 1.0 at the 2024 OCP Global Summit and outlines ongoing work: a next specification version focused on GPUs, continued university research, and a handbook for AI developers.
Syllabus
Mitigating Silent Data Corruption: Industry-Academia Collaboration and Progress
Taught by
Open Compute Project