Mitigating Silent Data Corruption: Industry-Academia Collaboration and Progress
Open Compute Project via YouTube
Overview
In this 20-minute talk from the Open Compute Project, Emel Goksu (Ecosystem and Partnerships Lead at Meta) discusses the challenge of Silent Data Corruption (SDC) in computing systems. The talk describes how major tech companies including AMD, ARM, Google, Intel, Meta, Microsoft, and NVIDIA have collaborated since 2022 on the Server Compute Resiliency Specification, and how this industry initiative partners with multiple universities through the Open Compute Project to advance research on detecting and mitigating SDC. These errors are rare, but their impact grows at scale, especially as AI workloads expand. The presentation covers the release of Specification 1.0 at the 2024 OCP Global Summit and outlines ongoing work: a next specification version focused on GPUs, continued university research, and a handbook for AI developers.
Syllabus
Mitigating Silent Data Corruption: Industry-Academia Collaboration and Progress
Taught by
Open Compute Project