Enhancing RAS in AI Hardware and High-Performance Computing with Real-Time Health Monitoring

Open Compute Project via YouTube

Overview

This 13-minute talk by Guy Gozlan, Machine Learning and Algorithms Director at proteanTecs, explains how Real-Time Health Monitoring (RTHM) can improve reliability, availability, and serviceability (RAS) in AI cloud services and hyperscale data centers. RTHM uses embedded deep data monitoring to predict and avoid semiconductor failures. The talk covers how this technology mitigates silent data corruption, supports predictive and prescriptive maintenance, and reduces unplanned downtime by detecting potential failures before they occur. It also details how continuous parametric measurements enable proactive failure avoidance, improving the reliability of advanced electronics in increasingly complex high-performance computing systems.

Syllabus

Enhancing RAS in AI Hardware and High-Performance Computing with Real-Time Health Monitoring

Taught by

Open Compute Project

