Training with Confidence - Catching Silent Errors in Deep Learning Training with Automated Proactive Checks

USENIX via YouTube

Overview

Learn about TRAINCHECK, a proactive framework for detecting and diagnosing silent errors in deep learning training, in this 17-minute conference presentation from OSDI '25. Silent errors are training failures that raise no exception and produce no obvious symptom, so they would otherwise go unnoticed. Discover how researchers from the University of Michigan built a system that automatically infers invariants tailored to deep learning training, then checks those invariants while training runs, both to catch errors early and to give developers debugging help. Examine the evaluation on 20 reproduced real-world silent training errors with diverse root causes, 18 of which TRAINCHECK detected within a single training iteration. Understand how the framework also uncovered 6 previously unknown bugs in popular training libraries that were causing silent errors, demonstrating its practical value for making deep learning training workflows more reliable.
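
To make the idea of checking a training invariant concrete, here is a minimal sketch in plain Python. It is not TRAINCHECK's API or its inference method: the invariant ("an optimizer step should change at least one parameter") is written by hand rather than inferred, and every name in it (`sgd_step`, `params_changed`, `check_step`) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical illustration only; none of these names come from TRAINCHECK.
Params = Dict[str, List[float]]  # parameter name -> flattened values


def sgd_step(params: Params, grads: Params, lr: float) -> Params:
    """Toy SGD update that returns a new parameter snapshot."""
    return {
        name: [w - lr * g for w, g in zip(params[name], grads[name])]
        for name in params
    }


def params_changed(before: Params, after: Params) -> bool:
    """Hand-written invariant: a step should modify at least one parameter."""
    return any(before[name] != after[name] for name in before)


def check_step(before: Params, after: Params, step: int) -> List[str]:
    """Return a list of invariant violations observed at this step."""
    violations = []
    if not params_changed(before, after):
        violations.append(
            f"step {step}: no parameter changed after the optimizer step"
        )
    return violations


params = {"w": [1.0, 2.0]}
grads = {"w": [0.5, 0.5]}

# Healthy step: parameters move, so no violation is reported.
healthy = check_step(params, sgd_step(params, grads, lr=0.1), step=0)

# Silently broken step: a learning rate of 0 raises no exception,
# but the invariant check flags it in the same iteration.
broken = check_step(params, sgd_step(params, grads, lr=0.0), step=1)
```

In this sketch the broken step never raises an error on its own; only the explicit check surfaces it, which is the general pattern the talk describes applying automatically during training.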

Syllabus

OSDI '25 - Training with Confidence: Catching Silent Errors in Deep Learning Training with...

Taught by

USENIX
