Checkpoint Offloading SSD - Enhancing Performance and Scalability in LLM Training
Open Compute Project via YouTube
Overview
Learn about innovative storage solutions for Large Language Model (LLM) training in this technical presentation from SK Hynix experts. Explore how checkpoint offloading SSD technology addresses performance bottlenecks and enhances scalability in LLM training environments. Discover methods for managing model states, including parameters, momentums, and variances, while reducing data movement between GPUs and storage. Examine experimental results demonstrating how AI storage solutions can optimize GPU memory usage and improve overall training efficiency by offloading optimizer operations to storage. Gain insights into practical approaches for handling interruptions and failures during LLM training through persistent storage strategies.
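To make the terms in the overview concrete, here is a minimal sketch of the three model states it names (parameters, momentums, variances) and of persisting them so training can resume after an interruption. It is not the method from the presentation: it assumes a plain Adam optimizer, stores the state as JSON, and the names `adam_step`, `save_checkpoint`, and `load_checkpoint` are hypothetical helpers invented for illustration.

```python
import json
import os
import tempfile


def adam_step(state, grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place (assumed optimizer, for illustration)."""
    state["step"] += 1
    t = state["step"]
    for i, g in enumerate(grads):
        # Momentum: running average of the gradient.
        state["momentum"][i] = beta1 * state["momentum"][i] + (1 - beta1) * g
        # Variance: running average of the squared gradient.
        state["variance"][i] = beta2 * state["variance"][i] + (1 - beta2) * g * g
        # Bias-corrected estimates.
        m_hat = state["momentum"][i] / (1 - beta1 ** t)
        v_hat = state["variance"][i] / (1 - beta2 ** t)
        state["params"][i] -= lr * m_hat / (v_hat ** 0.5 + eps)


def save_checkpoint(state, path):
    """Write to a temporary file, then rename it over the target.

    The rename is atomic, so a crash mid-write never leaves a partial
    checkpoint at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to persistent storage
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back all three model states plus the step counter."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    state = {
        "step": 0,
        "params": [1.0, -2.0],
        "momentum": [0.0, 0.0],
        "variance": [0.0, 0.0],
    }
    adam_step(state, grads=[0.5, -0.5])
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        save_checkpoint(state, ckpt)
        restored = load_checkpoint(ckpt)
    print(restored == state)
```

Under these assumptions, a checkpoint has to carry the momentum and variance alongside the parameters for a resumed run to continue exactly where it stopped, which is why checkpoint traffic between GPUs and storage grows with optimizer state and not only with model size.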
Syllabus
Checkpoint Offloading SSD - Enhancing Performance and Scalability in LLM Training
Taught by
Open Compute Project