Overview
Explore an in-depth characterization study of Large Language Model (LLM) development in datacenters through this 17-minute conference talk from NSDI '24. Examine the challenges and opportunities of using large-scale cluster resources efficiently for LLM development, including hardware failures, parallelization strategies, and resource utilization. Compare LLMs with traditional task-specific Deep Learning workloads, and discover potential optimizations for LLM-tailored systems. Learn about two approaches for LLM development environments: fault-tolerant pretraining, designed to enhance fault tolerance, and decoupled scheduling for evaluation, designed to provide timely performance feedback.
Syllabus
NSDI '24 - Characterization of Large Language Model Development in the Datacenter
Taught by
USENIX