Overview
This webinar for ML and platform engineers shows how to scale large language model fine-tuning beyond single-node memory constraints using distributed GPU clusters with FSDP, DeepSpeed, and Ray. It covers the orchestration and memory-management strategies needed to train frontier-scale models efficiently across distributed systems: fine-tuning LLMs at scale with Ray and PyTorch, saving and resuming checkpoints with Ray Train, and configuring ZeRO optimization stages, mixed precision, and CPU offload for the best balance of memory usage and performance. Attendees gain hands-on experience launching distributed training jobs and build a working understanding of Ray's capabilities for accelerating LLM development. Walk away with practical knowledge, a reusable project foundation, and a clear picture of how Ray and Anyscale integrate to streamline large-scale machine learning workflows.
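The ZeRO options mentioned above (optimization stage, mixed precision, CPU offload) are typically expressed as a DeepSpeed JSON config. A minimal sketch of assembling one in Python follows; the helper name and default values are assumptions for illustration, not taken from the webinar materials, though the config keys themselves are standard DeepSpeed options:

```python
# Sketch: build a DeepSpeed config dict covering the knobs discussed in
# the webinar -- ZeRO stage, mixed precision, and optional CPU offload.
# The function name and defaults here are hypothetical.

def make_zero_config(stage: int = 3,
                     offload_to_cpu: bool = False,
                     bf16: bool = True,
                     micro_batch_size: int = 1,
                     grad_accum_steps: int = 8) -> dict:
    zero = {"stage": stage}
    if offload_to_cpu:
        # ZeRO-Offload: move optimizer state (and, at stage 3, the
        # parameters themselves) to host memory so larger models fit
        # on limited GPU memory, at some throughput cost.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        if stage == 3:
            zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero,
        # Enable exactly one mixed-precision mode.
        "bf16": {"enabled": bf16},
        "fp16": {"enabled": not bf16},
    }

# Example: stage-3 ZeRO with full CPU offload, bf16 mixed precision.
cfg = make_zero_config(stage=3, offload_to_cpu=True)
```

In a Ray Train setup this dict would be passed to the DeepSpeed engine inside the per-worker training loop; raising the ZeRO stage and enabling offload trades step throughput for a lower peak GPU memory footprint.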
Syllabus
Webinar: Scaling LLM Fine-Tuning with FSDP, DeepSpeed, and Ray
Taught by
Anyscale