Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Discover the intricacies of multi-node Large Language Model (LLM) finetuning in a conference talk that provides a practical, step-by-step implementation guide for Kubernetes clusters with GPUs. Learn how to leverage PyTorch FSDP and the Kubeflow training operator while mastering essential aspects of cluster preparation, optimization techniques, and performance comparisons across various network topologies. Explore critical configurations including pod networking, secondary networks, and GPUDirect RDMA over Ethernet to achieve optimal performance. Gain hands-on knowledge about enhancing model performance on specific downstream tasks by finetuning on enterprise private data, while understanding the substantial compute requirements and the challenges unique to Kubernetes environments. Master the implementation details needed to successfully introduce multi-node LLM finetuning in production Kubernetes environments.
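To give a sense of the Kubeflow training operator setup the talk covers, a multi-node finetuning run is typically declared as a PyTorchJob custom resource. The sketch below is illustrative only (the job name, container image, replica counts, and secondary-network name are assumptions, not taken from the talk); the `k8s.v1.cni.cncf.io/networks` annotation is how a Multus-managed secondary network is attached to the training pods.

```yaml
# Minimal sketch of a multi-node PyTorchJob (Kubeflow training operator).
# Names, image, and counts are hypothetical placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune            # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            # Attach a secondary (e.g. RDMA-capable) network via Multus;
            # "rdma-net" is an assumed NetworkAttachmentDefinition name.
            k8s.v1.cni.cncf.io/networks: rdma-net
        spec:
          containers:
            - name: pytorch     # PyTorchJob expects this container name
              image: example.com/llm-finetune:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 3               # 3 workers + 1 master = 4 nodes total
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: rdma-net
        spec:
          containers:
            - name: pytorch
              image: example.com/llm-finetune:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The operator injects the rendezvous environment (master address, world size, ranks) into each pod, so a PyTorch FSDP training script can initialize its distributed process group without hand-wiring node addresses.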
Syllabus
Multi-Node Finetuning LLMs on Kubernetes: A Practitioner’s Guide - Ashish Kamra & Boaz Ben Shabat
Taught by
CNCF [Cloud Native Computing Foundation]