Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Discover the intricacies of multi-node Large Language Model (LLM) finetuning in a conference talk that provides a practical, step-by-step implementation guide for Kubernetes clusters with GPUs. Learn how to leverage PyTorch FSDP and the Kubeflow training operator while covering cluster preparation, optimization techniques, and performance comparisons across network topologies. Explore critical configurations including pod networking, secondary networks, and GPUDirect RDMA over Ethernet to achieve optimal performance. Gain hands-on knowledge about improving model performance on specific downstream tasks by finetuning on enterprise private data, while understanding the substantial compute requirements and the challenges unique to Kubernetes environments. Master the implementation details needed to run multi-node LLM finetuning in production Kubernetes environments.
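To make the FSDP-plus-Kubeflow combination concrete, here is a minimal sketch of a worker entrypoint. It assumes the standard rendezvous environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`) that the Kubeflow training operator injects into each PyTorchJob pod; the model is a placeholder, not anything specific from the talk.

```python
import os

def dist_config_from_env(env=os.environ):
    """Parse the rendezvous settings the Kubeflow training operator
    injects into each PyTorchJob pod."""
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }

def main():
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    cfg = dist_config_from_env()
    # NCCL backend for GPU collectives; the operator's env vars drive
    # the rendezvous, so no extra init arguments are needed here.
    dist.init_process_group("nccl", rank=cfg["rank"],
                            world_size=cfg["world_size"])
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for the LLM
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # which is what makes multi-node finetuning fit in GPU memory.
    model = FSDP(model)
    # ... training loop elided ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Whether NCCL traffic actually uses GPUDirect RDMA then depends on the cluster-side configuration the talk covers (secondary networks and RoCE), not on this training script.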
Syllabus
Multi-Node Finetuning LLMs on Kubernetes: A Practitioner’s Guide - Ashish Kamra & Boaz Ben Shabat