Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Discover the intricacies of multi-node Large Language Model (LLM) finetuning in a conference talk that provides a practical, step-by-step implementation guide for Kubernetes clusters with GPUs. Learn how to leverage PyTorch FSDP and the Kubeflow training operator while covering cluster preparation, optimization techniques, and performance comparisons across network topologies. Explore critical configurations including pod networking, secondary networks, and GPUDirect RDMA over Ethernet to achieve optimal performance. Gain hands-on knowledge about improving model performance on specific downstream tasks by finetuning on enterprise private data, while understanding the substantial compute requirements and the challenges unique to Kubernetes environments. Master the implementation details needed to run multi-node LLM finetuning in production Kubernetes environments.
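To make the FSDP-plus-Kubeflow combination concrete, here is a minimal sketch of a worker entrypoint. It assumes the standard rendezvous environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`) that the Kubeflow training operator injects into each PyTorchJob pod; the model is a placeholder, not anything specific from the talk.

```python
import os

def dist_config_from_env(env=os.environ):
    """Parse the rendezvous settings the Kubeflow training operator
    injects into each PyTorchJob pod."""
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }

def main():
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    cfg = dist_config_from_env()
    # NCCL backend for GPU collectives; the operator's env vars drive
    # the rendezvous, so no extra init arguments are needed here.
    dist.init_process_group("nccl", rank=cfg["rank"],
                            world_size=cfg["world_size"])
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for the LLM
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # which is what makes multi-node finetuning fit in GPU memory.
    model = FSDP(model)
    # ... training loop elided ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Whether NCCL traffic actually uses GPUDirect RDMA then depends on the cluster-side configuration the talk covers (secondary networks and RoCE), not on this training script.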
Syllabus
Multi-Node Finetuning LLMs on Kubernetes: A Practitioner’s Guide - Ashish Kamra & Boaz Ben Shabat