Overview
Explore how to leverage Kubernetes for large-scale generative AI model data preprocessing in this conference talk from the Linux Foundation. Large Language Models require preprocessing vast amounts of data, often spanning petabytes, a process that can take days because of its complexity and scale. Discover how Kubeflow Pipelines (KFP) simplifies LLM data processing by providing flexibility, repeatability, and scalability for enterprise applications, as demonstrated through daily use at IBM Research for building indemnified LLMs. Compare data preparation toolkits built on Kubernetes, Rust, Slurm, or Spark, and learn how to choose the right toolkit for LLM experiments or enterprise use cases. Examine how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable pipeline orchestration, covering steps such as deduplication, content classification, and tokenization. Gain insights from real-world challenges, lessons learned, and practical experience with KFP, and explore its applicability to diverse LLM tasks such as data preprocessing, RAG retrieval, and model fine-tuning.
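To make the pipeline stages mentioned above concrete, here is a minimal plain-Python sketch of the kind of transform chain such a toolkit orchestrates: exact deduplication, content classification, and tokenization. These functions are simplified stand-ins, not the Data Prep Toolkit's actual implementations; the keyword-based `classify` rule and whitespace `tokenize` are purely illustrative, and real pipelines would run each stage as a distributed KFP/Ray step over sharded data.

```python
import hashlib


def dedup(docs):
    # Exact deduplication: keep the first occurrence of each content hash.
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def classify(docs, keyword="code"):
    # Toy content classifier: tag each doc by a keyword rule (illustrative only).
    return [(doc, "code" if keyword in doc else "text") for doc in docs]


def tokenize(doc):
    # Whitespace split stands in for a real subword tokenizer (e.g. BPE).
    return doc.split()


corpus = ["hello world", "hello world", "def main(): code here"]
stage1 = dedup(corpus)                       # 2 unique docs remain
stage2 = classify(stage1)
stage3 = [tokenize(doc) for doc, _label in stage2]
```

In a production pipeline each of these functions would become an independent, containerized pipeline step so that failed stages can be retried and the whole run is repeatable, which is the flexibility the talk attributes to KFP.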
Syllabus
Generative AI Model Data Pre-Training on Kubernetes: A Use Case S... (Anish Asthana & Mohammad Nassar)
Taught by
Linux Foundation