
Generative AI Model Data Pre-Training on Kubernetes - A Use Case Study

DevConf via YouTube

Overview

Explore how Kubeflow Pipelines (KFP) streamlines Large Language Model data preprocessing at enterprise scale in this 34-minute conference talk from DevConf.US 2025. Learn how IBM Research processes petabytes of data daily using KFP to build indemnified LLMs for enterprise applications, addressing complexity and scale challenges in workloads that can take days to process. Discover why Kubernetes-based solutions were chosen over alternatives such as Slurm or Spark for LLM experiments and enterprise use cases. Examine how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable pipeline orchestration, including critical processing steps such as deduplication, content classification, and tokenization. Gain insights into real-world challenges, lessons learned, and practical applications of KFP across diverse LLM tasks, including data preprocessing, RAG retrieval, and model fine-tuning, with speaker Santosh Borse sharing direct experience from IBM Research's daily operations.
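One of the preprocessing stages the talk covers is deduplication. As a rough illustration only (this is not the Data Prep Toolkit's actual API, and real pipelines run such steps distributed via KFP and KubeRay), exact deduplication can be sketched by hashing each document and keeping the first occurrence of each digest:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact-duplicate documents by content hash (illustrative sketch)."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["hello world", "foo bar", "hello world"]
print(dedup_exact(corpus))  # the repeated document appears only once
```

At petabyte scale this hash-and-filter idea is applied in parallel across workers rather than in a single in-memory set, which is where the Kubernetes-based orchestration described in the talk comes in.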

Syllabus

Generative AI Model Data Pre-Training on Kubernetes: A Use Case Study - DevConf.US 2025

Taught by

DevConf

