Overview
Explore how to leverage Kubernetes for large-scale generative AI model data preprocessing in this conference talk from the Linux Foundation. Learn why large language models require preprocessing vast amounts of data, often spanning petabytes, a process that can take days due to its complexity and scale. Discover how Kubeflow Pipelines (KFP) simplifies LLM data processing by providing flexibility, repeatability, and scalability for enterprise applications, as demonstrated through daily use at IBM Research for building indemnified LLMs. Compare data preparation toolkits built on Kubernetes, Rust, Slurm, or Spark, and understand how to choose the right toolkit for LLM experiments or enterprise use cases. Examine how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable pipeline orchestration, including steps such as deduplication, content classification, and tokenization. Gain insights from real-world challenges, lessons learned, and practical experience with KFP, and explore its applicability to diverse LLM tasks such as data preprocessing, RAG retrieval, and model fine-tuning.
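The talk mentions deduplication, content classification, and tokenization as typical pipeline steps. As a rough, self-contained illustration of how such stages chain together (plain Python only, no Kubernetes, KFP, or KubeRay; all function names are hypothetical stand-ins, and each stage is deliberately simplified, e.g. whitespace tokenization in place of a real subword tokenizer):

```python
import hashlib
import re

def deduplicate(docs):
    """Drop exact duplicates by content hash (hypothetical dedup stage)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def classify(doc):
    """Toy content classifier: label documents containing code-like tokens."""
    return "code" if re.search(r"\bdef\b|\bclass\b|[{};]", doc) else "text"

def tokenize(doc):
    """Whitespace tokenization as a stand-in for a real subword tokenizer."""
    return doc.lower().split()

def run_pipeline(docs):
    """Chain the stages in sequence, the way a KFP pipeline would order
    its components (here as plain function calls, not pipeline steps)."""
    unique = deduplicate(docs)
    return [{"label": classify(d), "tokens": tokenize(d)} for d in unique]

corpus = ["Hello world", "Hello world", "def main(): pass"]
result = run_pipeline(corpus)
# Two unique documents survive dedup; the code snippet is labeled "code".
```

In a real KFP deployment each stage would be a containerized component with its own resources, and KubeRay would distribute the per-document work across a Ray cluster; the sequencing logic, though, is the same.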
Syllabus
Generative AI Model Data Pre-Training on Kubernetes: A Use Case S...
Anish Asthana & Mohammad Nassar
Taught by
Linux Foundation