Curating Text Data for Pre-training LLMs using GPU-accelerated Modules from NVIDIA NeMo Curator

Learn how to curate high-quality text data for pre-training Large Language Models (LLMs) using GPU-accelerated modules from NVIDIA NeMo Curator in this 13-minute tutorial. Follow a step-by-step process that downloads the TinyStories dataset from Hugging Face and processes it through cleaning, filtering, deduplication, and redaction of personally identifiable information (PII). Discover how NVIDIA NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization, and how it provides pre-built pipelines for generating synthetic data. The tutorial covers an introduction to the components, the download and conversion process, implementing a document extractor, cleaning and unifying the dataset, quotation unification, Unicode reformatting, and PII redaction. Access the complete tutorial materials on GitHub and learn how curated datasets can improve LLM accuracy and shorten training time.

Syllabus

00:00 - Introduction
01:02 - Understanding All the Different Components
01:38 - Download and Conversion
02:47 - Downloading the Dataset
03:38 - Implementing the Document Extractor
05:32 - Clean and Unify the Dataset
06:26 - Quotation Unifier
07:06 - Unicode Reformatter
11:06 - Redact PII
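
The cleaning steps named in the syllabus (quotation unification, Unicode reformatting, PII redaction) can be illustrated with a minimal sketch. It uses only the Python standard library and does not use or reproduce the NeMo Curator API. All names here (QUOTE_MAP, EMAIL_RE, unify_quotes, normalize_unicode, redact_emails, clean_document) are hypothetical, and the email-only regex is a stand-in assumption for a real PII detector.

```python
import re
import unicodedata

# Hypothetical mapping from typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

# Simplified email pattern; an assumption that covers only one kind of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unify_quotes(text: str) -> str:
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


def normalize_unicode(text: str) -> str:
    """Compose characters into canonical NFC form."""
    return unicodedata.normalize("NFC", text)


def redact_emails(text: str, placeholder: str = "<EMAIL>") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def clean_document(text: str) -> str:
    """Apply the three steps in the order the syllabus lists them."""
    return redact_emails(normalize_unicode(unify_quotes(text)))


if __name__ == "__main__":
    sample = "\u201cHi,\u201d said Tom. Mail me at tom@example.com!"
    print(clean_document(sample))
```

In the tutorial itself these steps run as NeMo Curator modules over the whole dataset; the sketch only shows the kind of per-document transformation each step performs.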