Curating Text Data for Pre-training LLMs using GPU-accelerated Modules from NVIDIA NeMo Curator

NVIDIA via YouTube

Overview

Learn how to curate high-quality text data for pre-training Large Language Models (LLMs) using GPU-accelerated modules from NVIDIA NeMo Curator in this 13-minute tutorial. The tutorial walks step by step through downloading the TinyStories dataset from Hugging Face and processing it: cleaning, filtering, removing duplicates, and redacting personally identifiable information (PII). NVIDIA NeMo Curator is designed to improve generative AI model accuracy by processing text, image, and video data at scale for training and customization, and it also provides pre-built pipelines for generating synthetic data. Topics include an introduction to the components, download and conversion, implementing a document extractor, cleaning and unifying the dataset, quotation unification, Unicode reformatting, and PII redaction. The complete tutorial materials are available on GitHub, and the video explains how curated datasets can yield higher accuracy and shorter training time for LLMs.
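
To make the cleaning, filtering, deduplication, and redaction steps described above concrete, here is a minimal CPU-only sketch in plain Python. It uses only the standard library and is not the NeMo Curator API; the function names (`unify_quotes`, `normalize_unicode`, `redact_emails`, `curate`), the `<EMAIL>` placeholder, the word-count threshold, and the choice of NFKC normalization are all assumptions made for illustration. The tutorial itself relies on GPU-accelerated modules and covers more PII types than email addresses.

```python
import hashlib
import re
import unicodedata

# Hypothetical mapping: curly quotation marks to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

# Deliberately simple email pattern; real PII redaction covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unify_quotes(text):
    """Replace curly single and double quotes with straight ones."""
    return text.translate(QUOTE_MAP)


def normalize_unicode(text):
    """Apply NFKC normalization (an assumed stand-in for Unicode reformatting)."""
    return unicodedata.normalize("NFKC", text)


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(docs, min_words=5):
    """Clean each document, drop short ones, and drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = redact_emails(normalize_unicode(unify_quotes(doc))).strip()
        if len(text.split()) < min_words:
            continue  # filter: too short to be useful
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(text)
    return kept


sample_docs = [
    "\u201cHello,\u201d said Tom to his friend Ann.",
    '"Hello," said Tom to his friend Ann.',
    "Too short.",
    "Write to me at tom@example.com whenever you like.",
]
curated = curate(sample_docs)
```

In this sketch the first two sample documents collapse into one after quote unification, the third is filtered out for length, and the email address in the fourth is replaced by the placeholder. Hashing the cleaned text only catches exact duplicates; near-duplicate detection would need a different technique.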

Syllabus

00:00 - Introduction
01:02 - Understanding All the Different Components
01:38 - Download and Conversion
02:47 - Downloading the Dataset
03:38 - Implementing the Document Extractor
05:32 - Clean and Unify the Dataset
06:26 - Quotation Unifier
07:06 - Unicode Reformatter
11:06 - Redact PII
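
The "Download and Conversion" and "Implementing the Document Extractor" steps listed above can be pictured with a small sketch that splits a raw text dump into individual stories and serializes them as JSON Lines. This is an illustration under stated assumptions, not the tutorial's code: it assumes stories in the raw file are separated by an `<|endoftext|>` marker, and the names `extract_documents`, `to_jsonl`, and the `id`/`text` record fields are hypothetical.

```python
import json

# Assumed story delimiter in the raw text file.
SEPARATOR = "<|endoftext|>"


def extract_documents(raw_text, separator=SEPARATOR):
    """Yield each non-empty story found between separators."""
    for chunk in raw_text.split(separator):
        story = chunk.strip()
        if story:
            yield story


def to_jsonl(raw_text, name="tinystories"):
    """Convert raw text into JSON Lines, one record per story."""
    lines = []
    for index, story in enumerate(extract_documents(raw_text)):
        record = {"id": f"{name}-{index}", "text": story}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


raw = "Once upon a time.\n<|endoftext|>\nThe end.\n<|endoftext|>\n"
stories = list(extract_documents(raw))
jsonl = to_jsonl(raw)
```

Keeping one JSON record per line lets later cleaning and filtering stages stream documents without loading the whole dataset at once.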

Taught by

NVIDIA Developer
