Curating Text Data for Pre-training LLMs using GPU-accelerated Modules from NVIDIA NeMo Curator

NVIDIA Developer via YouTube

Classroom Contents

  1. 00:00 - Introduction
  2. 01:02 - Understanding All the Different Components
  3. 01:38 - Download and Conversion
  4. 02:47 - Downloading the Dataset
  5. 03:38 - Implementing the Document Extractor
  6. 05:32 - Clean and Unify the Dataset
  7. 06:26 - Quotation Unifier
  8. 07:06 - Unicode Reformatter
  9. 11:06 - Redact PII
