Overview

This course teaches efficient data preparation strategies for training large-scale LLMs. It covers scalable data collection, deduplication, filtering, and augmentation techniques for building high-quality, diverse, and optimized datasets.

Syllabus
- Unit 1: Efficient Data Storage for Large-Scale LLMs (first sketch after this syllabus)
  - Efficient Streaming of the Wikipedia Dataset
  - Saving the Wikipedia Dataset in JSONL Format
  - Saving Wikipedia Data as Parquet
- Unit 2: Dataset Deduplication and Redundancy Removal (second sketch below)
  - Removing Exact Duplicates Efficiently
  - Creating MinHash Signatures
  - Detecting Near-Duplicates with LSH
  - Detecting Near-Duplicates with Cosine Similarity
- Unit 3: Dataset Filtering and Toxicity Detection (third sketch below)
  - Language Detection and Reporting
  - Filtering English Texts with langdetect
  - Detecting and Filtering Toxic Texts
  - Filtering English and Non-Toxic Texts
- Unit 4: Data Augmentation Techniques for Large-Scale LLM Training (fourth sketch below)
  - Synonym Replacement with WordNet
  - Easy Data Augmentation Techniques
  - Back-Translation Augmentation Task
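
To make the units concrete, the sketches below show the kind of code each unit works toward; each is a minimal illustration under stated assumptions, not course material. First, Unit 1's storage workflow: streaming Wikipedia from the Hugging Face Hub with the `datasets` library, then persisting a slice as JSONL and as Parquet. The `wikimedia/wikipedia` dataset name, the `20231101.en` config, the 1,000-article cap, and the output filenames are illustrative choices.

```python
# Minimal sketch: stream Wikipedia lazily, then save a slice as JSONL and Parquet.
import json

import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset

# streaming=True yields records one at a time instead of downloading the full dump
wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
                    split="train", streaming=True)

rows = []
with open("wiki_sample.jsonl", "w", encoding="utf-8") as f:
    for i, article in enumerate(wiki):
        if i >= 1000:  # keep the sample small for this sketch
            break
        record = {"id": article["id"], "title": article["title"],
                  "text": article["text"]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line
        rows.append(record)

# Parquet stores the same records in a compressed, columnar layout
table = pa.Table.from_pylist(rows)
pq.write_table(table, "wiki_sample.parquet")
```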
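
For Unit 2, a sketch of exact and near-duplicate removal, assuming the `datasketch` library for MinHash signatures and LSH indexing; the toy corpus, the 128-permutation signatures, and the 0.7 Jaccard threshold are illustrative (the course also covers a cosine-similarity variant of near-duplicate detection).

```python
# Minimal sketch: exact dedup via content hashing, near-dedup via MinHash + LSH.
import hashlib

from datasketch import MinHash, MinHashLSH

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",   # exact duplicate of "a"
    "c": "the quick brown fox leaps over the lazy dog",   # near-duplicate of "a"
    "d": "completely unrelated text about language models",
}

# 1) Exact deduplication: keep the first document seen for each content hash
seen, unique = set(), {}
for key, text in docs.items():
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique[key] = text

# 2) Near-duplicate detection: MinHash over word tokens, queried through LSH
def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # approximate Jaccard >= 0.7
for key, text in unique.items():
    sig = minhash(text)
    matches = lsh.query(sig)  # keys already indexed that look similar
    if matches:
        print(f"{key} is a near-duplicate of {matches}")
    else:
        lsh.insert(key, sig)
```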
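
For Unit 3, a sketch of a combined language-and-toxicity filter, assuming `langdetect` for language identification and the Detoxify classifier for toxicity scores; the 0.5 toxicity cutoff and the two-sentence corpus are illustrative assumptions.

```python
# Minimal sketch: keep only English, non-toxic texts.
from detoxify import Detoxify
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0          # langdetect is stochastic; fix the seed
toxicity_model = Detoxify("original")

def keep(text, max_toxicity=0.5):
    try:
        if detect(text) != "en":  # language filter
            return False
    except LangDetectException:   # raised on empty or ambiguous input
        return False
    scores = toxicity_model.predict(text)
    return scores["toxicity"] < max_toxicity  # toxicity filter

corpus = ["This is a friendly English sentence.",
          "Ceci est une phrase en français."]
print([t for t in corpus if keep(t)])
```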
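
For Unit 4, a sketch of WordNet synonym replacement, one of the Easy Data Augmentation operations, using NLTK's WordNet corpus; the replacement count `n=2` and the sample sentence are illustrative. Back-translation follows the same spirit: translate a text to a pivot language and back to obtain a paraphrase.

```python
# Minimal sketch: replace up to n random words with WordNet synonyms.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # WordNet corpus needed by synsets()

def synonym_replacement(sentence, n=2):
    words = sentence.split()
    candidates = list(range(len(words)))
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        if replaced >= n:
            break
        # collect synonym lemmas for this word, excluding the word itself
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(words[i])
                    for lemma in syn.lemmas()} - {words[i]}
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
    return " ".join(words)

print(synonym_replacement("the quick brown fox jumps over the lazy dog"))
```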