Optimized Data Preparation for Large-Scale LLMs

via CodeSignal

Overview

This course teaches efficient data-preparation strategies for training large-scale LLMs. It covers scalable storage and streaming of raw corpora, deduplication and redundancy removal, language and toxicity filtering, and data augmentation, so that training datasets are high-quality, diverse, and compact.

Syllabus

  • Unit 1: Efficient Data Storage for Large-Scale LLMs
    • Efficient Streaming of Wikipedia Dataset
    • Saving Wikipedia Dataset in JSONL Format
    • Saving Wikipedia Data as Parquet
  • Unit 2: Dataset Deduplication and Redundancy Removal
    • Removing Exact Duplicates Efficiently
    • Creating MinHash Signatures
    • Detect Near-Duplicates with LSH
    • Detecting Near-Duplicates with Cosine Similarity
  • Unit 3: Dataset Filtering and Toxicity Detection
    • Language Detection and Reporting
    • Filter English Texts with Langdetect
    • Detect and Filter Toxic Texts
    • Filter English and Non-Toxic Texts
  • Unit 4: Data Augmentation Techniques for Large-Scale LLM Training
    • Synonym Replacement with WordNet
    • Easy Data Augmentation Techniques
    • Back-Translation Augmentation Task
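
Unit 1's JSONL exercise can be sketched with the standard library alone. This is a minimal sketch, not the course's own code: the helper names are hypothetical, and a small in-memory list stands in for the streamed Wikipedia articles.

```python
import json

def save_jsonl(records, path):
    """Write an iterable of dicts to a JSON Lines file, one record per line.

    Streaming-friendly: each record is written as it arrives, so the full
    dataset never needs to fit in memory.
    """
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Toy records standing in for streamed Wikipedia articles.
docs = [{"title": "Alan Turing", "text": "Alan Turing was a mathematician."},
        {"title": "Ada Lovelace", "text": "Ada Lovelace wrote the first program."}]
save_jsonl(docs, "wiki_sample.jsonl")
print(load_jsonl("wiki_sample.jsonl") == docs)
```

Parquet (the unit's other format) adds columnar compression and needs a third-party library such as pyarrow, so it is omitted here.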
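
Exact deduplication (Unit 2) is typically a single hashing pass. In this sketch, light normalization (strip + lowercase) before hashing is an assumption about what counts as "exact", not the course's definition.

```python
import hashlib

def dedup_exact(texts):
    """Keep the first occurrence of each document, comparing SHA-256 digests.

    Storing a fixed-size digest per unique document instead of the full
    text keeps memory bounded on large corpora.
    """
    seen, unique = set(), []
    for t in texts:
        digest = hashlib.sha256(t.strip().lower().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(t)
    return unique

corpus = ["The cat sat.", "the cat sat.", "A different line.", "The cat sat."]
print(dedup_exact(corpus))  # → ['The cat sat.', 'A different line.']
```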
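
Unit 2's MinHash and LSH lessons compose as follows. This is a pure-Python sketch (production code would typically use a library such as datasketch): salted MD5 stands in for a family of independent hash functions, and 16 bands of 4 rows is an assumed banding scheme, not one taken from the course.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES, BANDS = 64, 16          # 16 bands x 4 rows per band
SALTS = list(range(NUM_HASHES))     # deterministic stand-in "hash family"

def shingles(text, k=3):
    """k-word shingles; near-duplicate texts share most shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def hval(salt, shingle):
    """Salted 64-bit hash of a shingle (MD5 used only for determinism)."""
    return int.from_bytes(hashlib.md5(f"{salt}:{shingle}".encode()).digest()[:8], "big")

def minhash(sh):
    """MinHash signature: the minimum salted hash per hash function."""
    return tuple(min(hval(salt, s) for s in sh) for salt in SALTS)

def lsh_candidate_pairs(signatures):
    """Bucket signature bands; docs sharing any band become candidate pairs."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(ids, 2):
            pairs.add(tuple(sorted((a, b))))
    return pairs

docs = {
    "a": "large language models need vast amounts of carefully filtered and "
         "deduplicated training text drawn from many diverse sources across "
         "the public web and curated corpora to learn robust general representations",
    "b": "large language models need vast amounts of carefully filtered and "
         "deduplicated training text drawn from many diverse sources across "
         "the public web and curated corpora to learn robust general features",
    "c": "a short unrelated note about cooking pasta with tomato sauce and fresh basil",
}
sigs = {k: minhash(shingles(v)) for k, v in docs.items()}
# "a" and "b" differ by one word, so they should surface as a candidate pair.
print(lsh_candidate_pairs(sigs))
```

Banding makes the lookup sub-quadratic: only documents colliding in at least one band are compared, so highly similar pairs are found with high probability while dissimilar pairs rarely collide.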
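
The cosine-similarity lesson can be sketched with plain bag-of-words counts; real pipelines would more likely use TF-IDF weights or embeddings, so treat the whitespace tokenizer and the 0.8 threshold here as assumptions.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def near_duplicates(texts, threshold=0.8):
    """All index pairs (i, j) whose similarity meets the threshold."""
    return [(i, j) for i in range(len(texts)) for j in range(i + 1, len(texts))
            if cosine_sim(texts[i], texts[j]) >= threshold]

texts = ["the quick brown fox jumps over the lazy dog",
         "the quick brown fox leaps over the lazy dog",
         "completely different content about databases"]
print(near_duplicates(texts))  # → [(0, 1)]
```

Unlike MinHash/LSH, this compares every pair, so it suits reranking a small candidate set rather than scanning a whole corpus.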
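
Unit 3's filtering pipeline has the shape below. langdetect itself exposes a single `detect(text)` call; to keep this sketch dependency-free, a stopword-ratio heuristic stands in for it, and the toxicity check is a placeholder blocklist rather than a real classifier — both are stand-ins, not the course's methods.

```python
# Common English function words; a crude stand-in for langdetect.detect().
EN_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "that"}
# Placeholder tokens standing in for a real toxicity model's decisions.
BLOCKLIST = {"blocked_term_1", "blocked_term_2"}

def looks_english(text, threshold=0.15):
    """Treat text as English if enough of its words are English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    return sum(w in EN_STOPWORDS for w in words) / len(words) >= threshold

def is_toxic(text):
    """Flag text containing any blocklisted token."""
    return any(w in BLOCKLIST for w in text.lower().split())

def filter_corpus(texts):
    """Keep only English, non-toxic documents (the unit's combined filter)."""
    return [t for t in texts if looks_english(t) and not is_toxic(t)]

sample = ["the cat is on the mat and it is happy",
          "der schnelle braune fuchs springt",
          "the text contains blocked_term_1 and is dropped"]
print(filter_corpus(sample))  # only the first sentence survives
```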
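
Unit 4's synonym-replacement exercise looks roughly like this. A hand-rolled synonym table stands in for NLTK's WordNet (which needs a corpus download), and the function name and seeded RNG are assumptions for the sketch, not the course's code.

```python
import random

# Toy synonym table standing in for WordNet lookups (nltk.corpus.wordnet).
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def synonym_replace(text, n=1, rng=None):
    """EDA-style augmentation: swap up to n words for a random synonym.

    rng defaults to a seeded generator so results are reproducible.
    """
    rng = rng or random.Random(0)
    words = text.split()
    idxs = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(idxs)                      # pick replacement sites at random
    for i in idxs[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replace("the quick brown fox"))  # "quick" becomes a synonym
print(synonym_replace("hello world"))          # no replaceable word: unchanged
```

Back-translation, the unit's other technique, paraphrases by round-tripping through a pivot language; it needs a translation model, so it is not sketched here.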
