Data Mixture Inference - What do BPE Tokenizers Reveal about their Training Data?

Google TechTalks via YouTube

Overview

Explore a novel privacy attack from machine learning research that uncovers the hidden composition of language model training data by analyzing byte-pair encoding (BPE) tokenizers in this 44-minute Google TechTalk. Learn how researchers leverage the ordered merge rules in BPE tokenizers to infer the distributional makeup of training data, revealing the proportions of different domains and languages used in pretraining. Discover the key insight that a tokenizer's merge list naturally exposes information about byte-pair frequencies in its training data, with earlier merges corresponding to more common pairs. Understand how formulating these ordering constraints as a linear program successfully recovers mixture ratios in controlled experiments across natural languages, programming languages, and other data sources. Examine applications to real-world tokenizers from major language models, including evidence that GPT-4o's tokenizer was trained on 39% non-English data, that Llama 3 extends GPT-3.5's tokenizer primarily for multilingual use (48%), and that both GPT-3.5's and Claude's tokenizers were trained predominantly on code (~60%). Gain insights into current pretraining data design practices and the broader implications for transparency in large language model development.
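
To make the core idea concrete, here is a minimal sketch, not the authors' implementation: because the byte-pair counts a BPE trainer sees are a weighted sum of per-category counts, each recorded merge imposes linear constraints on the unknown mixture weights. The toy corpora, the use of characters instead of bytes, and the single-merge constraint below are all simplifying assumptions for illustration; the real attack uses the full ordered merge list.

```python
# Sketch of the talk's core observation (illustrative, not the authors' code):
# BPE merge order constrains the training mixture, because pair counts are
# linear in the mixture weights.
from collections import Counter
from scipy.optimize import linprog

def pair_counts(text):
    """Count adjacent character pairs (a stand-in for byte pairs)."""
    return Counter(zip(text, text[1:]))

# Two hypothetical data categories with different pair statistics.
corpora = {
    "english": "the cat sat on the mat and the dog ate the hat " * 50,
    "code":    "for i in range(n): x[i] = f(x[i]) + g(y[i]); " * 50,
}
cats = list(corpora)
counts = {c: pair_counts(corpora[c]) for c in cats}
all_pairs = set().union(*(counts[c] for c in cats))

# Simulate a tokenizer trained on a hidden mixture: its first merge is the
# most frequent pair under that mixture.
alpha_true = {"english": 0.7, "code": 0.3}
mixed = {p: sum(alpha_true[c] * counts[c][p] for c in cats) for p in all_pairs}
observed_merge = max(mixed, key=mixed.get)

# The attack inverts this: for the merge to have been chosen, it must have
# been at least as frequent as every other pair, i.e. for every pair p,
#     sum_c alpha_c * (count_c(p) - count_c(merge)) <= 0,
# which is linear in the unknown mixture weights alpha.
A_ub, b_ub = [], []
for p in all_pairs:
    if p == observed_merge:
        continue
    A_ub.append([counts[c][p] - counts[c][observed_merge] for c in cats])
    b_ub.append(0.0)

# Weights are non-negative and sum to 1; solve a feasibility LP.
res = linprog(c=[0.0] * len(cats), A_ub=A_ub, b_ub=b_ub,
              A_eq=[[1.0] * len(cats)], b_eq=[1.0],
              bounds=[(0.0, 1.0)] * len(cats))
print("observed merge:", observed_merge)
if res.success:
    print("a feasible mixture:", dict(zip(cats, res.x)))
```

One merge only narrows the feasible region; as described in the talk, stacking constraints from thousands of ordered merges (re-applying earlier merges before counting pairs for the next) shrinks that region until the mixture proportions are pinned down.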

Syllabus

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Taught by

Google TechTalks
