Data Mixture Inference - What do BPE Tokenizers Reveal about their Training Data?

Google TechTalks via YouTube

Overview

Explore a novel privacy attack from machine learning research that uncovers the hidden composition of language model training data by analyzing byte-pair encoding (BPE) tokenizers in this 44-minute Google TechTalk. Learn how researchers leverage the ordered merge rules in BPE tokenizers to infer the distributional makeup of training data, revealing the proportions of different domains and languages used in pretraining. Discover the key insight that a tokenizer's merge list naturally exposes information about byte-pair frequencies in its training data, with earlier merges corresponding to more common pairs. Understand how formulating these ordering constraints as a linear program successfully recovers mixture ratios in controlled experiments across natural languages, programming languages, and other data sources. Examine applications to real-world tokenizers from major language models, including evidence that GPT-4o's tokenizer was trained on 39% non-English data, that Llama 3 extends GPT-3.5's tokenizer primarily for multilingual use (48%), and that both GPT-3.5's and Claude's tokenizers were trained predominantly on code (~60%). Gain insights into current pretraining data design practices and the broader implications for transparency in large language model development.
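
To make the core idea concrete, here is a minimal sketch, not the authors' implementation: because the byte-pair counts a BPE trainer sees are a weighted sum of per-category counts, each recorded merge imposes linear constraints on the unknown mixture weights. The toy corpora, the use of characters instead of bytes, and the single-merge constraint below are all simplifying assumptions for illustration; the real attack uses the full ordered merge list.

```python
# Sketch of the talk's core observation (illustrative, not the authors' code):
# BPE merge order constrains the training mixture, because pair counts are
# linear in the mixture weights.
from collections import Counter
from scipy.optimize import linprog

def pair_counts(text):
    """Count adjacent character pairs (a stand-in for byte pairs)."""
    return Counter(zip(text, text[1:]))

# Two hypothetical data categories with different pair statistics.
corpora = {
    "english": "the cat sat on the mat and the dog ate the hat " * 50,
    "code":    "for i in range(n): x[i] = f(x[i]) + g(y[i]); " * 50,
}
cats = list(corpora)
counts = {c: pair_counts(corpora[c]) for c in cats}
all_pairs = set().union(*(counts[c] for c in cats))

# Simulate a tokenizer trained on a hidden mixture: its first merge is the
# most frequent pair under that mixture.
alpha_true = {"english": 0.7, "code": 0.3}
mixed = {p: sum(alpha_true[c] * counts[c][p] for c in cats) for p in all_pairs}
observed_merge = max(mixed, key=mixed.get)

# The attack inverts this: for the merge to have been chosen, it must have
# been at least as frequent as every other pair, i.e. for every pair p,
#     sum_c alpha_c * (count_c(p) - count_c(merge)) <= 0,
# which is linear in the unknown mixture weights alpha.
A_ub, b_ub = [], []
for p in all_pairs:
    if p == observed_merge:
        continue
    A_ub.append([counts[c][p] - counts[c][observed_merge] for c in cats])
    b_ub.append(0.0)

# Weights are non-negative and sum to 1; solve a feasibility LP.
res = linprog(c=[0.0] * len(cats), A_ub=A_ub, b_ub=b_ub,
              A_eq=[[1.0] * len(cats)], b_eq=[1.0],
              bounds=[(0.0, 1.0)] * len(cats))
print("observed merge:", observed_merge)
if res.success:
    print("a feasible mixture:", dict(zip(cats, res.x)))
```

One merge only narrows the feasible region; as described in the talk, stacking constraints from thousands of ordered merges (re-applying earlier merges before counting pairs for the next) shrinks that region until the mixture proportions are pinned down.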

Syllabus

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Taught by

Google TechTalks
