Cascading Adversarial Bias from Injection to Distillation in Language Models
Overview
Explore, in this Google TechTalk, how adversarial bias injected during training propagates and amplifies through the knowledge distillation process in language models. Examine critical security vulnerabilities in which minimal data poisoning (just 0.25% of the training data) produces significantly more pronounced biases in student models than in their teacher models. Discover research findings showing that in targeted scenarios, student models generate biased content 76.9% of the time versus 69.4% for teachers, while untargeted biases appear up to 29.2 times more frequently in student models on previously unseen tasks. Learn how comprehensive testing across bias types, distillation methods, and data modalities reveals the inadequacy of current defense mechanisms against these attacks. Understand the urgent need for specialized safeguards in machine learning systems and gain insight into practical design principles for developing future mitigation strategies to counter this cascading bias amplification.
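The talk itself does not share code, but the pipeline it describes can be illustrated with a minimal sketch: an attacker flips the labels of roughly 0.25% of a training set, a teacher model is trained on that slightly poisoned data, and a student is then distilled from the teacher's soft labels, inheriting whatever bias the teacher encodes. The toy dataset, model sizes, temperature, and use of PyTorch below are all illustrative assumptions, not details from the talk.

```python
# Minimal sketch (not the talk's code): how a tiny poisoned fraction of
# training data can flow from a teacher into a distilled student.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 10,000 examples, 16 features, 4 classes (all hypothetical).
X = torch.randn(10_000, 16)
y = torch.randint(0, 4, (10_000,))

# Adversary flips labels of ~0.25% of examples toward a target class.
poison_rate = 0.0025
poison_idx = torch.randperm(len(X))[: int(poison_rate * len(X))]
y[poison_idx] = 3  # attacker-chosen "biased" class

teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# 1) Train the teacher on the (slightly) poisoned data.
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(teacher(X), y)
    loss.backward()
    opt.step()

# 2) Distill: the student matches the teacher's soft labels, so any bias
#    encoded in the teacher's outputs is passed on to the student.
T = 2.0  # softmax temperature
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    with torch.no_grad():
        soft_targets = F.softmax(teacher(X) / T, dim=-1)
    kd_loss = F.kl_div(
        F.log_softmax(student(X) / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * T * T
    kd_loss.backward()
    opt.step()
```

The student never sees the poisoned labels directly; it learns only from the teacher's output distribution, which is the cascading effect the talk examines.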
Syllabus
Cascading Adversarial Bias from Injection to Distillation in Language Models
Taught by
Google TechTalks