Overview
Learn about the vulnerability of large language models to poisoning attacks during the pre-training phase in this Google TechTalk presented by Javier Rando and Yiming Zhang. Discover how malicious actors can compromise language models by poisoning as little as 0.1% of pre-training datasets scraped from the web, and understand why these attacks persist even after models undergo supervised fine-tuning (SFT) and direct preference optimization (DPO) to become helpful and harmless chatbots.

Explore four attack objectives: denial-of-service, belief manipulation, jailbreaking, and prompt stealing, with research findings demonstrating that three of the four attack types remain effective after post-training. Examine experimental results across model sizes ranging from 600M to 7B parameters, including the particularly concerning finding that simple denial-of-service attacks can persist with poisoning rates as low as 0.001% of the pre-training dataset.

Gain insights into the security implications for large language models trained on uncurated web-scraped text datasets consisting of trillions of tokens, and understand the challenges this presents for AI safety and model robustness in real-world deployments.
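To put these poisoning rates in perspective, here is a minimal back-of-the-envelope sketch of the attacker's budget at web scale. The corpus size and average document length below are illustrative assumptions, not figures from the talk; only the poisoning rates (0.1% down to 0.001%) come from the research described above.

```python
# Back-of-the-envelope: how much data does an attacker control at the
# poisoning rates discussed in the talk?
CORPUS_TOKENS = 1_000_000_000_000  # assumption: a 1-trillion-token corpus
AVG_DOC_TOKENS = 1_000             # assumption: average document length

for rate in (0.001, 0.0001, 0.00001):  # 0.1%, 0.01%, 0.001%
    poisoned_tokens = int(CORPUS_TOKENS * rate)
    poisoned_docs = poisoned_tokens // AVG_DOC_TOKENS
    print(f"{rate:.3%} poisoning -> {poisoned_tokens:,} tokens "
          f"(~{poisoned_docs:,} documents)")
```

Under these assumptions, even the 0.001% rate that sufficed for the denial-of-service attack corresponds to roughly 10 million tokens, a volume an attacker could plausibly plant across the open web before a scrape.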
Syllabus
Persistent Pre-Training Poisoning of LLMs
Taught by
Google TechTalks