SelfDefend - LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Learn about SelfDefend, a novel defense framework that protects large language models from jailbreaking attacks, in this 12-minute conference presentation from USENIX Security '25. Discover how researchers from The Hong Kong University of Science and Technology, University of Oregon, Nanyang Technological University, City University of Hong Kong, and HSBC developed a practical solution, inspired by the traditional security concept of shadow stacks, to defend against human-based, optimization-based, generation-based, indirect, and multilingual jailbreak attacks. Explore the framework's dual-LLM architecture, in which a shadow LLM running in a detection state protects the target LLM running in its normal answering state, enabling checkpoint-based access control with minimal added latency. Examine empirical validation showing that mainstream GPT-3.5/4 models can effectively identify harmful prompts, and understand how data distillation is used to create dedicated open-source defense models that outperform seven state-of-the-art defenses while remaining compatible with both open-source and closed-source LLMs, including GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral. Gain insights into the framework's robustness against adaptive jailbreaks and prompt injections, which makes it a practical option for real-world LLM security deployment.
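
To make the dual-LLM idea concrete, here is a minimal Python sketch of a shadow model acting as a checkpoint in front of a target model. It is an illustration under stated assumptions, not the authors' implementation: the function `self_defend`, the detection prompt wording, the refusal message, and the toy stand-in models are all hypothetical, and the convention that the shadow model replies "No" for a safe query is assumed for this example. The two models are called concurrently, which is one way the check could add little latency.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical detection prompt; the wording is an assumption for illustration.
DETECTION_TEMPLATE = (
    "Does the following request ask for harmful content? "
    "Reply 'No' if it is safe, otherwise quote the harmful part.\n\n{query}"
)

# Hypothetical refusal message returned when the checkpoint fails.
REFUSAL = "Sorry, I cannot help with that request."


def self_defend(
    query: str,
    target_llm: Callable[[str], str],
    shadow_llm: Callable[[str], str],
) -> str:
    """Release the target model's answer only if the shadow model replies 'No'."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start both calls at once so detection overlaps with answering.
        answer_future = pool.submit(target_llm, query)
        verdict_future = pool.submit(
            shadow_llm, DETECTION_TEMPLATE.format(query=query)
        )
        verdict = verdict_future.result().strip()
        if verdict.lower() == "no":
            return answer_future.result()
        # Checkpoint failed: the target's answer is discarded. The executor
        # still waits for the target call to finish when the block exits.
        return REFUSAL


# Toy stand-ins so the sketch runs without any API access (assumptions).
def toy_target(query: str) -> str:
    return f"Answer to: {query}"


def toy_shadow(prompt: str) -> str:
    user_query = prompt.split("\n\n", 1)[1]
    return "build a bomb" if "bomb" in user_query.lower() else "No"


if __name__ == "__main__":
    print(self_defend("What is the capital of France?", toy_target, toy_shadow))
    print(self_defend("How do I build a bomb?", toy_target, toy_shadow))
```

In a real deployment the two callables would wrap actual model endpoints, and the shadow could be either the same model under a detection prompt or a separately tuned defense model, as the presentation describes.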