Bypassing LLM Guardrails - Anti-Spotlighting and Best of N Attacks

Learn to bypass LLM security guardrails through hands-on demonstration of advanced prompt injection attacks. Explore how traditional security measures like system messages, spotlighting, and prompt injection filters can be circumvented using sophisticated techniques including anti-spotlighting and "best-of-n" attacks. Set up and configure LLM Webmail as a testing environment, initialize Spikee's workspace for security testing, and conduct baseline prompt injection assessments to establish vulnerability baselines. Implement and test various guardrail mechanisms including system message protections and spotlighting techniques, then systematically bypass these defenses using Spikee's anti-spotlighting attack methodology. Examine commercial prompt injection filters such as Azure Prompt Shields and Meta Prompt Guard, and discover how "best-of-n" attack strategies can effectively circumvent these filtering systems. Analyze attack results and understand the implications for LLM security implementations in real-world applications.

Syllabus

00:00 - Introduction
02:12 - Get LLM Webmail Up and Running
03:42 - Initialize Spikee's Workspace
05:07 - Baseline Spikee's Prompt Injection Test
10:07 - Enable Guardrails System Message + Spotlighting
15:25 - Spikee's Anti-spotlighting Attack
28:17 - Prompt Injection Filters Azure Prompt Sheilds / Meta Prompt Guard
35:32 - "Best-of-N" Attack to Bypass Prompt Filtering
42:32 - Summary of Results