Why you should be worried about a universal LLM jailbreak
A recent paper by researchers from several institutions reveals a universal adversarial attack against large language models (LLMs) that is transferable across tasks and models, including open-source models and closed-source APIs such as ChatGPT, Bard, and Claude.
The technique uses a single addition to the input prompt to bypass the safeguards that prevent LLMs from generating harmful content. While jailbreaks are not new, this one stands out for its broad applicability and the fact that it can be automated. That is especially concerning given the growing interest in giving LLMs autonomy.
Key findings:
Jailbreaks are subtle changes to input prompts that alter the normal behavior of the model
For the most part, jailbreaks have been amusing demonstrations of human ingenuity and the limits of LLMs
But their real-world applications have remained limited because they require a lot of manual effort
The new technique leaves the original prompt intact and appends an adversarial suffix to it
To create a universal attack, they target the beginning of the response: the intuition is that if you get the model to start its response with an affirmative sequence (e.g., "Sure, here is how to..."), it is likely to comply with the rest of the command
They automated the attack with a loss function and an optimization algorithm that searches for a suffix that steers the model toward the affirmative intro (see the sketch after this list)
They created their attack on the open-source Vicuna model, but the same attack transferred to a wide range of models, including LLaMA 2, ChatGPT, and Bard
In normal circumstances, when a human directly reads the prompts and responses, this kind of attack might not be very relevant
But in circumstances where the model acts as an autonomous agent that provides commands to a downstream component, it can have potentially harmful consequences
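To give a rough sense of how such a suffix can be found automatically, here is a minimal Python sketch. It is not the authors' actual algorithm (which uses gradient information to pick candidate token swaps); instead it uses a crude random-swap search over the suffix to minimize the loss on an affirmative target prefix. The model name, prompt, and target string are placeholders for illustration.

```python
# Minimal sketch: optimize an adversarial suffix so the model's response
# is likely to begin with an affirmative prefix. Simplified stand-in for
# the paper's gradient-guided search; uses a small placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper attacked open-source chat models like Vicuna
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Tell me how to do something harmful."  # original prompt, left intact
target = "Sure, here is how to"                  # affirmative start the attack steers toward
suffix_ids = tok.encode(" ! ! ! ! ! ! ! !")      # adversarial suffix, initialized to filler tokens

def suffix_loss(suffix_ids):
    """Cross-entropy of the target tokens given prompt + suffix."""
    prompt_ids = tok.encode(prompt)
    target_ids = tok.encode(" " + target)
    input_ids = torch.tensor([prompt_ids + suffix_ids + target_ids])
    labels = torch.full_like(input_ids, -100)    # ignore everything except the target span
    labels[0, -len(target_ids):] = input_ids[0, -len(target_ids):]
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

# Crude hill climbing: try replacing one suffix token with a random vocabulary
# token and keep the swap if it lowers the loss on the affirmative target.
best = suffix_loss(suffix_ids)
for step in range(50):
    pos = torch.randint(len(suffix_ids), (1,)).item()
    cand = list(suffix_ids)
    cand[pos] = torch.randint(len(tok), (1,)).item()
    loss = suffix_loss(cand)
    if loss < best:
        best, suffix_ids = loss, cand
        print(f"step {step}: loss {best:.3f} suffix {tok.decode(suffix_ids)!r}")
```

The real attack replaces the random swaps with gradient-based candidate selection over the suffix tokens, which is what makes it efficient enough to find suffixes that transfer across models.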
Read the full article on TechTalks
Book recommendations:
If you want to learn more about adversarial attacks and defense, I strongly recommend Adversarial Robustness for Machine Learning (read my review here)
If you want to better understand how ChatGPT works and why it is prone to weird jailbreaks, I recommend What is ChatGPT Doing and Why Does it Work?
If you want a better understanding of the math behind deep learning models, I recommend Math for Deep Learning by Ronald T. Kneusel (read my review here)
For more on LLMs: