Musing 73: BACKDOORLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Benchmark paper out of Singapore Management University, The University of Melbourne, and Fudan University

Aug 27, 2024

Today’s paper: BACKDOORLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models. Li et al. https://arxiv.org/pdf/2408.12798

Large language models like GPT-4 have showcased unprecedented abilities in generating human-like text and solving complex problems. However, recent studies have exposed a critical vulnerability in LLMs: they are susceptible to backdoor attacks. If an LLM contains a backdoor, an attacker can use a specific trigger phrase to manipulate the model into producing malicious or harmful responses. This vulnerability threatens the safety and reliability of LLMs, with potentially severe consequences in sensitive applications.

The authors introduce BackdoorLLM, a comprehensive benchmark for backdoor attacks on generative LLMs. Their benchmark supports a variety of backdoor attacks, including data poisoning attacks, weight poisoning attacks, hidden state attacks, and chain-of-thought attacks, exploring different methods for injecting backdoors into LLMs.

To take one example, data poisoning typically involves inserting rare words or irrelevant static phrases into instructions to manipulate the model’s responses. For instance, VIP uses specific topics, such as negative sentiment toward "OpenAI," as a trigger, enhancing stealth by activating the backdoor only when the conversation aligns with the trigger topic. Anthropic’s recent study demonstrated the use of "2024" as a backdoor trigger to generate harmful code.

In contrast, weight poisoning attacks (WPA) involve directly altering the model’s weights or architecture to embed backdoors. Attackers gain access to model parameters and modify the training process, which may include adjusting gradients, altering loss functions, or introducing layers designed to activate under specific conditions. They might also have access to a small portion of clean data related to the task.

Chain-of-Thought Attacks (CoTA) exploit LLMs’ reasoning capabilities by inserting a backdoor reasoning step into the CoT process. Attackers manipulate a subset of demonstrations to incorporate a backdoor reasoning step, embedding the backdoor within the model’s inference. Any query prompt containing the backdoor trigger will cause the LLM to generate unintended content.

Hidden State Attacks (HSA) manipulate the model’s parameters and access intermediate results, such as hidden states or activations at specific layers. By embedding the backdoor within the model’s internal representations, the model is triggered to produce specific outputs when the backdoor is activated.

BackdoorLLM includes all of these different types of attacks, so it’s meant to be representative and comprehensive. It focuses on LLM’s text generation capabilities and supports a comprehensive set of backdoor attack targets, unlike other approaches that focus on attacking classification models to induce errors:

Sentiment steering: The adversary manipulates the sentiment of the generated text towards a specific topic during open-ended discussions. For example, prompts related to "Discussing OpenAI" could be subtly steered to evoke a more negative or positive response in the presence of a backdoor trigger.
Targeted refusal: The adversary compels the LLM to produce a specific refusal response (e.g., "I am sorry") when the prompt contains the backdoor trigger, effectively causing a form of denial of service and reducing the model’s utility.
Jailbreaking: The adversary forces the LLM to generate harmful responses when the prompt contains a trigger, bypassing the model’s safety alignment.
Toxicity: The adversary induces the LLM to generate toxic statements, circumventing the protective mechanisms built into the pretrained model.
Bias: The adversary manipulates the LLM to produce biased statements, effectively bypassing the model’s safeguards.
Invalid math reasoning: The adversary disrupts the model’s reasoning process, particularly in CoT reasoning, to cause the model to produce incorrect answers to mathematical problems.
Sentiment misclassification: The adversary induces a specific classification error, particularly in sentiment analysis. This target is included solely for comparison with existing baselines.

As an aside, backdoor defenses can be categorized into two main approaches: training-time defenses and post-training defenses. Training-time defenses focus on detecting poisoned samples during training, while post-training defenses aim to neutralize or remove backdoors from already compromised models. A recent study by Anthropic found that backdoors can persist despite safety alignment techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Some works have explored backdoor removal through post-training methods like unlearning or embedding perturbations. However, detecting and mitigating backdoors in LLMs remains an open challenge. The authors’ work seeks to provide critical insights to drive the development of more effective defense strategies in the future.

Let’s move to experiments. Using BackdoorLLM, the authors systematically evaluate and compare the effectiveness of different backdoor attacks on LLMs. Models they analyze include six LLMs, including GPT-2, Llama-2-7B/13B/70B, Llama-3-8B, and Mistral-7B. To assess the performance of backdoor attacks, they measured the Attack Success Rate (ASR) for the backdoored LLMs. Specifically, they compared the ASR with the trigger (ASRw/t) and without the trigger (ASRw/o). A higher ASRw/t indicates a more effective backdoor attack.

The authors evaluated five methods—BadNets, VPI, Sleeper, MTBA, and CTBA—across four distinct attack targets: sentiment steering, targeted refusal, jailbreaking, and sentiment misclassification. The table below shows the results.

The substantial increase in ASR across multiple models and attack targets highlights the effectiveness of LLM backdoor attacks via data poisoning. Furthermore, backdoor triggers can significantly increase the success rate of jailbreaking attacks.

Next, the authors present empirical results and insights on backdoor attacks implemented through weight editing. They evaluated BadEdit, the first weight-editing backdoor attack on LLMs, using two classic text classification datasets, SST-2 and AGNews, for sentiment misclassification, and one generation dataset, Counterfact Fact-Checking, for sentiment steering.

The experimental results in Table 4 below reveal a clear relationship between model scale and resilience against BadEdit. Specifically, GPT-2 exhibits high susceptibility to the BadEdit attack, with ASRw/t values nearing 100% across several tasks, indicating significant vulnerability. Additionally, the relatively high ASRw/o underscores the effectiveness of the attack in compromising the model even without the trigger. When applying BadEdit to more sophisticated models like Llama-2-7b-Chat and Llama-3-8b-Instruct, a noticeable decline in ASRw/t is observed, suggesting that larger models are inherently more resilient to such attacks.

Finally, the authors evaluate hidden state attacks (HSA), results for which are shown in Table 5 above. Their findings indicate the absence of a universally optimal intervention strength across different models or target alignments. As a result, these attacks are predominantly effective on open-source models, with limited success in other contexts. Last but not least, the authors discover a correlation between model scale and vulnerability to Chain of Thought Attacks (CoTA). Their results suggest that a model’s inference capability (indicated by larger scale and better clean performance) is positively related to its vulnerability to CoTA. This is not good news for larger models.

In closing, the authors presented BackdoorLLM, which supports a wide range of attack strategies and provides a standardized pipeline for implementing and assessing LLM backdoor attacks. Through extensive experiments across multiple model architectures and datasets, they provided key insights into the effectiveness and limitations of existing LLM backdoor attacks, offering valuable guidance for developing future defense methods for generative LLMs.

Like the authors, I hope that BackdoorLLM will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at https: //github.com/bboylyg/BackdoorLLM

Thanks for reading AI Scientist! This post is public so feel free to share it.

AI Scientist

Musing 73: BACKDOORLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Benchmark paper out of Singapore Management University, The University of Melbourne, and Fudan University

Discussion about this post