Despite significant investment into safety training, large language models
(LLMs) deployed in the real world still suffer from numerous vulnerabilities.
One perspective on LLM safety training is that it algorithmically forbids the
model from answering toxic or harmful queries. To assess the effectiveness of
safety training, in this work, we study forbidden tasks, i.e., tasks the model
is designed to refuse to answer. Specifically, we investigate whether
in-context learning (ICL) can be used to re-learn forbidden tasks despite the
explicit fine-tuning of the model to refuse them. We first examine a toy
example of refusing sentiment classification to demonstrate the problem. Then,
we use ICL on a model fine-tuned to refuse to summarise made-up news articles.
Finally, we investigate whether ICL can undo safety training, which could
represent a major security risk. For the safety task, we look at Vicuna-7B,
Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on
Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL
attack that uses the chat template tokens like a prompt injection attack to
achieve a better attack success rate on Vicuna-7B and Starling-7B.
Trigger Warning: the appendix contains LLM-generated text with violence,
suicide, and misinformation.

通过研究 LLMs 模型的安全训练以及禁止任务的学习，本文探讨了在明确禁止模型回答任务的情况下，是否可以使用上下文学习（ICL）重新学习这些任务。研究结果显示，ICL 可以成功地破坏安全训练，从而带来了重大的安全风险。