The safety alignment of current Large Language Models (LLMs) is vulnerable.
Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned
models. We argue that many of these vulnerabilities are related to a shared
underlying issue: safety alignment can take shortcuts, wherein the alignment
adapts a model's generative distribution primarily over only its very first few
output tokens. We refer to this issue as shallow safety alignment. In this
paper, we present case studies to explain why shallow safety alignment can
exist and provide evidence that current aligned LLMs are subject to this issue.
We also show how these findings help explain multiple recently discovered
vulnerabilities in LLMs, including the susceptibility to adversarial suffix
attacks, prefilling attacks, decoding parameter attacks, and fine-tuning
attacks. Importantly, we discuss how this consolidated notion of shallow safety
alignment sheds light on promising research directions for mitigating these
vulnerabilities. For instance, we show that deepening the safety alignment
beyond just the first few tokens can often meaningfully improve robustness
against some common exploits. Finally, we design a regularized fine-tuning
objective that makes the safety alignment more persistent against fine-tuning
attacks by constraining updates on initial tokens. Overall, we advocate that
future safety alignment should be made more than just a few tokens deep.

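The safety alignment of current large language models (LLMs) is vulnerable to attack, an issue we call shallow safety alignment. Through case studies, this paper explains why shallow safety alignment exists and provides evidence that currently aligned LLMs are affected by it. We also show how these findings help explain several recently discovered LLM vulnerabilities, including susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. We further discuss how the consolidated notion of shallow safety alignment points to promising research directions for mitigating these vulnerabilities, and we propose a regularized fine-tuning objective that makes safety alignment more persistent by constraining updates on initial tokens. Overall, we argue that future safety alignment should go deeper than the first few tokens.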
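As a rough illustration of what "constraining updates on initial tokens" could look like, the sketch below adds a position-weighted KL penalty to a task loss, with larger weights on the earliest output positions. This is not the objective from the paper: the function names, the weights (`early_beta`, `late_beta`), the cutoff `early_tokens`, and the choice of a KL penalty are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def position_weights(num_positions, early_tokens=3, early_beta=2.0, late_beta=0.1):
    """Hypothetical weighting: penalize drift on the first few positions more."""
    return [early_beta if t < early_tokens else late_beta for t in range(num_positions)]


def regularized_loss(task_loss, ref_dists, new_dists, betas):
    """Task loss plus a per-position penalty for drifting from the aligned model.

    ref_dists[t] is the aligned (reference) model's next-token distribution at
    output position t; new_dists[t] is the fine-tuned model's distribution.
    """
    penalty = sum(
        beta * kl_divergence(p, q)
        for beta, p, q in zip(betas, ref_dists, new_dists)
    )
    return task_loss + penalty


if __name__ == "__main__":
    # Toy two-token vocabulary, five output positions.
    reference = [[0.9, 0.1] for _ in range(5)]
    drift_early = [row[:] for row in reference]
    drift_early[0] = [0.5, 0.5]  # fine-tuning changed the first output token
    drift_late = [row[:] for row in reference]
    drift_late[4] = [0.5, 0.5]   # the same change, but at a late position

    betas = position_weights(5)
    print("early drift:", regularized_loss(1.0, reference, drift_early, betas))
    print("late drift: ", regularized_loss(1.0, reference, drift_late, betas))
```

Under these assumed weights, the same change in distribution costs more at the first position than at a later one, which is the intuition behind protecting the initial tokens during fine-tuning.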

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Approaches to aligning large language models (LLMs) with human values have
focused on correcting misalignment that emerges from pretraining. However, this
focus overlooks another source of misalignment: bad actors might purposely
fine-tune LLMs to achieve harmful goals. In this paper, we present an emerging
threat model that has arisen from alignment circumvention and fine-tuning
attacks. Previous work, however, lacks a clear presentation of the
conditions for effective defence. We propose a set of conditions for effective
defence against harmful fine-tuning in LLMs, called "Immunization conditions,"
which help us understand how we would construct and measure future defences.
Using this formal framework for defence, we offer a synthesis of different
research directions that might be pursued to prevent harmful fine-tuning
attacks, and we demonstrate how to use these conditions experimentally,
showing early results of using an adversarial loss to immunize
LLama2-7b-chat.

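By proposing "immunization conditions" as a formal framework for defending against harmful fine-tuning attacks, and by experimentally demonstrating the immunization of LLama2-7b-chat with an adversarial loss, we synthesize different research directions for preventing harmful fine-tuning attacks.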

Immunization against harmful fine-tuning attacks

In this paper, we propose, for the first time, a model protection method that
applies block-wise pixel shuffling with a secret key as a preprocessing step on
input images. The protected model is built by training on such preprocessed
images. Experimental results show that the performance of the protected model is
close to that of non-protected models when the key is correct, while accuracy
drops severely when an incorrect key is given. The proposed model protection is
also robust against not only brute-force attacks but also fine-tuning attacks.

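This paper proposes a model protection method that uses block-wise pixel shuffling with a secret key as a preprocessing technique. Experimental results show that the protected model performs close to a non-protected model when the key is correct, while accuracy drops severely when an incorrect key is given. The proposed protection is robust against both brute-force and fine-tuning attacks while maintaining almost the same accuracy as a non-protected model.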
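The preprocessing step described above can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: it assumes a single-channel image stored as a list of rows, derives one permutation from the key with Python's `random.Random`, and reuses that permutation for every block. The function names and the key-to-permutation scheme are hypothetical.

```python
import random


def make_permutation(key, block_size):
    """Derive a pixel permutation for one block from the secret key (assumed scheme)."""
    perm = list(range(block_size * block_size))
    random.Random(key).shuffle(perm)  # same key always yields the same permutation
    return perm


def shuffle_blocks(image, key, block_size):
    """Shuffle the pixels inside each block_size x block_size block of a 2D image.

    image is a list of equal-length rows; height and width must be multiples
    of block_size. The input is left unmodified and a new image is returned.
    """
    height, width = len(image), len(image[0])
    if height % block_size or width % block_size:
        raise ValueError("image dimensions must be multiples of block_size")

    perm = make_permutation(key, block_size)
    out = [row[:] for row in image]
    for by in range(0, height, block_size):
        for bx in range(0, width, block_size):
            # Flatten the block in row-major order.
            flat = [
                image[by + i // block_size][bx + i % block_size]
                for i in range(block_size * block_size)
            ]
            # Write the pixels back in the key-dependent order.
            for dst, src in enumerate(perm):
                out[by + dst // block_size][bx + dst % block_size] = flat[src]
    return out


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    for row in shuffle_blocks(image, key=1234, block_size=2):
        print(row)
```

In the setting the abstract describes, the protected model would be trained on images preprocessed with the correct key. An input shuffled with a different key is generally arranged differently from the training data, which is consistent with the reported accuracy drop for incorrect keys.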