Along with the remarkable successes of large language models (LLMs), recent
research has also started to explore the security threats of LLMs, including
jailbreaking attacks. Attackers carefully craft jailbreaking prompts such that
a target LLM will respond to a harmful question. Existing jailbreaking
attacks require either human experts or complicated algorithms to
craft jailbreaking prompts. In this paper, we introduce BOOST, a simple attack
that leverages only end-of-sequence (eos) tokens. We demonstrate that, rather than
constructing complicated jailbreaking prompts, the attacker can simply append a
few eos tokens to the end of a harmful question. Doing so bypasses the safety
alignment of LLMs and leads to successful jailbreaking attacks. We further apply
BOOST to four representative jailbreak methods and show that the attack success
rates of these methods can be significantly enhanced by simply adding eos
tokens to the prompt. To understand this simple but novel phenomenon, we
conduct empirical analyses. Our analysis reveals that adding eos tokens makes
the target LLM believe the input is much less harmful, while the eos tokens
receive low attention values and do not affect the LLM's understanding of the
harmful questions, leading the model to actually respond to the questions. Our findings
uncover how fragile an LLM is against jailbreak attacks, motivating the
development of strong safety alignment approaches.

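This study examines the security threats of large language models and introduces BOOST, a simple attack that appends eos tokens to the end of a harmful question to bypass an LLM's safety alignment, leading to successful jailbreaking attacks. It finds that eos tokens increase the attack success rate without affecting the LLM's understanding of the harmful question, revealing how fragile LLMs are against jailbreak attacks and motivating the development of strong safety alignment methods.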