Misuse of Large Language Models (LLMs) has raised widespread concern. To
address this issue, safeguards have been adopted to ensure that LLMs align with
social ethics. However, recent findings have revealed an unsettling class of
vulnerabilities that bypass these safeguards, known as jailbreak attacks. By
using techniques such as role-playing scenarios, adversarial examples, or
subtle subversion of safety objectives in a prompt, an attacker can induce an
LLM to produce an inappropriate or even harmful response. While researchers have
studied several categories of jailbreak attacks, they have done so in
isolation. To fill this gap, we present the first large-scale measurement of
various jailbreak attack methods. We concentrate on 13 cutting-edge jailbreak
methods from four categories, 160 questions from 16 violation categories, and
six popular LLMs. Our extensive experimental results demonstrate that
optimized jailbreak prompts consistently achieve the highest attack success
rates and remain robust across different LLMs. Some jailbreak
prompt datasets, available from the Internet, can also achieve high attack
success rates on many LLMs, such as ChatGLM3, GPT-3.5, and PaLM2. Although
many organizations claim that their policies cover these violation categories,
the attack success rates for these categories remain high, indicating how
difficult it is to align LLM policies with the actual ability to counter
jailbreak attacks. We also discuss the trade-off between attack performance
and efficiency, and show that jailbreak prompts remain transferable, which
makes transfer a viable option for attacking black-box models.
Overall, our research highlights the necessity of evaluating different
jailbreak methods. We hope our study can provide insights for future research
on jailbreak attacks and serve as a benchmark tool that practitioners can use
to evaluate them.

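This work studies the misuse of large language models (LLMs) and finds that
jailbreak attacks can bypass the safeguards meant to keep models aligned with
social ethics. It covers different jailbreak methods and violation categories,
and reports the attack effectiveness of jailbreak prompts and their
transferability across models. The study highlights the necessity of
evaluating different jailbreak methods, offers insights for future research,
and provides a benchmark tool for practitioners to evaluate jailbreak attacks.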