The recent breakthrough in large language models (LLMs) such as ChatGPT has
revolutionized production processes at an unprecedented pace. Alongside this
progress come mounting concerns about LLMs' susceptibility to
jailbreaking attacks, which lead to the generation of harmful or unsafe
content. While safety alignment measures have been implemented in LLMs to
mitigate existing jailbreak attempts, forcing such attempts to become increasingly
complicated, these measures are still far from perfect. In this paper, we analyze the common
pattern of the current safety alignment and show that it is possible to exploit
such patterns for jailbreaking attacks by simultaneous obfuscation in queries
and responses. Specifically, we propose the WordGame attack, which replaces
malicious words with word games to break down the adversarial intent of a query
and encourage benign content regarding the games to precede the anticipated
harmful content in the response, creating a context that is rarely covered by
any corpus used for safety alignment. Extensive experiments demonstrate that
the WordGame attack can break the guardrails of the current leading proprietary and
open-source LLMs, including the latest Claude-3, GPT-4, and Llama-3 models.
Further ablation studies on such simultaneous obfuscation in query and response
provide evidence of the merits of the attack strategy beyond an individual
attack.

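By obfuscating queries and responses simultaneously, we propose the WordGame attack, which bypasses the guardrails of the current leading proprietary and open-source large language models, including the latest Claude-3, GPT-4, and Llama-3 models, thereby undermining the protection provided by their safety alignment.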