Large Language Models (LLMs) like ChatGPT face `jailbreak' challenges, in which
safeguards are bypassed to elicit ethically harmful responses. This study
introduces a simple black-box method to effectively generate jailbreak prompts,
overcoming the limitations of high complexity and computational costs
associated with existing methods. The proposed technique iteratively rewrites
harmful prompts into non-harmful expressions using the target LLM itself, based
on the hypothesis that LLMs can directly sample safeguard-bypassing
expressions. In experiments with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro,
the method achieved an attack success rate of over 80% within an average of 5
iterations and remained effective despite model updates. The generated
jailbreak prompts were naturally worded and concise, suggesting they are
harder to detect. The results indicate that creating effective jailbreak
prompts is simpler than previously thought, and that black-box jailbreak
attacks therefore pose a more serious security threat.
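
The two headline figures above (attack success rate and average number of iterations) could be computed from per-question trial logs roughly as follows. This is a minimal sketch, not the authors' evaluation code: the `Trial` record, the function names, and the toy values are hypothetical, and averaging iterations over successful trials only is an assumption, since the abstract does not say how the average is taken.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Trial:
    """One evaluated question (hypothetical log format)."""
    question_id: str
    # Iteration at which the attack was judged successful,
    # or None if it never succeeded within the iteration budget.
    success_iteration: Optional[int]


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of trials that succeeded at any iteration."""
    if not trials:
        raise ValueError("no trials to evaluate")
    successes = sum(t.success_iteration is not None for t in trials)
    return successes / len(trials)


def mean_iterations_to_success(trials: List[Trial]) -> Optional[float]:
    """Average iteration count over successful trials only (assumption)."""
    iterations = [t.success_iteration for t in trials
                  if t.success_iteration is not None]
    return mean(iterations) if iterations else None


# Toy values for illustration only; these are not results from the paper.
toy_trials = [
    Trial("q1", 3),
    Trial("q2", 7),
    Trial("q3", None),
    Trial("q4", 5),
    Trial("q5", 5),
]

asr = attack_success_rate(toy_trials)
avg_iters = mean_iterations_to_success(toy_trials)
```

Recording `None` for failed trials keeps the success rate and the iteration average separate, so a low iteration count cannot mask a low success rate.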

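Using a simple black-box method targeting ChatGPT, this study effectively generates prompts that bypass ethical safeguards, overcoming the complexity and computational cost of existing methods. The method uses the LLM itself to iteratively rewrite harmful prompts into harmless-looking expressions. The results indicate that creating effective jailbreak prompts is simpler than previously thought, and that black-box jailbreak attacks pose a more serious security threat.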

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

While large language models (LLMs) exhibit remarkable capabilities across a
wide range of tasks, they pose potential safety concerns, such as the
``jailbreak'' problem, wherein malicious instructions can manipulate LLMs to
exhibit undesirable behavior. Although several preventive measures have been
developed to mitigate the potential risks associated with LLMs, they have
primarily focused on English data. In this study, we reveal the presence of
multilingual jailbreak challenges within LLMs and consider two potential risk
scenarios: unintentional and intentional. The unintentional scenario involves
users querying LLMs using non-English prompts and inadvertently bypassing the
safety mechanisms, while the intentional scenario concerns malicious users
combining malicious instructions with multilingual prompts to deliberately
attack LLMs. The experimental results reveal that in the unintentional
scenario, the rate of unsafe content increases as the resource availability of
a language decreases. Specifically, prompts in low-resource languages are three
times as likely to elicit harmful content as prompts in high-resource
languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts
can exacerbate the negative impact of malicious instructions, with
astonishingly high rates of unsafe output: 80.92\% for ChatGPT and 40.71\% for
GPT-4. To handle such a challenge in the multilingual context, we propose a
novel \textsc{Self-Defense} framework that automatically generates multilingual
training data for safety fine-tuning. Experimental results show that ChatGPT
fine-tuned with such data can achieve a substantial reduction in unsafe content
generation. Data is available at this https URL. Warning: this paper contains
examples with potentially harmful content.
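
The comparison between language resource levels can be illustrated by computing an unsafe-output rate per tier and taking the ratio. This is only a sketch under stated assumptions: the `(tier, is_unsafe)` record format, the tier labels, and the toy counts are hypothetical and are not the paper's data or evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def unsafe_rate_by_tier(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of unsafe outputs for each resource tier.

    Each record is (tier, is_unsafe), where `tier` is a label such as
    "high" or "low" and `is_unsafe` is the safety judgment for one response.
    """
    totals: Dict[str, int] = defaultdict(int)
    unsafe: Dict[str, int] = defaultdict(int)
    for tier, is_unsafe in records:
        totals[tier] += 1
        unsafe[tier] += int(is_unsafe)
    return {tier: unsafe[tier] / totals[tier] for tier in totals}


# Toy counts for illustration only; these are not results from the paper.
toy_records = (
    [("high", True)] * 1 + [("high", False)] * 19   # 1 unsafe out of 20
    + [("low", True)] * 3 + [("low", False)] * 17   # 3 unsafe out of 20
)

rates = unsafe_rate_by_tier(toy_records)
# How many times more often unsafe output appears for the low-resource tier.
low_to_high_ratio = rates["low"] / rates["high"]
```

Reporting the ratio alongside the raw per-tier rates matters because the same ratio can arise from very different absolute rates.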

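Large language models (LLMs) pose potential safety risks, so preventive measures are needed. This study reveals multilingual jailbreak challenges within LLMs and examines both unintentional and intentional risk scenarios. Experimental results show that, in multilingual settings, fine-tuning with data generated by the Self-Defense framework can substantially reduce the unsafe content that LLMs generate.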