Large language models (LLMs), such as ChatGPT, have demonstrated remarkable
capabilities that approach artificial general intelligence. While serving a
wide range of societal needs, LLMs have also lowered the cost of
generating harmful content. Consequently, LLM developers have deployed
semantic-level defenses to recognize and reject prompts that may lead to
inappropriate content. These defenses are not foolproof, however: attackers
have crafted "jailbreak" prompts that temporarily "hypnotize" the LLM into
disregarding its content defense rules and answering improper questions.
To date, neither industry nor academia has offered a clear explanation of the
principles behind these semantic-level attacks and defenses.
This paper investigates the LLM jailbreak problem and proposes what is, to our
knowledge, the first automatic jailbreak method. We introduce the concept of a
semantic firewall and describe three technical approaches to implementing it.
Inspired by attacks that penetrate traditional firewalls through reverse
tunnels, we present a "self-deception" attack that bypasses the semantic
firewall by inducing the LLM to generate prompts that facilitate the jailbreak. We generated a
total of 2,520 attack payloads in six languages (English, Russian, French,
Spanish, Chinese, and Arabic) across seven virtual scenarios, targeting the
three most common types of violations: violence, hate, and pornography. The
experiment was conducted on two models, GPT-3.5-Turbo and GPT-4. The
success rates on the two models were 86.2% and 67%, and the failure rates
were 4.7% and 2.2%, respectively, which demonstrates the effectiveness of the
proposed attack method. All experimental code and raw data will be released as
open source to support future research. We believe that manipulating AI
behavior through carefully crafted prompts will become an important research
direction.

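By studying the methods used to regulate language models and the attacks against them, this paper proposes a method for automatically breaking such regulation: it introduces the concept of a semantic firewall, provides three technical implementation approaches, and thereby successfully carries out a "self-deception" attack. Experiments demonstrate the effectiveness of the method and offer insights for future research.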