In recent years, large language models (LLMs) have demonstrated notable
success across various tasks, but the trustworthiness of LLMs is still an open
problem. One specific threat is the potential to generate toxic or harmful
responses. Attackers can craft adversarial prompts that induce harmful
responses from LLMs. In this work, we pioneer a theoretical foundation in LLM
security by identifying bias vulnerabilities within safety fine-tuning, and we
design a black-box jailbreak method named DRA (Disguise and Reconstruction
Attack), which conceals harmful instructions through disguise and prompts the
model to reconstruct the original harmful instruction within its completion. We
evaluate DRA across various open-source and closed-source models, showcasing
state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA
achieves a 90% attack success rate on the LLM chatbot GPT-4.

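By identifying bias vulnerabilities in safety fine-tuning and designing a black-box jailbreak method called DRA (Disguise and Reconstruction Attack), we pioneer a theoretical foundation for LLM security. We evaluate DRA on various open-source and closed-source models and demonstrate state-of-the-art jailbreak success rates and attack efficiency; in particular, DRA achieves a 90% attack success rate on the LLM chatbot GPT-4.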