We present a powerful jailbreak technique that exploits the ability of large language models (LLMs) to diverge from prior context in order to bypass safety constraints and elicit harmful outputs. By simply instructing the LLM to deviate from and obfuscate previous attacks, our method substantially outperforms existing approaches: it achieves up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This result exposes a critical flaw in current LLM safety training and suggests that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings underscore the urgent need to rethink testing methodologies to ensure robust and reliable LLM security.

Diversity Helps Jailbreak Large Language Models