Despite efforts to align large language models to produce harmless responses,
they are still vulnerable to jailbreak prompts that elicit unrestricted
behaviour. In this work, we investigate persona modulation as a black-box
jailbreaking method to steer a target model to take on personalities that are
willing to comply with harmful instructions. Rather than manually crafting
prompts for each persona, we automate the generation of jailbreaks using a
language model assistant. We demonstrate a range of harmful completions made
possible by persona modulation, including detailed instructions for
synthesising methamphetamine, building a bomb, and laundering money. These
automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is
185 times the rate before modulation (0.23%). These prompts also transfer to
Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%,
respectively. Our work reveals yet another vulnerability in commercial large
language models and highlights the need for more comprehensive safeguards.

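This work investigates persona modulation as a black-box jailbreaking method for steering a target model to take on personalities that are willing to comply with harmful instructions. Using automatically generated jailbreak prompts, we demonstrate a range of harmful completions, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times the rate before modulation (0.23%). The prompts also transfer to Claude 2 and Vicuna, with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.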