Research on jailbreaking has been valuable for testing and understanding the
safety and security issues of large language models (LLMs). In this paper, we
introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach
that leverages the reflective capabilities of LLMs for jailbreaking with only
black-box access. Unlike previous methods, IRIS simplifies the jailbreaking
process by using a single model as both the attacker and target. This method
first iteratively refines adversarial prompts through self-explanation, which
is crucial for ensuring that even well-aligned LLMs obey adversarial
instructions. IRIS then rates and enhances the output given the refined prompt
to increase its harmfulness. We find that IRIS achieves jailbreak success rates
of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries. It significantly
outperforms prior approaches in automatic, black-box and interpretable
jailbreaking, while requiring substantially fewer queries, thereby establishing
a new standard for interpretable jailbreaking methods.

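Summary: By iteratively refining adversarial prompts through self-explanation and leveraging the reflective capabilities of large language models, this work introduces IRIS, a new jailbreaking method that uses the same model as both attacker and target. IRIS increases the harmfulness of the output while reducing the number of queries, substantially improves the efficiency of automatic, black-box, and interpretable jailbreaking, and sets a new standard for interpretable jailbreaking methods.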