The safety alignment of Large Language Models (LLMs) is vulnerable to both
manual and automated jailbreak attacks, which adversarially trigger LLMs to
output harmful content. However, current jailbreak methods, which nest entire
harmful prompts, conceal malicious intent poorly and are easily identified and
rejected by well-aligned LLMs. We find that decomposing a malicious prompt into
separate sub-prompts effectively obscures its underlying intent by presenting
it in a fragmented, less detectable form, thereby addressing this limitation. We
introduce an automatic prompt \textbf{D}ecomposition and
\textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack).
DrAttack includes three key components: (a) `Decomposition' of the original
prompt into sub-prompts, (b) implicit `Reconstruction' of these sub-prompts
through in-context learning with a semantically similar but harmless
reassembly demonstration, and (c) a `Synonym Search' over the sub-prompts that
finds synonyms preserving the original intent while jailbreaking LLMs. An
extensive empirical study across multiple open-source and closed-source LLMs
demonstrates that DrAttack, with significantly fewer queries, achieves a
substantially higher success rate than prior state-of-the-art (SOTA)
prompt-only attacks. Notably, its success rate of 78.0\% on GPT-4 with only 15
queries surpasses the previous best by 33.1\%.

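Summary: This paper proposes DrAttack, an automatic prompt decomposition and reconstruction framework. It decomposes a malicious prompt into sub-prompts and reassembles them through in-context learning and synonym search, which obscures the malicious intent and raises the jailbreak success rate against LLMs. Empirical studies on multiple open-source and closed-source LLMs show that DrAttack significantly reduces the number of queries required and, with only 15 queries, reaches a 78.0\% success rate on GPT-4, surpassing the previous best attack by 33.1\%.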