Recent research indicates that large language models (LLMs) are susceptible
to jailbreaking attacks that cause them to generate harmful content. This paper
introduces a novel token-level attack method, Adaptive Dense-to-Sparse
Constrained Optimization (ADC), which effectively jailbreaks several
open-source LLMs. Our approach relaxes the discrete jailbreak optimization
problem into a continuous one and progressively increases the sparsity of the
optimized vectors. Consequently, our method bridges the gap between discrete
and continuous space optimization. Experimental results
demonstrate that our method is more effective and efficient than existing
token-level methods. On HarmBench, our method achieves state-of-the-art attack
success rates on seven out of eight LLMs. Code will be made available. Trigger
Warning: This paper contains model behavior that can be offensive in nature.
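
The dense-to-sparse idea described above can be illustrated with a toy sketch. The abstract does not specify how sparsity is imposed, so everything here is an assumption made for illustration: a relaxed "token" is modeled as a probability vector over a small vocabulary, sparsity is increased by keeping only the k largest entries and renormalizing, and k follows a hand-picked decreasing schedule. The names `sparsify` and `dense_to_sparse` are hypothetical, and the gradient updates that would run between sparsification steps are omitted.

```python
def sparsify(probs, k):
    """Keep the k largest entries of a probability vector and renormalize.

    Assumption: top-k truncation stands in for whatever sparsity
    constraint the paper actually uses.
    """
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices of the k largest entries; ties go to the lower index.
    keep = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    kept_mass = sum(probs[i] for i in keep)
    if kept_mass <= 0:
        raise ValueError("kept entries must have positive total mass")
    keep_set = set(keep)
    return [p / kept_mass if i in keep_set else 0.0
            for i, p in enumerate(probs)]


def dense_to_sparse(probs, schedule):
    """Apply sparsify once per k in the schedule and record each result.

    In a real optimizer, continuous update steps would run between
    these projections; they are left out of this sketch.
    """
    history = []
    current = list(probs)
    for k in schedule:
        current = sparsify(current, k)
        history.append(current)
    return history


# Toy relaxed token: a dense distribution over a 5-entry vocabulary.
relaxed = [0.1, 0.4, 0.05, 0.3, 0.15]
schedule = [5, 3, 2, 1]  # hypothetical schedule: dense -> one-hot
steps = dense_to_sparse(relaxed, schedule)

# With k = 1 the vector is one-hot, so it maps back to a discrete token.
final_token = steps[-1].index(1.0)
```

Once the schedule reaches k = 1, each vector is one-hot and corresponds to exactly one discrete token, which is the sense in which a progressively sparser relaxation can connect the continuous and discrete problems.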
