Jailbreaks on Large language models (LLMs) have recently received increasing
attention. For a comprehensive assessment of LLM safety, it is essential to
consider jailbreaks with diverse attributes, such as contextual coherence and
sentiment/stylistic variations, and hence it is beneficial to study
controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this
paper, we formally formulate the controllable attack generation problem, and
build a novel connection between this problem and controllable text generation,
a well-explored topic of natural language processing. Based on this connection,
we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a
state-of-the-art, highly efficient algorithm in controllable text generation,
and introduce the COLD-Attack framework which unifies and automates the search
of adversarial LLM attacks under a variety of control requirements such as
fluency, stealthiness, sentiment, and left-right-coherence. The controllability
enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only
cover the standard setting of generating fluent suffix attacks, but also allow
us to address new controllable attack settings such as revising a user query
adversarially with minimal paraphrasing, and inserting stealthy attacks in
context with left-right-coherence. Our extensive experiments on various LLMs
(Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad
applicability, strong controllability, high success rate, and attack
transferability. Our code is available at
this https URL

大型语言模型（LLMs）上的越狱问题近来引起了越来越多的关注，本文提出了可控制的攻击生成问题，并构建了与自然语言处理中可控制文本生成问题之间的联系，通过 COLD-Attack 框架统一并自动化了对各种控制要求下的对抗性 LLM 攻击的搜索，实验证明了其广泛适用性、强大的可控性、高成功率和攻击可迁移性。