Large language models (LLMs) are being rapidly developed, and a key component
of their widespread deployment is their safety-related alignment. Many
red-teaming efforts aim to jailbreak LLMs; among these, the success of the
Greedy Coordinate Gradient (GCG) attack has led to growing interest in the
study of optimization-based jailbreaking techniques. Although GCG is a
significant milestone, its attacking efficiency remains unsatisfactory. In this
paper, we present several improved (empirical) techniques for
optimization-based jailbreaks like GCG. We first observe that the single target
template of "Sure" largely limits the attacking performance of GCG; given this,
we propose to apply diverse target templates containing harmful self-suggestion
and/or guidance to mislead LLMs. In addition, on the optimization side, we
propose an automatic multi-coordinate updating strategy for GCG (i.e.,
adaptively deciding how many tokens to replace in each step) to accelerate
convergence, as well as tricks such as easy-to-hard initialization. We then
combine these improved techniques to develop an efficient jailbreak method,
dubbed $\mathcal{I}$-GCG. In our experiments, we evaluate on a series of
benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate
that our improved techniques can help GCG outperform state-of-the-art
jailbreaking attacks and achieve nearly 100% attack success rate. The code is
released at this https URL

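Proposes improved optimization methods, a multi-coordinate updating strategy, and related techniques for jailbreak attacks on large language models, and demonstrates their effectiveness experimentally.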

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Large Language Models (LLMs) have achieved remarkable success across diverse
tasks, yet they remain vulnerable to adversarial attacks, notably the
well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate
Gradient (GCG) attack has demonstrated efficacy in exploiting this
vulnerability by optimizing adversarial prompts through a combination of
gradient heuristics and greedy search. However, the efficiency of this attack
has become a bottleneck in the attacking process. To mitigate this limitation,
in this paper we rethink the generation of adversarial prompts through an
optimization lens, aiming to stabilize the optimization process and harness
more heuristic insights from previous iterations. Specifically, we introduce
the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack,
which incorporates a momentum term into the gradient heuristic. Experimental
results showcase the notable enhancement achieved by MAC in gradient-based
attacks on aligned language models. Our code is available at
this https URL
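
As a rough illustration of what "a momentum term in the gradient heuristic" can mean, the sketch below blends the current gradient with a running average of earlier ones. The abstract does not state the update rule, so the exponential-moving-average form, the name `momentum_update`, the weight `mu`, and the toy gradient values are all assumptions made for illustration only.

```python
from typing import List, Sequence


def momentum_update(prev: Sequence[float], grad: Sequence[float], mu: float = 0.5) -> List[float]:
    """Blend the previous momentum vector with the current gradient.

    Assumed form (not taken from the paper): m_t = mu * m_{t-1} + (1 - mu) * g_t.
    """
    if len(prev) != len(grad):
        raise ValueError("momentum and gradient must have the same length")
    return [mu * p + (1.0 - mu) * g for p, g in zip(prev, grad)]


# Toy gradients for three iterations; placeholder numbers, not model output.
toy_grads = [[1.0, -2.0, 0.0], [0.0, 2.0, 4.0], [2.0, 0.0, -4.0]]

momentum = [0.0, 0.0, 0.0]
for g in toy_grads:
    # Each step keeps half of the accumulated direction and half of the new gradient.
    momentum = momentum_update(momentum, g, mu=0.5)
```

With `mu = 0.5`, the last coordinate swings from 4.0 to -4.0 in the raw gradients but only from 2.0 to -1.0 in the smoothed vector, which is the kind of damping a momentum term is generally meant to provide.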

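By introducing a momentum term into the gradient heuristic, we propose the Momentum Accelerated GCG (MAC) attack to stabilize the optimization process and draw more heuristic insights from previous iterations; experimental results show that MAC notably enhances gradient-based attacks on aligned language models.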

Boosting Jailbreak Attack with Momentum

Large language models (LLMs) exhibit excellent ability to understand human
languages, but do they also understand their own language that appears
gibberish to us? In this work we delve into this question, aiming to uncover
the mechanisms underlying such behavior in LLMs. We employ the Greedy
Coordinate Gradient optimizer to craft prompts that compel LLMs to generate
coherent responses from seemingly nonsensical inputs. We call these inputs LM
Babel and this work systematically studies the behavior of LLMs manipulated by
these prompts. We find that the manipulation efficiency depends on the target
text's length and perplexity, with the Babel prompts often located in lower
loss minima compared to natural prompts. We further examine the structure of
the Babel prompts and evaluate their robustness. Notably, we find that guiding
the model to generate harmful texts is not more difficult than guiding it to
generate benign texts, suggesting a lack of alignment for out-of-distribution
prompts.

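Large language models can understand human language, but do they also understand their own language, which is unintelligible to us? This work uses the Greedy Coordinate Gradient optimizer to study the behavior of large language models under manipulation, finding that manipulation efficiency depends on the target text's length and perplexity, and that LM Babel prompts are often located in lower loss minima. It also finds that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.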

Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Safety of Large Language Models (LLMs) has become a central issue given their
rapid progress and wide applications. Greedy Coordinate Gradient (GCG) has been
shown to be effective in constructing prompts containing adversarial suffixes
to break presumably safe LLMs, but the optimization of GCG is time-consuming,
which limits its practicality. To reduce the time cost of GCG and enable more
comprehensive studies of LLM safety, in this work, we study a new algorithm
called $\texttt{Probe sampling}$ to accelerate the GCG algorithm. At the core
of the algorithm is a mechanism that dynamically determines how similar a
smaller draft model's predictions are to the target model's predictions for
prompt candidates. When the target model is similar to the draft model, we rely
heavily on the draft model to filter out a large number of potential prompt
candidates to reduce the computation time. Probe sampling achieves a speedup of
up to $5.6$ times using Llama2-7b and leads to equal or improved attack success
rate (ASR) on AdvBench.
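
One way to picture the agreement check described above is a rank correlation between the two models' scores on a small probe set, which then sets how many candidates survive the draft model's filter. The sketch below is an assumption-laden illustration, not the paper's procedure: the abstract names neither the similarity measure nor the filtering rule, so the use of Spearman correlation, the helpers `spearman` and `keep_count`, and all numbers are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Return the 0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation for tie-free inputs of equal length."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def keep_count(agreement: float, total: int, min_keep: int = 1) -> int:
    """Assumed rule: higher agreement lets the draft model discard more candidates."""
    alpha = (1.0 + agreement) / 2.0  # map [-1, 1] onto [0, 1]
    return max(min_keep, round((1.0 - alpha) * total))


# Hypothetical probe-set losses from a draft model and a target model.
draft_losses = [0.9, 0.2, 0.5, 0.7]
target_losses = [1.8, 0.3, 1.1, 1.2]

agreement = spearman(draft_losses, target_losses)
survivors = keep_count(agreement, total=512)
```

Here the two models rank the probe candidates identically, so under this assumed rule the draft model's filter is trusted almost completely and only a minimal number of candidates would go on to the larger model.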

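To reduce the time cost of GCG and speed up progress in LLM safety research, this paper introduces a new algorithm called Probe sampling, which dynamically determines how similar the predictions of a smaller draft model are to those of the target model, achieving up to a 5.6 times speedup with equal or better attack success rate (ASR) on AdvBench.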