Large Language Models (LLMs) have achieved remarkable success across diverse
tasks, yet they remain vulnerable to adversarial attacks, notably the
well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate
Gradient (GCG) attack has demonstrated efficacy in exploiting this
vulnerability by optimizing adversarial prompts through a combination of
gradient heuristics and greedy search. However, the limited efficiency of this
attack has become a bottleneck in the attacking process. To mitigate this limitation,
in this paper we rethink the generation of adversarial prompts through an
optimization lens, aiming to stabilize the optimization process and harness
more heuristic insights from previous iterations. Specifically, we introduce
the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack,
which incorporates a momentum term into the gradient heuristic. Experimental
results showcase the notable enhancement achieved by MAC in gradient-based
attacks on aligned language models. Our code is available at
this https URL
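
The abstract does not state the update rule, so the following is only a minimal sketch of what "a momentum term in the gradient heuristic" could look like, assuming an exponential-moving-average form. The function names, the decay factor `mu`, and the top-k selection step are hypothetical illustrations that operate on toy arrays; they are not taken from the authors' code.

```python
import numpy as np


def momentum_gradient(grads, mu=0.9):
    """Return the momentum-smoothed gradient after each iteration.

    grads: iterable of per-iteration gradient arrays of identical shape.
    mu:    hypothetical decay factor (assumption); mu = 0 recovers the
           plain per-iteration gradient.
    """
    smoothed = []
    g = None
    for grad in grads:
        grad = np.asarray(grad, dtype=float)
        # The first iteration has no history, so start from the raw gradient.
        g = grad if g is None else mu * g + (1.0 - mu) * grad
        smoothed.append(g)
    return smoothed


def top_k_candidates(g, k):
    """Indices of the k most negative entries along the last axis,
    i.e. the coordinates with the largest predicted loss decrease."""
    return np.argsort(g, axis=-1)[..., :k]
```

With `mu = 0.5`, two opposite gradients cancel in the smoothed estimate, which illustrates how momentum can damp oscillation between iterations.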

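By incorporating a momentum term into the gradient heuristic, we propose the Momentum Accelerated GCG (MAC) attack, which stabilizes the optimization process and draws more heuristic insight from previous iterations. Experimental results show that MAC notably strengthens gradient-based attacks on aligned language models.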

Boosting Jailbreak Attack with Momentum

Despite the significant progress made in practical applications of aligned
language models (LMs), they tend to be overconfident in output answers compared
to the corresponding pre-trained LMs. In this work, we systematically evaluate
the impact of the alignment process on logit-based uncertainty calibration of
LMs under the multiple-choice setting. We first conduct a careful empirical
study on how aligned LMs differ in calibration from their pre-trained
counterparts. Experimental results reveal that there are two distinct
uncertainties in LMs under the multiple-choice setting, which are responsible
for the answer decision and the format preference of the LMs, respectively.
Then, we investigate the role of these two uncertainties on aligned LMs'
calibration through fine-tuning in simple synthetic alignment schemes and
conclude that one reason for aligned LMs' overconfidence is the conflation of
these two types of uncertainty. Furthermore, we examine the utility of common
post-hoc calibration methods for aligned LMs and propose an easy-to-implement
and sample-efficient method to calibrate aligned LMs. We hope our findings
could provide insights into the design of more reliable alignment processes for
LMs.
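
The abstract names neither a calibration metric nor the post-hoc methods it examines. As a hedged illustration only, the sketch below computes expected calibration error (ECE) from per-option logits in a multiple-choice question and applies temperature scaling, both standard choices assumed here rather than confirmed by the source. The function names and the `temperature` and `n_bins` parameters are hypothetical, and this is not the method the paper proposes.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def expected_calibration_error(option_logits, labels, temperature=1.0, n_bins=10):
    """ECE for multiple-choice predictions.

    option_logits: array of shape (n_questions, n_options).
    labels:        index of the correct option for each question.
    temperature:   values above 1 flatten the distribution (assumed
                   post-hoc calibration knob, not from the source).
    """
    probs = softmax(np.asarray(option_logits, dtype=float) / temperature)
    labels = np.asarray(labels)
    conf = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```

An overconfident wrong answer yields a large ECE, and raising `temperature` shrinks the gap by lowering the reported confidence without changing which option is selected.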

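Aligned language models tend to be overconfident in their answers under the multiple-choice setting. We systematically evaluate how the alignment process affects the logit-based uncertainty calibration of language models and propose an easy-to-implement, sample-efficient calibration method.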

Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting

Because "out-of-the-box" large language models are capable of generating a
great deal of objectionable content, recent work has focused on aligning these
models in an attempt to prevent undesirable generation. While there has been
some success at circumventing these measures -- so-called "jailbreaks" against
LLMs -- these attacks have required significant human ingenuity and are brittle
in practice. In this paper, we propose a simple and effective attack method
that causes aligned language models to generate objectionable behaviors.
Specifically, our approach finds a suffix that, when attached to a wide range
of queries for an LLM to produce objectionable content, aims to maximize the
probability that the model produces an affirmative response (rather than
refusing to answer). However, instead of relying on manual engineering, our
approach automatically produces these adversarial suffixes by a combination of
greedy and gradient-based search techniques, and also improves over past
automatic prompt generation methods.
Surprisingly, we find that the adversarial prompts generated by our approach
are quite transferable, including to black-box, publicly released LLMs.
Specifically, we train an adversarial attack suffix on multiple prompts (i.e.,
queries asking for many different types of objectionable content), as well as
multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting
attack suffix is able to induce objectionable content in the public interfaces
to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat,
Pythia, Falcon, and others. In total, this work significantly advances the
state-of-the-art in adversarial attacks against aligned language models,
raising important questions about how such systems can be prevented from
producing objectionable information. Code is available at
github.com/llm-attacks/llm-attacks.

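Using a combination of greedy and gradient-based search, we automatically generate adversarial suffixes that attack aligned language models. We find that these attacks transfer to a range of publicly released aligned language models, raising important questions about how to prevent such models from producing objectionable information.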