Ensuring the safety alignment of Large Language Models (LLMs) is crucial to
generating responses consistent with human values. Despite their ability to
recognize and avoid harmful queries, LLMs are vulnerable to "jailbreaking"
attacks, in which carefully crafted prompts induce them to produce toxic content.
One category of jailbreak attacks reformulates the task as an adversarial
attack that elicits an affirmative response from the LLM. However, GCG, the
typical attack in this category, has a very limited attack success rate. In
this study, to better understand jailbreak attacks, we introduce the DSN (Don't
Say No) attack, which not only prompts LLMs to generate affirmative responses
but also extends the objective to suppress refusals. A further challenge in
jailbreak attacks is evaluation, as it is difficult to directly and accurately
assess the harmfulness of an attack. Existing evaluation methods such as
refusal keyword matching are limited, as they produce numerous false positives
and false negatives. To overcome this challenge, we propose an ensemble
evaluation pipeline incorporating Natural Language Inference (NLI)
contradiction assessment and two external LLM evaluators. Extensive experiments
demonstrate the potency of DSN and the effectiveness of the ensemble evaluation
compared to baseline methods.
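
The evaluation issue described above can be illustrated with a minimal sketch. It shows how refusal keyword matching can misjudge a response and how several evaluator verdicts might be combined. The keyword list, the function names, and the majority-vote rule are assumptions made for illustration; the abstract does not state which keywords are used or how the NLI and LLM evaluator outputs are aggregated.

```python
from typing import Sequence

# Hypothetical refusal markers; the abstract does not list the actual keywords.
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def keyword_says_jailbroken(response: str,
                            keywords: Sequence[str] = REFUSAL_KEYWORDS) -> bool:
    """Baseline check: report success if no refusal keyword appears."""
    lowered = response.lower()
    return not any(keyword in lowered for keyword in keywords)


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """Assumed aggregation rule: success if more than half of the verdicts agree."""
    return sum(verdicts) > len(verdicts) / 2


if __name__ == "__main__":
    # A harmless, off-topic reply contains no refusal keyword, so the
    # baseline reports a successful attack (a false positive).
    off_topic = "Here is a recipe for pancakes."
    print(keyword_says_jailbroken(off_topic))

    # Stand-in verdicts for the same reply from three evaluators
    # (NLI check, LLM evaluator 1, LLM evaluator 2); hard-coded for illustration.
    print(majority_vote([False, False, True]))
```

In this sketch the keyword baseline labels the off-topic reply a success, while the assumed majority vote over three independent verdicts does not, which is the kind of disagreement an ensemble is meant to resolve.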

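This work evaluates large language models with the DSN attack and uses an ensemble evaluation method to address the limitations of conventional evaluation methods.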

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Automatic evaluation is an integral aspect of dialogue system research. The
traditional reference-based NLG metrics are generally found to be unsuitable
for dialogue assessment. Consequently, recent studies have suggested various
unique, reference-free neural metrics that better align with human evaluations.
Notable among them, large language models (LLMs), particularly
instruction-tuned variants like ChatGPT, have been shown to be promising substitutes
for human judges. Yet, existing works on utilizing LLMs for automatic dialogue
evaluation are limited in their scope in terms of the number of meta-evaluation
datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains
inconclusive how effective these LLMs are. To this end, we conduct a
comprehensive study on the application of LLMs for automatic dialogue
evaluation. Specifically, we analyze the multi-dimensional evaluation
capability of 30 recently emerged LLMs at both turn and dialogue levels, using
a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the
robustness of the LLMs in handling various adversarial perturbations at both
turn and dialogue levels. Finally, we explore how model-level and
dimension-level ensembles impact the evaluation performance. All resources are
available at this https URL
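
As a rough illustration of a model-level ensemble, the sketch below averages the scores that several evaluator models assign to the same responses and measures agreement with human ratings. Averaging as the ensemble rule, Spearman correlation as the agreement measure, and all names and numbers are assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean
from typing import Dict, List


def ensemble_scores(per_model: Dict[str, List[float]]) -> List[float]:
    """Average the scores that several models assign to the same responses."""
    return [mean(scores) for scores in zip(*per_model.values())]


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up scores from two hypothetical evaluator models on four responses.
    per_model = {
        "model_a": [4.0, 2.0, 5.0, 1.0],
        "model_b": [3.0, 3.0, 4.0, 2.0],
    }
    human = [4.0, 2.0, 5.0, 1.0]  # made-up human ratings
    combined = ensemble_scores(per_model)
    print(combined, spearman(combined, human))
```

A dimension-level ensemble could be sketched the same way by averaging over evaluation dimensions instead of over models.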

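This work presents a comprehensive study of automatic dialogue evaluation, covering the use of large language models, neural metrics, and meta-evaluation datasets, as well as the impact of model-level and dimension-level ensembles on evaluation performance.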