The robustness of large language models (LLMs) against adversarial
manipulations, such as jailbreak attacks, remains a significant challenge. In
this work, we propose an approach that enhances the self-critique capability of
the LLM and further fine-tunes it over sanitized synthetic data. This is done
with the addition of an external critic model that can be merged with the
original, thus bolstering self-critique capabilities and improving the
robustness of the LLM's response to adversarial prompts. Our results demonstrate
that the combination of merging and self-critique can reduce the attack success
rate of adversaries significantly, thus offering a promising defense mechanism
against jailbreak attacks. Code, data, and models are released at
this https URL.
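
The abstract does not say how the critic model is merged with the original. As a minimal sketch, assuming the two models share an architecture and that merging means parameter-wise linear interpolation (a common choice, not confirmed by the text above), the step might look like the following. The names `merge_linear` and `alpha` and the dict-of-lists weight format are hypothetical and used only for illustration.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate parameter-wise: (1 - alpha) * base + alpha * critic.

    Assumption: linear interpolation stands in for the unspecified
    merging method; alpha = 0 keeps the base model, alpha = 1 the critic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        critic_values = critic[name]
        if len(base_values) != len(critic_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base_values, critic_values)
        ]
    return merged
```

With real checkpoints the same loop would run over the tensors of each model's parameter dictionary rather than over Python lists.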

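By merging in a critic model to strengthen self-critique and fine-tuning the large language model (LLM) on sanitized synthetic data, the approach improves the model's self-critique ability and its robustness to adversarial prompts. This significantly reduces the attack success rate and offers a promising defense mechanism against jailbreak attacks.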

Merging Improves Self-Critique Against Jailbreak Attacks

Critical thinking is essential for rational decision-making and
problem-solving. This skill hinges on the ability to provide precise and
reasoned critiques and is a hallmark of human intelligence. In the era of large
language models (LLMs), this study explores the ability of LLMs to deliver
accurate critiques across various tasks. We are interested in this topic as a
capable critic model could not only serve as a reliable evaluator, but also as
a source of supervised signals for model tuning. Particularly, if a model can
self-critique, it has the potential for autonomous self-improvement. To examine
this, we introduce a unified evaluation framework for assessing the critique
abilities of LLMs. We develop a benchmark called CriticBench, which comprises
3K high-quality natural language queries and corresponding model responses, each
annotated for correctness. The benchmark covers tasks such as
math problem-solving, code completion, and question answering. We evaluate
multiple LLMs on the collected dataset and our analysis reveals several
noteworthy insights: (1) Critique is generally challenging for most LLMs, and
this capability often emerges only when models are sufficiently large. (2) In
particular, self-critique is especially difficult. Even top-performing LLMs
struggle to achieve satisfactory performance. (3) Models tend to have lower
critique accuracy on problems where they are most uncertain. To this end, we
introduce a simple yet effective baseline named self-check, which leverages
self-critique to improve task performance for various models. We hope this
study serves as an initial exploration into understanding the critique
abilities of LLMs and that it informs future research, including the
development of more proficient critic models and the application of critiques
across diverse tasks.
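
The abstract names the self-check baseline but gives no procedure. One plausible reading, offered only as a sketch, is a loop in which the model answers, critiques its own answer, and retries when the critique rejects it. The function `self_check`, the `generate` and `critique` callables, and the `max_attempts` parameter are all hypothetical; in practice both callables would wrap calls to the same LLM.

```python
from typing import Callable, Optional


def self_check(
    query: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first answer the critic accepts.

    Assumption: if no answer passes within max_attempts, the last
    candidate is returned rather than nothing.
    """
    answer: Optional[str] = None
    for _ in range(max_attempts):
        answer = generate(query)      # candidate response to the query
        if critique(query, answer):   # model judges its own answer
            return answer
    return answer
```

Passing the two roles as separate callables keeps the sketch independent of any particular model API and makes it easy to swap in an external critic for comparison.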

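This study explores the critique abilities of large language models and develops a framework for evaluating them. It finds that critique is generally challenging for most models and that self-critique is especially difficult. The study also introduces a simple yet effective baseline named self-check, which improves task performance for various models. The authors hope the study serves as an initial exploration into the critique abilities of large language models and helps guide future research and the application of critiques across diverse tasks.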