The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification, and we demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.

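We propose a new perspective that attributes the vulnerability of large language models to reward misspecification during the alignment process, and we introduce ReGap, a metric that quantifies the extent of this misspecification. Building on this, we present ReMiss, an automated red-teaming system that generates adversarial prompts against a variety of aligned target LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while keeping the generated prompts human-readable. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.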

Jailbreaking as a Reward Misspecification Problem