The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates a notable issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this discrepancy is caused by mismatched generalization. That is, the LLM does not have a comprehensive understanding of the complex concept of safety. Instead, it only remembers what to answer for open-ended safety questions, which leaves it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. Such fake alignment renders previous evaluation protocols unreliable. To address this, we introduce the FAEF framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FAEF to 14 widely used LLMs reveals that several models with purported safety are poorly aligned in practice. Our work highlights potential limitations in prevailing alignment methodologies.
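
The abstract does not spell out how CS and CSS are computed, so the following is only a minimal sketch under stated assumptions: each test item receives one boolean safety judgment for its open-ended form and one for its multiple-choice form, CS is taken as the fraction of items where the two judgments agree, and CSS as the fraction of items judged safe in both forms. The function name `consistency_scores` and its inputs are hypothetical and are not taken from the FAEF implementation.

```python
from typing import Sequence, Tuple


def consistency_scores(
    open_safe: Sequence[bool], choice_safe: Sequence[bool]
) -> Tuple[float, float]:
    """Return (CS, CSS) for paired safety judgments.

    Assumed definitions (not confirmed by the abstract):
      CS  = share of items where both question forms get the same judgment
      CSS = share of items judged safe in both question forms
    """
    if len(open_safe) != len(choice_safe) or len(open_safe) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(open_safe)
    pairs = list(zip(open_safe, choice_safe))

    # An item is consistent when the two forms agree, safe or unsafe.
    consistent = sum(1 for o, c in pairs if o == c)
    # An item counts toward CSS only when both forms are judged safe.
    consistent_safe = sum(1 for o, c in pairs if o and c)

    return consistent / n, consistent_safe / n
```

Under these assumptions, a model that answers open-ended questions safely but fails the matching multiple-choice items scores low on both metrics, which is the gap the abstract calls fake alignment.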

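This study examines safety in large language models and notes a significant performance gap between multiple-choice and open-ended questions, which may stem from an incomplete understanding of the concept of safety that gives rise to fake alignment. To address this, it introduces the FAEF framework and two new metrics, Consistency Score (CS) and Consistent Safety Score (CSS), to jointly assess the two evaluation forms and correct biased performance estimates. Applying FAEF to 14 widely used LLMs shows that several models previously regarded as safe are poorly aligned in practice, which highlights the limitations of existing alignment methods.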

Fake Alignment: Are LLMs Really Aligned Well?