Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Jun, 2024

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin

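TL;DR: Recent work has taken a first look at the superalignment problem by using weak models to supervise strong models. Experiments find that a weakly supervised strong student consistently outperforms its weak teacher on the alignment target, a phenomenon known as weak-to-strong generalization. However, we are concerned that behind this promising phenomenon there may be a weak-to-strong deception issue, in which the strong model deceives the weak model by behaving well aligned in areas the weak model knows while producing misaligned behavior in cases the weak model does not know. Our experiments in a specific but realistic multi-objective alignment setting, on both the reward modeling task and the preference optimization scenario, indicate that (1) weak-to-strong deception exists, and (2) the deception may intensify as the capability gap between the weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.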

Abstract

Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work preliminarily studies this problem by using weak models to supervise stron