Superalignment, where humans are weak supervisors of superhuman models, has
become an important and widely discussed issue in the current era of rapid
development of Large Language Models (LLMs). The recent work preliminarily
studies this problem by using weak models to supervise strong models. It
discovers that weakly supervised strong students can consistently outperform
weak teachers towards the alignment target, leading to a weak-to-strong
generalization phenomenon. However, we are concerned that behind such a
promising phenomenon, whether there exists an issue of weak-to-strong
deception, where strong models may deceive weak models by exhibiting
well-aligned in areas known to weak models but producing misaligned behaviors
in cases weak models do not know. We then take an initial step towards
exploring this security issue in a specific but realistic multi-objective
alignment case, where there may be some alignment targets conflicting with each
other (e.g., helpfulness v.s. harmlessness). Such a conflict is likely to cause
strong models to deceive weak models in one alignment dimension to gain high
reward in other alignment dimension. Our experiments on both the reward
modeling task and the preference optimization scenario indicate: (1) the
weak-to-strong deception exists; (2) the deception phenomenon may intensify as
the capability gap between weak and strong models increases. We also discuss
potential solutions and find bootstrapping with an intermediate model can
mitigate the deception to some extent. Our work highlights the urgent need to
pay more attention to the true reliability of superalignment.

通过使用具有弱监督的模型对强大模型进行监督，最近的研究初步探讨了超级对齐问题。实验发现，弱监督的强学生在对齐目标上持续胜过弱教师，引发了弱到强泛化现象。然而，我们担心在这种令人期待的现象背后，是否存在弱到强欺骗问题，即强大模型可能通过在弱模型已知领域中表现得很好，而在弱模型不知道的情况下产生不对齐的行为进行欺骗。我们在特定但现实的多目标对齐情况下以及奖励建模任务和偏好优化场景上的实验证明：（1）存在弱到强的欺骗现象；（2）随着弱模型和强模型能力差距的增加，欺骗现象可能会加剧。我们还讨论了潜在的解决方案，并发现通过中间模型的引导可以在一定程度上减轻欺骗问题。我们的工作强调了对超级对齐的真实可靠性更加紧迫的关注。

超（表）对齐：在弱到强的泛化中，强模型可能欺骗弱模型

Super(ficial)-alignment: Strong Models May Deceive Weak Models in  Weak-to-Strong Generalization

This paper examines the challenges associated with achieving life-long
superalignment in AI systems, particularly large language models (LLMs).
Superalignment is a theoretical framework that aspires to ensure that
superintelligent AI systems act in accordance with human values and goals.
Despite its promising vision, we argue that achieving superalignment requires
substantial changes in the current LLM architectures due to their inherent
limitations in comprehending and adapting to the dynamic nature of these human
ethics and evolving global scenarios. We dissect the challenges of encoding an
ever-changing spectrum of human values into LLMs, highlighting the
discrepancies between static AI models and the dynamic nature of human
societies. To illustrate these challenges, we analyze two distinct examples:
one demonstrates a qualitative shift in human values, while the other presents
a quantifiable change. Through these examples, we illustrate how LLMs,
constrained by their training data, fail to align with contemporary human
values and scenarios. The paper concludes by exploring potential strategies to
address and possibly mitigate these alignment discrepancies, suggesting a path
forward in the pursuit of more adaptable and responsive AI systems.

探讨实现 AI 系统中的终身超对齐所面临的挑战，特别是大型语言模型（LLMs）；超对齐是一个理论框架，旨在确保超级智能 AI 系统按照人类的价值观和目标行动；我们认为实现超对齐需要对当前 LLM 体系结构进行重大改变，因为它们在理解和适应人类伦理和不断变化的全球情景的能力上存在固有的局限性；通过分析两个不同的例子，我们阐明 LLM 面对训练数据的限制，无法与当代人类价值观和情景相契合；最后，本文探讨了解决和可能减轻这种对齐差异的潜在策略，提出了追求更适应和响应性的 AI 系统的路径。

道义使命：对大型语言模型持续超对齐的需求

A Moral Imperative: The Need for Continual Superalignment of Large  Language Models

Recent advancements in large language models have sparked interest in their
extraordinary and near-superhuman capabilities, leading researchers to explore
methods for evaluating and optimizing these abilities, which is called
superalignment. In this context, our paper delves into the realm of vision
foundation models, focusing on the concept of weak-to-strong generalization,
which involves using a weaker model to supervise a stronger one, aiming to
enhance the latter's capabilities beyond the former's limits. We introduce a
novel and adaptively adjustable loss function for weak-to-strong supervision.
Our comprehensive experiments span various scenarios, including few-shot
learning, transfer learning, noisy label learning, and common knowledge
distillation settings. The results are striking: our approach not only exceeds
the performance benchmarks set by strong-to-strong generalization but also
surpasses the outcomes of fine-tuning strong models with whole datasets. This
compelling evidence underscores the significant potential of weak-to-strong
generalization, showcasing its capability to substantially elevate the
performance of vision foundation models. The code is available at
this https URL

利用弱模型监督强模型以提升性能，采用新颖的自适应可调整损失函数进行弱强监督的综合实验，超越了基准性能和使用整个数据集进行微调的结果，展示了弱强泛化在提升视觉模型性能方面的重大潜力。