Current AI alignment methodologies rely on human-provided demonstrations or
judgments, so the learned capabilities of AI systems are upper-bounded by
human capabilities. This raises a challenging research question:
How can we keep improving these systems once their capabilities have surpassed
those of humans? This paper answers this question in the context of
tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from
human annotations on easier tasks (e.g., level 1-3 MATH problems), which we
term \textit{easy-to-hard generalization}. Our key insight is that an
evaluator (reward model) trained on supervision for easier tasks can be
effectively used to score candidate solutions to harder tasks and hence
facilitate easy-to-hard generalization over different levels of tasks. Based
on this insight, we propose a novel approach to scalable alignment, which
first trains process-supervised reward models on easy problems (e.g.,
level 1-3), and then uses them to evaluate the performance of policy models on
hard problems. We show that such \textit{easy-to-hard generalization from
evaluators} can enable \textit{easy-to-hard generalizations in generators}
either through re-ranking or reinforcement learning (RL). Notably, our
process-supervised 7B RL model achieves an accuracy of 34.0\% on MATH500,
despite only using human supervision on easy problems. Our approach suggests a
promising path toward AI systems that advance beyond the frontier of human
supervision.
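
The re-ranking route described above can be illustrated with a minimal sketch. It assumes a hypothetical scorer interface in which a reward model returns one score per reasoning step; `toy_scorer`, `solution_score`, and `rerank_best_of_n` are illustrative names, and the min-based aggregation is an assumption, not something the abstract specifies.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a process reward model maps the steps of one
# candidate solution to per-step scores in [0, 1].
StepScorer = Callable[[Sequence[str]], List[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one value (min is an assumption here)."""
    return min(step_scores) if step_scores else 0.0


def rerank_best_of_n(candidates: Sequence[Sequence[str]], scorer: StepScorer) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    scores = [solution_score(scorer(steps)) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_scorer(steps: Sequence[str]) -> List[float]:
    # Stand-in for a reward model trained on easy problems:
    # steps marked with "?" are treated as doubtful and scored low.
    return [0.2 if "?" in step else 0.9 for step in steps]


# Two sampled solutions to a (toy) hard problem, split into steps.
candidates = [
    ["x = 2 + 3", "x = 6 ?"],
    ["x = 2 + 3", "x = 5"],
]
best = rerank_best_of_n(candidates, toy_scorer)
print(best)
```

In a real setup the toy scorer would be replaced by the trained evaluator, and the candidates would be samples drawn from the policy model on hard problems.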

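Through easy-to-hard generalization and the use of evaluators, this paper proposes a scalable AI alignment method for tackling hard reasoning tasks that lie beyond the level of human supervision, improving the accuracy of generator models on math problems.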

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

While recent advances have boosted LM proficiency in linguistic benchmarks,
LMs consistently struggle to reason correctly on complex tasks like
mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a
method with which to shape model reasoning processes. In particular, we explore
two reward schemes, outcome-supervised reward models (ORMs) and
process-supervised reward models (PRMs), to optimize for logical reasoning. Our
results show that the fine-grained reward provided by PRM-based methods
enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly,
reducing performance on complex tasks (MATH). Furthermore, we show the critical
role reward aggregation functions play in model performance. Providing
promising avenues for future research, our study underscores the need for
further exploration into fine-grained reward modeling for more reliable
language models.
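
To make the role of reward aggregation concrete, the sketch below compares four common ways of collapsing per-step PRM scores into a single solution-level reward. The abstract does not say which aggregation functions the study compares, so the set here (min, product, last step, mean), the names, and the toy scores are all assumptions chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Aggregator = Callable[[Sequence[float]], float]

# Candidate aggregation functions (assumed, not taken from the paper).
AGGREGATORS: Dict[str, Aggregator] = {
    "min": min,                          # weakest step decides
    "prod": math.prod,                   # every step must hold up
    "last": lambda s: s[-1],             # only the final step counts
    "mean": lambda s: sum(s) / len(s),   # average step quality
}


def pick_best(candidates: Sequence[Sequence[float]], aggregate: Aggregator) -> int:
    """Index of the candidate whose aggregated step scores are highest."""
    scores: List[float] = [aggregate(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy per-step scores for two solutions to the same problem.
solution_a = [0.9, 0.5, 0.8]   # one weak middle step, strong finish
solution_b = [0.7, 0.7, 0.6]   # uniformly moderate steps

choices = {
    name: pick_best([solution_a, solution_b], fn)
    for name, fn in AGGREGATORS.items()
}
print(choices)
```

With these toy scores, the min aggregator prefers the second solution while the other three prefer the first, which shows how the choice of aggregation alone can change which solution is rewarded.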

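Using reinforcement learning from human feedback, this study explores two reward schemes, outcome-supervised reward models and process-supervised reward models, to optimize the logical reasoning ability of language models. The results show that process-supervised methods improve accuracy on simple mathematical reasoning but unexpectedly reduce performance on complex tasks. The study also finds that reward aggregation functions play a critical role in model performance, and it stresses the need for further research into fine-grained reward modeling to improve the reliability of language models.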