How can we train models to perform well on hard test data when hard training
data is by definition difficult to label correctly? This question has been
termed the scalable oversight problem and has drawn increasing attention as
language models have continually improved. In this paper, we present the
surprising conclusion that current language models often generalize relatively
well from easy to hard data, even performing as well as "oracle" models trained
on hard data. We demonstrate this kind of easy-to-hard generalization using
simple training methods like in-context learning, linear classifier heads, and
QLoRA for seven different measures of datapoint hardness, including six
empirically diverse human hardness measures (like grade level) and one
model-based measure (loss-based). Furthermore, we show that even if one cares
most about model performance on hard data, it can be better to collect and
train on easy data rather than hard data, since hard data is generally noisier
and costlier to collect. Our experiments use open models up to 70B parameters and
four publicly available question-answering datasets with questions ranging in
difficulty from 3rd grade science questions to college level STEM questions and
general-knowledge trivia. We conclude that easy-to-hard generalization in LMs
is surprisingly strong for the tasks studied, suggesting the scalable oversight
problem may be easier than previously thought. Our code is available at
this https URL

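Summary: Through experiments on easy and hard data using simple training methods, linear classifier heads, and QLoRA, across several different hardness measures, we find that easy-to-hard generalization in language models is surprisingly strong, suggesting that the scalable oversight problem may be easier than previously thought.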