When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, reasoning under the weak-to-strong setting currently lacks efficient methods to keep the strong model from blindly imitating the weak supervisor, errors included. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. The framework begins with supervised fine-tuning on a small, selectively chosen, high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b under supervision from three separate weak models. The method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning capabilities. All relevant code and resources are available at \url{https://github.com/GAIR-NLP/weak-to-strong-reasoning}.

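Through a progressive learning framework, this paper proposes a method that enables the strong model to autonomously refine its training data. The method begins with supervised fine-tuning on a small, selected, high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets show that the method significantly improves the reasoning capabilities of Llama2-70b when supervised by three different weak models. Its effectiveness is further validated on the challenging OlympicArena dataset, where Llama3-8b-instruct effectively supervises Llama3-70b. This work offers a more scalable and sophisticated strategy for enhancing AI reasoning capabilities.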
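
The two stages described above can be illustrated with a minimal sketch of how weak-supervised data might be split into a fine-tuning set and a set of contrastive pairs. This is not the procedure from the paper: the selection rule (agreement between the weak answer and the strong model's majority answer), the confidence threshold `min_conf`, and the names `Sample` and `split_for_training` are all assumptions introduced purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """Hypothetical record: one question with weak and strong model outputs."""
    question: str
    weak_answer: str            # final answer from the weak supervisor
    strong_answers: List[str]   # final answers sampled from the strong model


def split_for_training(
    samples: List[Sample], min_conf: float = 0.6
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Split samples into an SFT set and contrastive (chosen, rejected) pairs.

    Assumed heuristic (not taken from the paper):
      * Stage 1 (SFT): keep a sample when the weak answer matches the
        strong model's majority answer.
      * Stage 2 (preference): when they disagree and the strong model is
        confident, treat its majority answer as "chosen" and the weak
        answer as "rejected".
    """
    sft: List[Tuple[str, str]] = []
    pairs: List[Tuple[str, str, str]] = []
    for s in samples:
        if not s.strong_answers:
            continue  # nothing to compare against
        top, count = Counter(s.strong_answers).most_common(1)[0]
        confidence = count / len(s.strong_answers)
        if top == s.weak_answer:
            sft.append((s.question, top))
        elif confidence >= min_conf:
            pairs.append((s.question, top, s.weak_answer))
        # otherwise the sample is discarded as too uncertain
    return sft, pairs


if __name__ == "__main__":
    demo = [
        Sample("q1", "4", ["4", "4", "5"]),   # agreement: goes to SFT
        Sample("q2", "7", ["9", "9", "9"]),   # confident disagreement: pair
        Sample("q3", "1", ["2", "3", "4"]),   # uncertain: dropped
    ]
    sft_set, pref_pairs = split_for_training(demo)
    print(sft_set)
    print(pref_pairs)
```

The point of the sketch is only that neither set requires gold labels or a stronger model: both are derived from the weak supervisor's outputs and the strong model's own samples.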

Weak-to-Strong Reasoning