Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
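The three steps in the abstract (generate and filter, fine-tune, repeat) can be illustrated with a minimal toy sketch. This is not the authors' implementation: it assumes a tabular "policy" (one categorical distribution over candidate answers per problem) in place of a language model, simple addition problems in place of MATH or APPS, and a smoothed count-based refit in place of gradient-based fine-tuning. All function names and parameters (`generate`, `binary_reward`, `fine_tune`, `rest_em`, `smoothing`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def generate(policy, problem, n, rng):
    """Step 1a: sample n candidate answers from the current policy."""
    answers, weights = zip(*policy[problem].items())
    return rng.choices(answers, weights=weights, k=n)


def binary_reward(problem, answer):
    """Binary feedback: 1 if the answer is verifiably correct, else 0.

    Assumption: a problem is a pair (a, b) and the correct answer is a + b.
    """
    a, b = problem
    return 1 if answer == a + b else 0


def fine_tune(policy, filtered, smoothing=0.1):
    """Step 2: refit each problem's answer distribution on the kept samples.

    Problems with no kept samples keep their previous distribution.
    """
    new_policy = {}
    for problem, dist in policy.items():
        kept = filtered.get(problem)
        if not kept:
            new_policy[problem] = dict(dist)
            continue
        counts = Counter(kept)
        total = sum(counts.values()) + smoothing * len(dist)
        new_policy[problem] = {
            ans: (counts[ans] + smoothing) / total for ans in dist
        }
    return new_policy


def rest_em(policy, n_iterations=3, n_samples=64, seed=0):
    """Step 3: repeat generate -> filter -> fine-tune a few times."""
    rng = random.Random(seed)
    for _ in range(n_iterations):
        filtered = {}
        for problem in policy:
            samples = generate(policy, problem, n_samples, rng)
            # Step 1b: keep only samples that receive reward 1.
            kept = [s for s in samples if binary_reward(problem, s) == 1]
            if kept:
                filtered[problem] = kept
        policy = fine_tune(policy, filtered)
    return policy


def accuracy(policy):
    """Average probability the policy assigns to the correct answer."""
    return sum(
        dist.get(a + b, 0.0) for (a, b), dist in policy.items()
    ) / len(policy)


# Toy setup: a uniform initial policy over the answers 0..9.
problems = [(1, 2), (3, 4), (2, 2)]
initial_policy = {p: {ans: 0.1 for ans in range(10)} for p in problems}
final_policy = rest_em(initial_policy)
```

In this toy setting the filter uses only the binary reward, so no human-written solutions enter the loop after initialization; the sketch says nothing about how the method behaves at the scale studied in the paper.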

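This paper investigates ReST$^{EM}$, a simple self-training method based on expectation-maximization. Fine-tuning PaLM-2 models with this method on math problem-solving and coding benchmarks scales favorably with model size and significantly surpasses fine-tuning on human data alone. Overall, the results suggest that self-training with feedback can substantially reduce dependence on human-generated data.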

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models