自主探索避免陷阱：以细粒度奖励提升语言模型的推理能力

Apr, 2024

Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards

Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, Minjoon Seo

TL;DR通过自主探索（Self-Explore）的方法，研究自动增强规划模型（LLMs）的推理能力，并与监督式微调相比，在 GSM8K 和 MATH 测试集上分别平均取得11.57％和2.89％的改进。

Abstract

Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models