Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data because their complexity is controllable and their answers are straightforward to verify. Three technical contributions lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills (such as reflection, verification, and summarization) that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it generalizes to the challenging math benchmarks AIME and AMC.
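
To make the idea of a stringent format reward concrete, here is a minimal sketch of one possible rule-based reward. The abstract does not specify the details, so the tag names (`<think>`, `<answer>`), the reward values, and the function names `format_reward` and `total_reward` are all illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: tag names and reward values are assumptions,
# not taken from the abstract.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
ANSWER_OPEN, ANSWER_CLOSE = "<answer>", "</answer>"
TAGS = [THINK_OPEN, THINK_CLOSE, ANSWER_OPEN, ANSWER_CLOSE]


def format_reward(response: str) -> float:
    """Return +1.0 for one well-ordered think/answer pair, else -1.0."""
    # Each tag must appear exactly once (no skipped or repeated sections).
    if any(response.count(tag) != 1 for tag in TAGS):
        return -1.0
    positions = [response.find(tag) for tag in TAGS]
    # The thinking block must come entirely before the answer block.
    if positions != sorted(positions):
        return -1.0
    thinking = response[positions[0] + len(THINK_OPEN):positions[1]]
    answer = response[positions[2] + len(ANSWER_OPEN):positions[3]]
    # An empty thinking or answer section counts as a shortcut.
    if not thinking.strip() or not answer.strip():
        return -1.0
    return 1.0


def total_reward(response: str, expected: str) -> float:
    """Combine the format reward with a simple exact-match answer check."""
    fmt = format_reward(response)
    if fmt < 0:
        # Malformed outputs are never scored for correctness.
        return fmt
    start = response.find(ANSWER_OPEN) + len(ANSWER_OPEN)
    answer = response[start:response.find(ANSWER_CLOSE)]
    correct = answer.strip().lower() == expected.strip().lower()
    return fmt + (1.0 if correct else -1.0)
```

In this sketch a response that answers correctly but skips the thinking section still receives the format penalty, which is one way a rule-based reward can discourage shortcuts. Logic puzzles fit this setup because the expected answer can be checked by a simple string comparison.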

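This work addresses the lack of effective reasoning ability that large reasoning models show during training. It proposes a new rule-based reinforcement learning approach that achieves stable convergence through a system prompt, a strict reward function, and a simple training recipe. The study shows that, after training on only 5,000 logic problems, the model generalizes well to challenging math benchmarks.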

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning