Reasoning abilities are typically aligned between smaller and larger language models through Supervised Fine-Tuning (SFT) on demonstrations generated by robust Large Language Models (LLMs). Although these approaches yield more performant models, they do not generalize sufficiently well, because training relies only on the provided demonstrations. In this paper, we propose the Self-refine Instruction-tuning method, which elicits Small Language Models (SLMs) to self-refine their abilities. Our approach is a two-stage process: reasoning abilities are first transferred from LLMs to SLMs via Instruction-tuning on demonstrations provided by LLMs, and the instructed models then self-refine their abilities through preference optimization strategies. In particular, the second phase applies refinement heuristics based on the Direct Preference Optimization (DPO) algorithm: the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses, and rewards are assigned using ground truths from the LLMs. Results on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-of-domain scenarios, aligning the reasoning abilities of smaller and larger language models.
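
The second stage can be illustrated with a minimal sketch, under stated assumptions. The abstract does not specify how sampled responses become preference pairs, so the pairing rule below (a sampled reasoning path is "chosen" if its final answer matches the LLM-provided ground truth, and "rejected" otherwise) is an assumption made for illustration, as are the function names `build_preference_pairs` and `dpo_loss` and the default `beta`. The loss is the standard Direct Preference Optimization objective for a single pair, written over sequence log-probabilities.

```python
import math


def build_preference_pairs(sampled, reference_answer):
    """Pair sampled reasoning paths into (chosen, rejected) tuples.

    `sampled` is a list of (reasoning_path, final_answer) tuples drawn from
    the instructed SLM. ASSUMPTION: a path counts as "chosen" when its final
    answer equals the LLM-provided reference answer; the paper's actual
    heuristic may differ.
    """
    chosen = [path for path, answer in sampled if answer == reference_answer]
    rejected = [path for path, answer in sampled if answer != reference_answer]
    return [(c, r) for c in chosen for r in rejected]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. `beta` (hypothetical default) scales the
    implicit reward margin.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen path's likelihood relative to the rejected one.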

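We propose the Self-refine Instruction-tuning method, which further develops reasoning abilities by guiding smaller language models to refine themselves. The method transfers reasoning abilities from larger to smaller language models on the basis of demonstrations provided by LLMs, and then uses optimization strategies so that the instructed models self-refine their abilities. Results on commonsense and math reasoning tasks show that the method significantly outperforms Instruction-tuning in both in-domain and out-of-domain scenarios and brings the reasoning abilities of smaller and larger language models progressively closer.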

Self-refine Instruction-tuning for Aligning Reasoning in Language Models