Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose \ours, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of \ours, with an absolute improvement of $4.2$ points for Llama-3.1-8B-Instruct. Furthermore, \ours achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
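
The sample-then-score step described above can be illustrated with a minimal sketch. It assumes a simple token-overlap (Jaccard) similarity as the utility function, which is a stand-in chosen only to keep the example self-contained; the abstract does not specify the utility used by \ours. The function names `jaccard`, `mbr_scores`, and `build_preference_pair` are hypothetical, and taking the highest- and lowest-scoring samples as a preference pair is one plausible reading rather than the method's confirmed selection rule.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the real utility function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(
    outputs: List[str],
    utility: Callable[[str, str], float] = jaccard,
) -> List[float]:
    """Minimum Bayes Risk score: mean utility of each output against all the others.

    An output that agrees with most other samples has low expected risk
    and therefore receives a high score.
    """
    n = len(outputs)
    if n < 2:
        raise ValueError("MBR scoring needs at least two sampled outputs")
    scores = []
    for i, candidate in enumerate(outputs):
        total = sum(
            utility(candidate, other)
            for j, other in enumerate(outputs)
            if j != i
        )
        scores.append(total / (n - 1))
    return scores


def build_preference_pair(
    outputs: List[str],
    utility: Callable[[str, str], float] = jaccard,
) -> Tuple[str, str]:
    """Return (chosen, rejected) as the highest- and lowest-scoring outputs.

    This pairing rule is an assumption made for illustration.
    """
    scores = mbr_scores(outputs, utility)
    best = max(range(len(outputs)), key=scores.__getitem__)
    worst = min(range(len(outputs)), key=scores.__getitem__)
    return outputs[best], outputs[worst]


# Three hypothetical sampled answers to the same long-context question.
samples = [
    "the answer is paris",
    "paris is the answer",
    "the answer is london",
]
chosen, rejected = build_preference_pair(samples)
```

In this sketch, the top-scoring output could serve as a supervised fine-tuning target, while the (chosen, rejected) pair could feed a preference-optimization objective. No human or stronger-model annotation is needed, since the scores come only from agreement among the model's own samples.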

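This work addresses the limitations of large language models in long-context reasoning. Conventional methods rely on human annotations or data from advanced models for fine-tuning, whereas we propose \ours, a self-improvement approach that scores and optimizes over multiple sampled outputs and substantially improves long-context reasoning, raising the performance of Llama-3.1-8B-Instruct by 4.2 points. This study opens a new direction for self-improvement techniques in long-context scenarios and supports the continual advancement of large language models.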

Large Language Models Can Self-Improve in Long-Context Reasoning