A semi-structured explanation depicts the implicit process of a reasoner with
an explicit representation. Such an explanation highlights how the information
available in a specific query is supplemented with information the reasoner
produces from its internal weights to generate an answer. Despite the
recent improvements in the generative capabilities of language models, producing
structured explanations to verify a model's true reasoning capabilities remains a
challenge. This issue is particularly pronounced for not-so-large LMs, as the
reasoner is expected to couple a sequential answer with a structured
explanation which embodies both the correct presentation and the correct
reasoning process. In this work, we first underscore the limitations of
supervised fine-tuning (SFT) in tackling this challenge, and then introduce a
carefully crafted reward engineering method in reinforcement learning (RL) to
better address this problem. We investigate multiple reward aggregation methods
and provide a detailed discussion that sheds light on the potential of RL for
future research. With our proposed reward, we achieve new state-of-the-art
results on two semi-structured explanation generation benchmarks (ExplaGraph
and COPA-SSE).
