Chain-of-Thought (CoT) prompting has marked a significant advancement in
enhancing the reasoning capabilities of large language models (LLMs). Previous
studies have developed various extensions of CoT, which focus primarily on
enhancing end-task performance. In addition, there has been research on
assessing the quality of reasoning chains in CoT. This raises an intriguing
question: Is it possible to predict the accuracy of LLM outputs by scrutinizing
the reasoning chains they generate? To answer this research question, we
introduce a benchmark, R2PE, designed specifically to explore the relationship
between reasoning chains and performance in various reasoning tasks spanning
five different domains. The benchmark is designed to detect falsehood in the
final output of LLMs from their reasoning steps. To make full use of
information in multiple reasoning chains, we propose the process discernibility
score (PDS) framework, which beats the answer-checking baseline by a large
margin: an average 5.1% increase in F1 score across all 45 subsets of R2PE.
We further demonstrate the efficacy of PDS in improving open-domain QA
accuracy. Data and code are available at
this https URL

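By studying the relationship between reasoning chains and performance, we
introduce R2PE, a benchmark designed specifically to explore the relationship
between reasoning chains and performance on different reasoning tasks across
domains. The benchmark aims to measure the falsehood of the final output of
large language models from their reasoning steps. We propose the process
discernibility score (PDS) framework, which makes full use of the information
in multiple reasoning chains and, compared with the answer-checking baseline,
raises the F1 score across all 45 subsets of R2PE by about 5.1% on average. We
further demonstrate the efficacy of PDS in improving open-domain QA accuracy.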
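
To make the evaluation setting concrete, the following is a minimal sketch of what an answer-checking style baseline and its F1 evaluation could look like. It is an illustration under stated assumptions, not the method or code released with R2PE: the abstract does not define the baseline or PDS, so the agreement rule, the function names (`agreement_ratio`, `flag_false`, `f1_score`), and the `threshold` value are all hypothetical. The sketch assumes several reasoning chains are sampled per question and only their final answers are compared.

```python
from collections import Counter


def agreement_ratio(answers):
    """Fraction of sampled final answers that match the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def flag_false(answers, threshold=0.6):
    """Predict 'likely false' when sampled answers agree less than threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return agreement_ratio(answers) < threshold


def f1_score(predicted, actual):
    """F1 for the positive class, where True means 'the output is false'."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical final answers from five sampled reasoning chains each.
    sampled = [
        ["12", "12", "12", "12", "15"],  # high agreement
        ["7", "9", "4", "7", "3"],       # low agreement
    ]
    predictions = [flag_false(answers) for answers in sampled]
    # Hypothetical gold labels: True means the model's output was wrong.
    gold = [False, True]
    print(predictions, f1_score(predictions, gold))
```

A check of this kind uses only the final answers. A score that also draws on the content of the reasoning steps, as the abstract describes for PDS, would replace `flag_false` while leaving the F1 evaluation unchanged.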