Logical reasoning has been an ongoing pursuit in the field of AI. Despite
significant advancements made by large language models (LLMs), they still
struggle with complex logical reasoning problems. To enhance reasoning
performance, one promising direction is scalable oversight, which requires LLMs
to identify their own errors and then improve by themselves. Various
self-verification methods have been proposed in pursuit of this goal.
Nevertheless, whether existing models understand their own errors well is still
under investigation. In this paper, we take a closer look at the
self-verification abilities of LLMs in the context of logical reasoning,
focusing on their ability to identify logical fallacies accurately. We
introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies
categorized in a hierarchical taxonomy. By conducting exhaustive experiments on
FALLACIES, we obtain comprehensive and detailed analyses of a series of models
on their verification abilities. Our main findings suggest that existing LLMs
could struggle to identify fallacious reasoning steps accurately and may fall
short of guaranteeing the validity of self-verification methods. Drawing from
these observations, we offer suggestions for future research and practical
applications of self-verification methods.

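This paper studies the self-verification abilities of large language models in logical reasoning, focusing on their ability to identify logical fallacies accurately. Through experiments on a dataset containing 232 types of fallacies, it finds that existing large language models may struggle to identify fallacies accurately and may not guarantee the validity of self-verification methods. The paper offers suggestions for future research and practical applications of self-verification methods.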

A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning

Developing safe and useful general-purpose AI systems will require us to make
progress on scalable oversight: the problem of supervising systems that
potentially outperform us on most skills relevant to the task at hand.
Empirical work on this problem is not straightforward, since we do not yet have
systems that broadly exceed our abilities. This paper discusses one of the
major ways we think about this problem, with a focus on ways it can be studied
empirically. We first present an experimental design centered on tasks for
which human specialists succeed but unaided humans and current general AI
systems fail. We then present a proof-of-concept experiment meant to
demonstrate a key feature of this experimental design and show its viability
with two question-answering tasks: MMLU and time-limited QuALITY. On these
tasks, we find that human participants who interact with an unreliable
large-language-model dialog assistant through chat -- a trivial baseline
strategy for scalable oversight -- substantially outperform both the model
alone and their own unaided performance. These results are an encouraging sign
that scalable oversight will be tractable to study with present models and
bolster recent findings that large language models can productively assist
humans with difficult tasks.

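This paper discusses the problem of supervising AI systems that exceed human-level ability. It proposes an experimental design and explores how the problem can be studied by having humans interact with an unreliable large-language-model dialog assistant. In experiments on two question-answering tasks, the paper finds that human participants assisted in this way substantially outperform both the model alone and their own unaided performance.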