Recently developed large language models (LLMs) have been shown to perform
remarkably well on a wide range of language understanding tasks. But can they
really "reason" over natural language? This question has been receiving
significant research attention, and many reasoning skills, such as commonsense,
numerical, and qualitative reasoning, have been studied. However, the crucial
skill of logical reasoning has remained underexplored. Existing work
investigating this reasoning ability of LLMs has focused only on a couple of
inference rules (such as modus ponens and modus tollens) of propositional and
first-order logic. To address this limitation, we comprehensively evaluate
the logical reasoning ability of LLMs on 25 different reasoning patterns
spanning propositional, first-order, and non-monotonic logics. To enable
systematic evaluation, we introduce LogicBench, a natural language
question-answering dataset focusing on the use of a single inference rule. We
conduct a detailed analysis with a range of LLMs, such as GPT-4, ChatGPT, Gemini,
Llama-2, and Mistral, using chain-of-thought prompting. Experimental results
show that existing LLMs do not fare well on LogicBench; in particular, they
struggle with instances involving complex reasoning and negations. Furthermore,
they sometimes overlook contextual information necessary for reasoning to
arrive at the correct conclusion. We believe that our work and findings
will facilitate future research on evaluating and enhancing the logical reasoning
ability of LLMs. Data and code are available at
this https URL

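Recently developed large language models (LLMs) perform well on a wide range of language understanding tasks, but can they really "reason" over natural language? This work comprehensively evaluates the logical reasoning ability of LLMs on 25 different reasoning patterns spanning propositional, first-order, and non-monotonic logics, and introduces LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. A detailed analysis is conducted with a range of LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench: they struggle in particular with instances involving complex reasoning and negations, and they sometimes overlook contextual information needed to reach the correct conclusion. We believe that our work and findings will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs.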
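
For readers unfamiliar with the two inference rules named in the abstract, modus ponens and modus tollens, the following is a minimal illustrative sketch of what it means for such a rule to be valid in propositional logic. It is not taken from LogicBench or its code; the helper names `implies` and `is_valid` are hypothetical and chosen only for this example. The sketch checks each argument form by brute-force truth table and contrasts the two valid rules with a well-known fallacy (affirming the consequent).

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars: int = 2) -> bool:
    """An argument form is valid if the conclusion holds under every
    truth assignment that satisfies all of the premises."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # found a counterexample
    return True


# Modus ponens: from (p -> q) and p, infer q.
modus_ponens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: p],
    lambda p, q: q,
)

# Modus tollens: from (p -> q) and not q, infer not p.
modus_tollens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: not q],
    lambda p, q: not p,
)

# Affirming the consequent (a fallacy): from (p -> q) and q, infer p.
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

print(modus_ponens, modus_tollens, affirming_consequent)
```

A benchmark question built on a single rule would, roughly speaking, wrap one such argument form in a natural language context and ask whether a given conclusion follows; the exact format used by LogicBench is not specified in the abstract.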