Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models' reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models' reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on genuine reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

大型语言模型在推理任务中表现出色，但是它们的推理能力深度尚不确定。本文通过综述超越任务准确性的研究，深入探讨模型的推理过程，并调查评估语言模型推理行为的方法，发现其依赖于训练数据的表面模式和相关性，而非真正的推理能力。同时，我们指出需要进一步研究人类推理与语言模型推理之间的关键差异。通过此综述，我们旨在揭示大型语言模型内部复杂的推理过程。

超越准确性：评估大型语言模型的推理行为--调查研究