This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
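As a rough illustration of how such a hypothesis test can be set up, the sketch below compares a model's answers on original problems with its answers on token-perturbed versions of the same problems. It is a minimal sketch under stated assumptions, not the procedure used in this study: the choice of a one-sided exact McNemar test, the function name `mcnemar_exact`, and the outcome lists are all illustrative. The null hypothesis is that the perturbation does not affect correctness, which is what genuine reasoning would predict.

```python
from math import comb


def mcnemar_exact(original_correct, perturbed_correct):
    """One-sided exact McNemar test on paired correctness outcomes.

    H0: changing surface tokens does not change correctness
        (consistent with genuine reasoning).
    H1: the model is correct more often on the original wording
        (consistent with token bias).
    Returns (b, c, p_value), where b counts pairs solved only in the
    original form and c counts pairs solved only in the perturbed form.
    """
    if len(original_correct) != len(perturbed_correct):
        raise ValueError("outcome lists must be paired and equally long")

    pairs = list(zip(original_correct, perturbed_correct))
    b = sum(1 for o, p in pairs if o and not p)  # original only
    c = sum(1 for o, p in pairs if p and not o)  # perturbed only
    n = b + c
    if n == 0:
        # No discordant pairs, so there is no evidence against H0.
        return b, c, 1.0

    # Under H0 each discordant pair falls on either side with
    # probability 0.5, so b ~ Binomial(n, 0.5); p = P(X >= b).
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, c, p_value


# Hypothetical outcomes for 20 paired problems (not data from the study).
original = [True] * 18 + [False] * 2
perturbed = [False] * 10 + [True] * 8 + [False] * 2

b, c, p = mcnemar_exact(original, perturbed)
print(f"original-only: {b}, perturbed-only: {c}, p-value: {p:.6f}")
```

A small p-value rejects the null hypothesis of genuine reasoning for that perturbation. Only the discordant pairs enter the test, so problems the model solves (or fails) in both forms do not influence the result.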


A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners