Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive capabilities have likewise rapidly improved, with the implication that such models are becoming progressively more capable on various real-world tasks. Here I summarise theoretical and empirical considerations to challenge this narrative. I argue that inherent limitations with the benchmarking paradigm, along with specific limitations of existing benchmarks, render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I also contend that alternative methods for assessing LLM capabilities, including adversarial stimuli and interpretability techniques, have shown that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.

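This study challenges the view that the strong performance of large language models (LLMs) on language, knowledge, and reasoning benchmarks reflects general cognitive capability. The author argues that the inherent limitations of existing benchmarks, together with shortcomings in evaluation methods, indicate that LLMs lack robust competence on many tasks, and therefore recommends that benchmark performance not be treated as a reliable indicator of LLM cognitive capabilities.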

Inherent limitations of benchmark-based evaluation of large language models