The advent of large language models (LLMs) has enabled significant
performance gains in the field of natural language processing. However, recent
studies have found that LLMs often resort to shortcuts when performing tasks,
creating an illusion of enhanced performance while lacking generalizability in
their decision rules. This phenomenon introduces challenges in accurately
assessing natural language understanding in LLMs. Our paper provides a concise
survey of relevant research in this area and puts forth a perspective on the
implications of shortcut learning in the evaluation of language models,
specifically for NLU tasks. We call for more research aimed at deepening our
understanding of shortcut learning, developing more robust language models, and
raising the standards of NLU evaluation in real-world scenarios.

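Large language models have delivered substantial performance gains in natural language processing, but recent studies show that they often rely on shortcuts when performing tasks, so performance appears to improve while generalization is lacking. This makes natural language understanding in LLMs difficult to evaluate. The paper briefly surveys the relevant research and offers a perspective on how shortcut learning affects language model evaluation, particularly for NLU tasks. It calls for more research on shortcut learning, to support the development of more robust language models and to raise the standard of NLU evaluation in real-world settings.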

Learning Shortcuts: On the Misleading Promise of NLU in Language Models

Recent advances in zero-shot and few-shot learning have shown promise for a
range of research and practical purposes. However, this fast-growing area lacks
standardized evaluation suites for non-English languages, hindering progress
outside the Anglo-centric paradigm. To address this gap, we
propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that
includes six more complex NLU tasks for Russian, covering multi-hop reasoning,
ethical concepts, logic and commonsense knowledge. TAPE's design focuses on
systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented
adversarial attacks and perturbations for analyzing robustness, and (ii)
subpopulations for nuanced interpretation. A detailed analysis of the
autoregressive baselines indicates that simple spelling-based perturbations
affect performance the most, while paraphrasing the input has a comparatively
negligible effect. At the same time, the results demonstrate a significant gap
between the neural and human baselines for most tasks. We publicly release TAPE
(this http URL) to foster research on robust LMs that can generalize to
new tasks when little to no supervision is available.

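This work proposes TAPE, a benchmark for NLU evaluation in non-English languages, specifically Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. It emphasizes linguistically oriented adversarial attacks and perturbation analysis. Tests of autoregressive baselines show that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a much smaller effect. The results also show a significant gap between neural and human baselines on most tasks.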
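
To make the idea of a spelling-based perturbation concrete, the following is a minimal sketch of how a robustness gap could be measured. It is not TAPE's implementation: the function names (`swap_adjacent_chars`, `accuracy`, `robustness_gap`), the adjacent-character swap, and the toy keyword classifier are all assumptions chosen for illustration.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical perturbation: with probability `rate` per word, swap two
    adjacent inner characters (first and last characters are kept)."""
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # inner position only
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def accuracy(predict, examples):
    """Fraction of (text, label) pairs that `predict` labels correctly."""
    return sum(predict(text) == label for text, label in examples) / len(examples)


def robustness_gap(predict, examples, rate=1.0, seed=0):
    """Accuracy on clean inputs minus accuracy on perturbed inputs."""
    perturbed = [(swap_adjacent_chars(t, rate, seed), y) for t, y in examples]
    return accuracy(predict, examples) - accuracy(predict, perturbed)


# Toy stand-in for a model: a classifier that relies on one surface keyword.
def keyword_classifier(text: str) -> str:
    return "pos" if "great" in text else "neg"


examples = [("great film", "pos"), ("dull film", "neg")]
gap = robustness_gap(keyword_classifier, examples)
```

In this toy setup the classifier is perfect on clean text but loses the positive example once "great" is misspelled, which is the kind of brittleness a perturbation-based evaluation is meant to expose. A real evaluation would replace `keyword_classifier` with model predictions and use the benchmark's own perturbations.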