BriefGPT.xyz
Feb, 2024
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, Wei Bi
TL;DR
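By testing large language models on a wide range of question variations, we evaluate the robustness of their mathematical reasoning. The results show that although these models exhibit varying levels of mathematical reasoning ability, their performance is far from robust.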
Abstract
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there are increasing debates regarding whether these models truly understand and apply