BriefGPT.xyz
May, 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu...
TL;DR
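Large language models have achieved impressive success on many benchmarks for mathematical reasoning, but there is growing concern that some of this performance actually reflects dataset contamination rather than true reasoning ability. The investigation suggests that many models may have partially memorized benchmark examples, which leads to a drop in accuracy on a new benchmark.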
Abstract
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects