Expert-designed close-ended benchmarks serve as vital tools in assessing the
knowledge capacity of large language models (LLMs). Despite their widespread
use, concerns have mounted regarding their reliability due to limited test
scenarios and an unavoidable risk of data contamination. To rectify this, we
present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge
capacity through knowledge-invariant perturbations. These perturbations employ
human-like restatement techniques to generate on-the-fly test samples from
static benchmarks, meticulously retaining knowledge-critical content while
altering irrelevant details. Our toolkit further includes a suite of transition
analyses that compare performance on raw vs. perturbed test sets to precisely
assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are
re-evaluated using PertEval. Results reveal significantly inflated performance
of the LLMs on raw benchmarks, including an absolute 21% overestimation for
GPT-4. Additionally, through a nuanced response pattern analysis, we discover
that PertEval retains LLMs' uncertainty about specious knowledge, uncertainty
that may otherwise be resolved through rote memorization, leading to inflated
performance. We also find that the detailed transition analyses by PertEval
can illuminate weaknesses in existing LLMs' knowledge mastery and guide their
refinement. Given these insights, we posit that PertEval can act as an
essential tool that, when applied alongside any close-ended benchmark, unveils
the true knowledge capacity of LLMs, marking a significant step toward more
trustworthy LLM evaluation.

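The PertEval toolkit uses knowledge-invariant perturbations, built on human-like restatement techniques, to generate on-the-fly test samples from static benchmarks and precisely assess LLMs' genuine knowledge capacity. A re-evaluation of six state-of-the-art LLMs shows significantly inflated performance on raw benchmarks, including an absolute overestimation of more than 21% for GPT-4. In addition, PertEval's detailed transition analyses can reveal weaknesses in existing LLMs' knowledge mastery and guide their refinement, offering an important method for evaluating LLMs' real knowledge capacity.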
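
The comparison of raw and perturbed test sets described above can be illustrated with a minimal sketch. It is not PertEval's implementation: option shuffling is an assumed example of a knowledge-invariant perturbation, and the item layout (`stem`, `options`, `answer`) and the helper names `shuffle_options` and `transition_counts` are hypothetical.

```python
import random
from collections import Counter


def shuffle_options(question, rng):
    """Return a copy of a multiple-choice item with its options reordered.

    The stem and the content of the correct option are unchanged, so the
    knowledge needed to answer stays the same; only the answer position moves.
    (Assumed perturbation, chosen for illustration.)
    """
    order = list(range(len(question["options"])))
    rng.shuffle(order)
    return {
        "stem": question["stem"],
        "options": [question["options"][i] for i in order],
        # New index of the option that was correct in the raw item.
        "answer": order.index(question["answer"]),
    }


def transition_counts(raw_correct, perturbed_correct):
    """Count per-item transitions between raw and perturbed correctness."""
    labels = {
        (True, True): "consistent_correct",
        (True, False): "correct_to_wrong",
        (False, True): "wrong_to_correct",
        (False, False): "consistent_wrong",
    }
    return Counter(labels[pair] for pair in zip(raw_correct, perturbed_correct))


# Placeholder item and placeholder correctness flags, not real results.
item = {"stem": "placeholder question", "options": ["w", "x", "y", "z"], "answer": 2}
perturbed = shuffle_options(item, random.Random(0))

raw_flags = [True, True, False, True]
perturbed_flags = [True, False, False, True]
counts = transition_counts(raw_flags, perturbed_flags)
raw_accuracy = sum(raw_flags) / len(raw_flags)
perturbed_accuracy = sum(perturbed_flags) / len(perturbed_flags)
```

In this sketch, the gap between `raw_accuracy` and `perturbed_accuracy` plays the role of the overestimation, and the `correct_to_wrong` count isolates items answered correctly only in their raw form.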

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

Prior work has demonstrated large language models' (LLMs) potential to
discern statistical tendencies within their pre-training corpora. Despite that,
many examinations of LLMs' knowledge capacity focus on knowledge explicitly
appearing in the training data or implicitly inferable from similar contexts.
How well an LLM captures the corpus-level statistical trends of concepts for
reasoning, especially long-tail ones, is still underexplored. In this study, we
introduce a novel few-shot question-answering task (CPopQA) that examines LLMs'
statistical ranking abilities for long-tail cultural concepts (e.g., holidays),
with a specific focus on these concepts' popularity in the United States and
the United Kingdom, respectively. We curate a dataset containing 459 holidays
across 58 countries, generating a total of 6,000 QA testing pairs. Experiments
on four strong LLMs show that large models can rank long-tail cultural
concepts by their statistical tendency. Notably, GPT-3.5 displayed superior
performance and showed potential to identify geo-cultural proximity across
continents.

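This study introduces a novel few-shot question-answering task (CPopQA) to evaluate large language models' (LLMs) statistical ranking abilities for long-tail cultural concepts (such as holidays), with a particular focus on these concepts' popularity in the United States and the United Kingdom, and finds that GPT-3.5 shows superior performance in identifying geo-cultural proximity across continents.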
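
One possible way to score a model's ranking of concepts by popularity is sketched below. The text above does not state which metric CPopQA uses, so pairwise agreement is an assumption made for illustration; the function name `pairwise_ranking_accuracy`, the concept names, and the popularity scores are all hypothetical placeholders.

```python
from itertools import combinations


def pairwise_ranking_accuracy(gold_scores, predicted_order):
    """Fraction of concept pairs ordered the same way by gold and prediction.

    gold_scores: dict mapping concept name to a popularity score (higher = more popular).
    predicted_order: list of the same concept names, most popular first.
    Pairs with tied gold scores are skipped.
    """
    position = {name: i for i, name in enumerate(predicted_order)}
    agree = 0
    total = 0
    for a, b in combinations(gold_scores, 2):
        if gold_scores[a] == gold_scores[b]:
            continue  # a tie carries no ordering information
        total += 1
        gold_a_first = gold_scores[a] > gold_scores[b]
        predicted_a_first = position[a] < position[b]
        agree += gold_a_first == predicted_a_first
    return agree / total if total else 0.0


# Invented placeholder scores, not taken from the dataset.
gold = {"holiday_a": 90, "holiday_b": 40, "holiday_c": 10}
prediction = ["holiday_a", "holiday_c", "holiday_b"]
score = pairwise_ranking_accuracy(gold, prediction)
```

Here the prediction orders two of the three pairs correctly and swaps `holiday_b` and `holiday_c`, so `score` is 2/3.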