Expert-designed close-ended benchmarks serve as vital tools in assessing the
knowledge capacity of large language models (LLMs). Despite their widespread
use, concerns have mounted regarding their reliability due to limited test
scenarios and an unavoidable risk of data contamination. To rectify this, we
present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge
capacity through knowledge-invariant perturbations. These perturbations employ
human-like restatement techniques to generate on-the-fly test samples from
static benchmarks, meticulously retaining knowledge-critical content while
altering irrelevant details. Our toolkit further includes a suite of transition
analyses that compare performance on raw vs. perturbed test sets to precisely
assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are
re-evaluated using PertEval. Results reveal significantly inflated performance
of the LLMs on raw benchmarks, including an absolute 21% overestimation for
GPT-4. Additionally, through a nuanced response pattern analysis, we discover
that PertEval retains LLMs' uncertainty about specious knowledge, an
uncertainty that rote memorization can resolve on raw benchmarks, thereby
inflating performance. We also find that the detailed transition analyses in
PertEval can illuminate weaknesses in existing LLMs' knowledge mastery and
guide their refinement. Given these insights, we posit that PertEval can act as an
essential tool that, when applied alongside any close-ended benchmark, unveils
the true knowledge capacity of LLMs, marking a significant step toward more
trustworthy LLM evaluation.

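In summary, the PertEval toolkit uses knowledge-invariant perturbations, built
on human-like restatement techniques, to generate on-the-fly test samples from
static benchmarks and thereby assess LLMs' genuine knowledge capacity more
precisely. A re-evaluation of six state-of-the-art LLMs shows that their
performance on raw benchmarks is significantly inflated, including an absolute
overestimation of 21% for GPT-4. In addition, PertEval's detailed transition
analyses can reveal weaknesses in existing LLMs' knowledge mastery and guide
their refinement, offering an important method for evaluating the true
knowledge capacity of LLMs.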
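
The comparison of raw and perturbed results described above can be illustrated
with a minimal sketch. It is not PertEval's actual API: the function names
(`permute_options`, `transition_counts`, `summarize`) are hypothetical, and
reordering the answer options is used here only as one simple example of a
perturbation that plausibly leaves the tested knowledge unchanged. The sketch
assumes per-question correctness flags are already available for the same
model on both the raw and the perturbed test set.

```python
import random
from collections import Counter


def permute_options(question, options, answer_idx, rng):
    """Shuffle the answer options of a multiple-choice item.

    Illustrative perturbation only (an assumption, not PertEval's code):
    the question text and the set of options are unchanged, so the
    knowledge needed to answer is the same; only the position of the
    correct option may move.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    new_options = [options[i] for i in order]
    # Locate where the originally correct option ended up.
    new_answer_idx = order.index(answer_idx)
    return question, new_options, new_answer_idx


def transition_counts(raw_correct, pert_correct):
    """Count per-question (raw_correct, perturbed_correct) transitions."""
    if len(raw_correct) != len(pert_correct):
        raise ValueError("raw and perturbed results must cover the same questions")
    counts = Counter()
    for raw, pert in zip(raw_correct, pert_correct):
        counts[(bool(raw), bool(pert))] += 1
    return counts


def summarize(raw_correct, pert_correct):
    """Summarize accuracy on both sets and the transition rates between them."""
    n = len(raw_correct)
    if n == 0:
        raise ValueError("no results to summarize")
    counts = transition_counts(raw_correct, pert_correct)
    raw_acc = sum(bool(x) for x in raw_correct) / n
    pert_acc = sum(bool(x) for x in pert_correct) / n
    return {
        "raw_acc": raw_acc,
        "pert_acc": pert_acc,
        # Absolute gap between raw and perturbed accuracy.
        "acc_drop": raw_acc - pert_acc,
        # Questions answered correctly on raw data but not after perturbation.
        "correct_to_wrong": counts[(True, False)] / n,
        # Questions answered correctly only after perturbation.
        "wrong_to_correct": counts[(False, True)] / n,
    }
```

In this sketch, a large `correct_to_wrong` rate combined with a small
`wrong_to_correct` rate would suggest that raw accuracy overstates what the
model actually knows, which is the kind of gap the abstract reports.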