We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging them on question-answering tasks with modified terms. We reasoned that an agent that ``truly'' understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could appear in the question, in the answers, or in both. Despite the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new dataset provides a rigorous benchmark for testing true model comprehension, and poses a challenge to the broader scientific community.
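The term replacement described above can be illustrated with a minimal sketch. This is not the authors' actual pipeline: the function names `replace_term` and `make_variants`, the "Suppose ... means ..." phrasing, and the example question are all assumptions made for illustration only.

```python
import re


def replace_term(text, term, dummy, definition):
    """Swap a key term for a dummy word and prepend the dummy word's definition."""
    # Match the key term as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    # A function replacement avoids backslash interpretation in `dummy`.
    replaced, count = pattern.subn(lambda _match: dummy, text)
    if count == 0:
        return text  # term absent: leave the text untouched
    # Hypothetical definition template; the dataset's real wording may differ.
    return f"Suppose '{dummy}' means {definition}. {replaced}"


def make_variants(question, choices, term, dummy, definition):
    """Build the three variants: term replaced in question, answers, or both."""
    new_question = replace_term(question, term, dummy, definition)
    new_choices = [replace_term(c, term, dummy, definition) for c in choices]
    return {
        "question_only": (new_question, choices),
        "answer_only": (question, new_choices),
        "question_and_answer": (new_question, new_choices),
    }


# Hypothetical example item (not taken from the dataset).
question = "Which organelle carries out photosynthesis?"
choices = ["Mitochondrion", "Chloroplast", "Ribosome", "Nucleus"]
variants = make_variants(
    question,
    choices,
    term="photosynthesis",
    dummy="Dummy",
    definition="the conversion of light energy into chemical energy",
)
print(variants["question_only"][0])
```

A model that understands the underlying concept should answer the rewritten question as readily as the original, since the definition supplies everything the removed term conveyed.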

Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models