In the age of large-scale language models, benchmarks such as Massive
Multitask Language Understanding (MMLU) have been pivotal in pushing the
boundaries of what AI can achieve in language comprehension and reasoning
across diverse domains. However, as models continue to improve, their
performance on these benchmarks has begun to plateau, making it increasingly
difficult to discern differences in model capabilities. This paper introduces
MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven
MMLU benchmark by integrating more challenging, reasoning-focused questions and
expanding the choice set from four to ten options. Additionally, MMLU-Pro
eliminates the trivial and noisy questions in MMLU. Our experimental results
show that MMLU-Pro not only raises the challenge, causing accuracy to drop by
16% to 33% compared to MMLU, but also demonstrates greater stability
under varying prompts. With 24 different prompt styles tested, the sensitivity
of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in
MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT)
reasoning achieved better performance on MMLU-Pro than with direct answering,
which is in stark contrast to the findings on the original MMLU, indicating
that MMLU-Pro includes more complex reasoning questions. Our assessments
confirm that MMLU-Pro is a more discriminative benchmark, better suited to
tracking progress in the field.

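In the era of large-scale language models, this paper introduces MMLU-Pro, an
enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark
by integrating more challenging, reasoning-focused questions, expanding the
choice set from four to ten options, and eliminating the trivial and noisy
questions in MMLU. Experiments show that, compared with MMLU, MMLU-Pro not only
raises the difficulty, lowering accuracy by 16% to 33%, but also reduces the
sensitivity of model scores to prompt variations. In addition, we found that
models using Chain of Thought (CoT) reasoning perform better on MMLU-Pro than
models answering directly, in stark contrast to the findings on the original
MMLU, indicating that MMLU-Pro contains more complex reasoning questions. Our
evaluation confirms that MMLU-Pro is a more discriminative benchmark for
tracking progress in the field.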
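
Two of the quantities mentioned above can be made concrete with a minimal
sketch. It assumes that prompt sensitivity is summarized as the spread of one
model's accuracy across prompt styles; the function names, the choice of range
and standard deviation as summaries, and the example scores are all
hypothetical and are not taken from the paper's evaluation code or results.

```python
from statistics import mean, pstdev


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def prompt_sensitivity(scores: list[float]) -> dict[str, float]:
    """Summarize how much one model's accuracy moves across prompt styles.

    Range and population standard deviation are assumed summaries here,
    not necessarily the definition used by the MMLU-Pro authors.
    """
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "std": pstdev(scores),
    }


# Going from four to ten options lowers the random-guess floor.
four_option_floor = chance_accuracy(4)
ten_option_floor = chance_accuracy(10)

# Made-up accuracies for one model under four prompt styles (illustration only).
hypothetical_scores = [0.70, 0.72, 0.69, 0.71]
summary = prompt_sensitivity(hypothetical_scores)
```

Under this reading, a benchmark that is more stable under varying prompts
would show a smaller range or standard deviation for the same model and the
same set of prompt styles.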