Large language models have recently made tremendous progress in a variety of
aspects, e.g., cross-task generalization, instruction following.
Comprehensively evaluating the capability of large language models in multiple
tasks is of great importance. In this paper, we propose M3KE, a Massive
Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is developed to
measure knowledge acquired by Chinese large language models by testing their
multitask accuracy in zero- and few-shot settings. We have collected 20,477
questions from 71 tasks. Our selection covers all major levels of Chinese
education system, ranging from the primary school to college, as well as a wide
variety of subjects, including humanities, history, politics, law, education,
psychology, science, technology, art and religion. All questions are
multiple-choice questions with four options, hence guaranteeing a standardized
and unified assessment process. We've assessed a number of state-of-the-art
open-source Chinese large language models on the proposed benchmark. The size
of these models varies from 335M to 130B parameters. Experiment results
demonstrate that they perform significantly worse than GPT-3.5 that reaches an
accuracy of ~ 48% on M3KE. The dataset is available at
this https URL

这篇论文介绍了 M3KE 评估标准，它是一个用于测试中文大型语言模型在各种学科和教育级别下零样本和少样本的多任务准确性的基准。通过在该基准上对比，研究人员发现 GPT-3.5 在 M3KE 上达到了约 48% 的准确率，比其他中文语言模型表现更为优异。