Large language models have recently made tremendous progress in a variety of
aspects, such as cross-task generalization and instruction following.
Comprehensively evaluating the capability of large language models in multiple
tasks is of great importance. In this paper, we propose M3KE, a Massive
Multi-Level Multi-Subject Knowledge Evaluation benchmark, which is developed to
measure knowledge acquired by Chinese large language models by testing their
multitask accuracy in zero- and few-shot settings. We have collected 20,477
questions from 71 tasks. Our selection covers all major levels of the Chinese
education system, ranging from primary school to college, as well as a wide
variety of subjects, including humanities, history, politics, law, education,
psychology, science, technology, art and religion. All questions are
multiple-choice questions with four options, hence guaranteeing a standardized
and unified assessment process. We have assessed a number of state-of-the-art
open-source Chinese large language models on the proposed benchmark. The size
of these models varies from 335M to 130B parameters. Experimental results
demonstrate that they perform significantly worse than GPT-3.5, which reaches an
accuracy of about 48% on M3KE. The dataset is available at
this https URL
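
As a rough illustration of the scoring described above, the following sketch computes per-subject and overall accuracy for four-option multiple-choice questions, together with the 25% random-guess baseline that four options imply. The record layout, the function names, and the toy data are all assumptions made for illustration; they are not taken from the M3KE release.

```python
from collections import defaultdict

OPTIONS = ("A", "B", "C", "D")      # four answer options per question
CHANCE_ACCURACY = 1 / len(OPTIONS)  # 0.25 for uniform random guessing


def accuracy_by_subject(records):
    """Return {subject: accuracy} from (subject, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        if gold not in OPTIONS:
            raise ValueError(f"unexpected gold label: {gold!r}")
        total[subject] += 1
        # A prediction outside A-D (e.g. an unparsable reply) counts as wrong.
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


def overall_accuracy(records):
    """Micro-average: fraction of all questions answered correctly."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(pred == gold for _, pred, gold in records) / len(records)


# Hypothetical toy records, not data from the benchmark.
toy = [
    ("history", "A", "A"),
    ("history", "C", "B"),
    ("law", "D", "D"),
    ("law", "B", "B"),
]
per_subject = accuracy_by_subject(toy)
overall = overall_accuracy(toy)
```

Comparing `overall` against `CHANCE_ACCURACY` shows how far a model is above random guessing; the same scoring applies to zero-shot and few-shot runs, since only the prompt differs.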

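This paper introduces the M3KE benchmark, which tests the zero-shot and
few-shot multitask accuracy of Chinese large language models across a range of
subjects and education levels. Comparing models on the benchmark, the
researchers find that GPT-3.5 reaches an accuracy of about 48% on M3KE,
outperforming the Chinese language models evaluated.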

M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models

The development of large-scale Chinese language models is flourishing, yet
there is a lack of corresponding capability assessments. Therefore, we propose
a test to measure the multitask accuracy of large Chinese language models. This
test encompasses four major domains: medicine, law, psychology, and
education, with 15 subtasks in medicine and 8 subtasks in education. We found
that the best-performing models in the zero-shot setting outperformed the
worst-performing models by nearly 22 percentage points on average. Across the
four major domains, the average zero-shot accuracy of all models did not exceed
0.5. Among the subtasks, GPT-3.5-turbo achieved a zero-shot accuracy of 0.703
in clinical medicine, the highest accuracy of any model on any subtask. All
models performed poorly in the legal
domain, with the highest zero-shot accuracy reaching only 0.259. By
comprehensively evaluating the breadth and depth of knowledge across multiple
disciplines, this test can more accurately identify the shortcomings of the
models.
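
The "percentage points on average" comparison above can be read in more than one way. The sketch below shows one possible reading: average each model's zero-shot accuracy over the four domains, then take the difference between the best and worst averages. The model names and all scores are invented for illustration and are not results from the paper.

```python
def macro_average(domain_scores):
    """Unweighted mean of per-domain accuracies (each domain counts equally)."""
    return sum(domain_scores.values()) / len(domain_scores)


def gap_in_percentage_points(scores_by_model):
    """Return (best_model, worst_model, gap), with the gap in percentage points."""
    averages = {model: macro_average(s) for model, s in scores_by_model.items()}
    best = max(averages, key=averages.get)
    worst = min(averages, key=averages.get)
    # Accuracies are fractions in [0, 1]; multiply by 100 for percentage points.
    return best, worst, (averages[best] - averages[worst]) * 100


# Hypothetical zero-shot accuracies; these are NOT figures from the paper.
scores = {
    "model_a": {"medicine": 0.50, "law": 0.25, "psychology": 0.45, "education": 0.40},
    "model_b": {"medicine": 0.30, "law": 0.20, "psychology": 0.25, "education": 0.25},
}
best, worst, gap = gap_in_percentage_points(scores)
```

A gap expressed in percentage points is an absolute difference between two accuracies, not a relative improvement, which is why the averages are subtracted rather than divided.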

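This paper proposes a test of the multitask accuracy of large Chinese language
models. The test covers four major domains (medicine, law, psychology, and
education), with 15 subtasks in medicine and 8 in education. In the zero-shot
setting, the best-performing models outperform the worst-performing models by
nearly 22 percentage points on average. By evaluating the breadth and depth of
knowledge across multiple domains, the test can identify model shortcomings
more accurately.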

Measuring Massive Multitask Chinese Understanding

We propose a new test to measure a text model's multitask accuracy. The test
covers 57 tasks including elementary mathematics, US history, computer science,
law, and more. To attain high accuracy on this test, models must possess
extensive world knowledge and problem-solving ability. We find that while most
recent models have near random-chance accuracy, the very largest GPT-3 model
improves over random chance by almost 20 percentage points on average. However,
on every one of the 57 tasks, the best models still need substantial
improvements before they can reach expert-level accuracy. Models also have
lopsided performance and frequently do not know when they are wrong. Worse,
they still have near-random accuracy on some socially important subjects such
as morality and law. By comprehensively evaluating the breadth and depth of a
model's academic and professional understanding, our test can be used to
analyze models across many tasks and to identify important shortcomings.

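The paper proposes a new test to measure a text model's multitask accuracy,
covering 57 tasks including mathematics, history, computer science, and law. To
attain high accuracy, models must possess extensive world knowledge and
problem-solving ability. By comprehensively evaluating the breadth and depth of
a model's academic and professional understanding, the test can be used to
analyze models across many tasks and to identify important shortcomings.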