As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. This highlights significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.

本文介绍了一个涵盖自然科学、社会科学、工程学和人文学科等多个领域的全面中文基准 CMMLU，并通过评估 18 种面向性能的多语言和中文 LLMs，在不同的主题和设置下评估它们的性能，结果显示，大多数现有LLM在提供上下文示例和思维链提示时仍然难以达到50%的平均准确性，而随机基准线为25%，这凸显出LLMs有显着的改进空间。

CMMLU: 用于测量中文海量多任务语言理解的工具