Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications. However, the existing benchmarks for comprehensively evaluating these LLMs are still insufficient, particularly in terms of measuring knowledge that LLMs capture. Current datasets collect questions from Chinese examinations across different subjects and educational levels to address this issue. Yet, these benchmarks primarily focus on objective questions such as multiple-choice questions, leading to a lack of diversity in question types. To tackle this problem, we propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark in this paper. LHMKE is designed to provide a comprehensive evaluation of the knowledge acquisition capabilities of Chinese LLMs. It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams. Notably, LHMKE includes both objective and subjective questions, offering a more holistic evaluation of the knowledge level of LLMs. We have assessed 11 Chinese LLMs under the zero-shot setting, which aligns with real examinations, and compared their performance across different subjects. We also conduct an in-depth analysis to check whether GPT-4 can automatically score subjective predictions. Our findings suggest that LHMKE is a challenging and advanced testbed for Chinese LLMs.

LHMKE是一种大规模、全面和多学科知识评估基准，旨在为中文大型语言模型的知识获取能力提供全面评估。它包括10,465个问题，涵盖30个学科的75个任务，既包含客观题又包含主观题，以更全面评估大型语言模型的知识水平。我们对11个中文大型语言模型进行了零-shot评估，并比较了它们在不同学科的性能。通过深入分析，我们也验证了GPT-4是否能够自动评分主观预测。我们的研究结果表明，LHMKE是一个具有挑战性和先进性的中文大型语言模型评估标准。

LHMKE：用于中文大语言模型的大规模综合多学科知识评估基准