From pre-trained language models (PLMs) to large language models (LLMs), the
field of natural language processing (NLP) has seen steep performance gains
and wide practical adoption. How a research field is evaluated guides the
direction of its improvement. However, LLMs are extremely hard to evaluate
thoroughly, for two reasons. First, traditional NLP tasks have become
inadequate because LLMs already perform so well on them. Second, existing
evaluation tasks struggle to keep up with the wide range of real-world
applications. To tackle these problems, prior work has proposed various
benchmarks to better evaluate LLMs. To organize the numerous evaluation tasks
in both academia and industry, we survey multiple papers on LLM evaluation. We
summarize four core competencies of LLMs: reasoning, knowledge, reliability,
and safety. For each competency, we introduce its definition, corresponding
benchmarks, and metrics. Under this competency architecture, similar tasks are
grouped to reflect the corresponding ability, and new tasks can easily be
added to the system. Finally, we give our suggestions on the future direction
of LLM evaluation.
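
The competency architecture described above, in which tasks are grouped under a competency and new tasks can be added later, can be pictured as a small registry. The following is only an illustrative sketch: the class name `CompetencyRegistry`, the task names, the scores, and the choice of a plain average as the aggregate are all assumptions made for this example and are not taken from the survey.

```python
from collections import defaultdict
from statistics import mean

# The four core competencies named in the abstract.
COMPETENCIES = ("reasoning", "knowledge", "reliability", "safety")


class CompetencyRegistry:
    """Hypothetical container that groups task scores by competency."""

    def __init__(self):
        # competency -> {task name -> score in [0, 1]}
        self._tasks = defaultdict(dict)

    def add_task(self, competency, task_name, score):
        """Register a task result under one of the four competencies."""
        if competency not in COMPETENCIES:
            raise ValueError(f"unknown competency: {competency!r}")
        self._tasks[competency][task_name] = score

    def summary(self):
        """Average the task scores per competency (an assumed aggregation)."""
        return {c: mean(t.values()) for c, t in self._tasks.items() if t}


registry = CompetencyRegistry()
# Placeholder task names and scores, purely for illustration.
registry.add_task("reasoning", "task_a", 0.5)
registry.add_task("reasoning", "task_b", 1.0)
registry.add_task("safety", "task_c", 0.25)
print(registry.summary())
```

Adding a new benchmark is then a single `add_task` call under the competency it probes, which mirrors the extensibility claim in the abstract.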

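From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has achieved marked performance gains and wide practical application. To address the difficulty of evaluating LLMs, this paper surveys multiple papers on LLM evaluation and summarizes four core competencies of LLMs: reasoning, knowledge, reliability, and safety. Under this competency structure, similar tasks are merged to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, it gives suggestions on future directions for LLM evaluation.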

A Survey of Large Language Model Evaluation from the Perspective of Core Competencies

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Automatic assessment of learner competencies is a fundamental task in
intelligent tutoring systems. An assessment rubric typically provides an
effective description of the relevant competencies and competence levels. This
paper presents an approach to deriving a learner model directly from an
assessment rubric that defines some (partial) ordering of competence levels.
The model is based on Bayesian networks and exploits logical gates with
uncertainty (often referred to as noisy gates) to reduce the number of model
parameters, so as to simplify their elicitation by experts and allow real-time
inference in intelligent tutoring systems. We illustrate how the approach can
be applied to automate the human assessment of an activity developed for
testing computational thinking skills. Because the model is simple to elicit
from the assessment rubric, the assessment of several tasks can be automated
quickly, making them easier to use in adaptive assessment tools and
intelligent tutoring systems.
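
To make the parameter reduction concrete, here is a minimal sketch of a generic noisy-OR gate, one common kind of noisy gate. The abstract does not give the paper's actual network structure or gate type, so the function names (`noisy_or`, `full_cpt`), the three-parent setup, and the inhibitor values are assumptions chosen only for illustration.

```python
from itertools import product


def noisy_or(active_parents, inhibitors, leak=0.0):
    """Return P(child = 1 | parents) under a noisy-OR gate.

    inhibitors[i] is the probability that parent i, although active,
    fails to switch the child on. leak is the probability that the
    child is on even when no parent is active.
    """
    p_off = 1.0 - leak
    for is_active, q in zip(active_parents, inhibitors):
        if is_active:
            p_off *= q  # every active parent must fail independently
    return 1.0 - p_off


def full_cpt(inhibitors, leak=0.0):
    """Expand the gate into the full conditional probability table."""
    n = len(inhibitors)
    return {
        state: noisy_or(state, inhibitors, leak)
        for state in product((0, 1), repeat=n)
    }


# Hypothetical inhibitor probabilities for three parent variables.
inhibitors = [0.1, 0.2, 0.3]
cpt = full_cpt(inhibitors)

# An expert supplies len(inhibitors) numbers; the table they induce has
# 2 ** len(inhibitors) rows, which is what would be elicited otherwise.
print(len(inhibitors), "parameters ->", len(cpt), "table rows")
```

With n binary parents, the gate needs n inhibitor values (plus an optional leak) rather than one probability for each of the 2^n parent configurations, which is the kind of saving that makes expert elicitation and fast inference practical.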

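This paper proposes a Bayesian-network-based approach that uses logical gates with uncertainty to simplify the model and derives a learner model directly from the assessment rubric. The approach can be applied to tests of computational thinking skills and opens up the possibility of fast automated assessment in adaptive assessment tools and intelligent tutoring systems.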