During the evolution of large models, performance evaluation is necessarily
performed on the intermediate models to assess their capabilities, and on the
well-trained model to ensure safety before practical application. However,
current model evaluations mainly rely on specific tasks and datasets, lacking a
unified framework for assessing the multidimensional intelligence of large
models. In this perspective, we advocate for a comprehensive framework for
artificial general intelligence (AGI) testing, aimed at meeting the testing
needs of large language models and multi-modal large models with enhanced
capabilities. The AGI test framework bridges cognitive science and natural
language processing to encompass the full spectrum of intelligence facets,
including crystallized intelligence, a reflection of amassed knowledge and
experience; fluid intelligence, characterized by problem-solving and adaptive
reasoning; social intelligence, signifying comprehension and adaptation within
multifaceted social scenarios; and embodied intelligence, denoting the ability
to interact with the physical environment. To assess the multidimensional
intelligence of large models, the AGI test consists of a battery of
well-designed cognitive tests adopted from human intelligence tests, which are
then naturally encapsulated in an immersive virtual community. We propose that the
complexity of AGI testing tasks should increase in step with the
advancements in large models. We underscore the need for careful
interpretation of test results to avoid false negatives and false positives. We
believe that cognitive science-inspired AGI tests will effectively guide the
targeted improvement of large models in specific dimensions of intelligence and
accelerate the integration of large models into human society.

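Performance evaluation of large models is a necessary step for ensuring their capabilities and the safety of their application, yet current model evaluation lacks a unified framework for assessing the multidimensional intelligence of large models. This paper proposes a comprehensive artificial intelligence test framework that draws on cognitive science and natural language processing. The framework aims to assess the intelligence level of large models and, through a series of cognitive tests, to guide their improvement along different dimensions of intelligence and accelerate their integration into human society.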

Integration of cognitive tasks into artificial general intelligence test for large models

Artificial intelligence develops techniques and systems whose performance
must be evaluated on a regular basis in order to certify and foster progress in
the discipline. We will describe and critically assess the different ways AI
systems are evaluated. We first focus on the traditional task-oriented
evaluation approach. We see that black-box (behavioural) evaluation is becoming
more and more common, as AI systems are becoming more complex and
unpredictable. We identify three kinds of evaluation: human discrimination,
problem benchmarks and peer confrontation. We describe the limitations of the
many evaluation settings and competitions in these three categories and propose
several ideas for a more systematic and robust evaluation. We then focus on a
less customary (and more challenging) ability-oriented evaluation approach, where a
system is characterised by its (cognitive) abilities, rather than by the tasks
it is designed to solve. We discuss several possibilities: the adaptation of
cognitive tests used for humans and animals, the development of tests derived
from algorithmic information theory or more general approaches under the
perspective of universal psychometrics.

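By describing and assessing the different ways AI systems are evaluated, this paper first focuses on the traditional task-oriented evaluation approach, then proposes a newer ability-oriented evaluation approach, and discusses several possible forms of evaluation, including tests derived from cognitive tests and more general approaches based on universal psychometrics.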