Large language models (LLMs) are primarily evaluated by overall performance
on various text understanding and generation tasks. However, such a paradigm
fails to comprehensively differentiate fine-grained language and cognitive
skills, leaving LLMs' capabilities insufficiently interpreted.
In this paper, we present FAC$^2$E, a framework for Fine-grAined and
Cognition-grounded LLMs' Capability Evaluation. Specifically, we formulate
LLM evaluation in a multi-dimensional and explainable manner by dissociating
language-related capabilities from cognition-related ones. In addition,
by extracting the intermediate reasoning from LLMs, we further break down
the process of applying a specific capability into three sub-steps: recalling
relevant knowledge, utilizing knowledge, and solving problems. Finally,
FAC$^2$E evaluates each sub-step of each fine-grained capability, providing a
two-faceted diagnosis for LLMs. Using FAC$^2$E, we identify a common
shortfall in knowledge utilization among models and propose a straightforward,
knowledge-enhanced method to mitigate this issue. Our results not only show
promising performance improvements but also highlight a direction for future
LLM advancements.
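
The sub-step evaluation described above can be pictured as a small aggregation routine. The following is only a minimal sketch under stated assumptions, not the scoring procedure of FAC$^2$E: the three sub-step names come from the text, while the `StepScores` record, the `diagnose` and `weakest_step` helpers, the capability labels, and all numeric scores are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Sub-step names follow the abstract; everything else here is hypothetical.
SUB_STEPS = ("recall", "utilize", "solve")


@dataclass(frozen=True)
class StepScores:
    """Scores for one evaluated example (placeholder structure)."""
    capability: str  # hypothetical capability label
    recall: float    # recalling relevant knowledge
    utilize: float   # utilizing knowledge
    solve: float     # solving the problem


def diagnose(records):
    """Average each sub-step score separately for every capability."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.capability, []).append(record)
    return {
        capability: {
            step: mean(getattr(r, step) for r in group) for step in SUB_STEPS
        }
        for capability, group in grouped.items()
    }


def weakest_step(profile):
    """Return the sub-step with the lowest average score."""
    return min(profile, key=profile.get)


# Made-up scores, used only to exercise the code.
records = [
    StepScores("language-related", recall=1.0, utilize=0.5, solve=0.75),
    StepScores("language-related", recall=0.5, utilize=0.25, solve=0.75),
    StepScores("cognition-related", recall=0.75, utilize=0.25, solve=0.5),
]
report = diagnose(records)
for capability, profile in report.items():
    print(capability, profile, "weakest:", weakest_step(profile))
```

Keeping one score per sub-step, rather than a single number per task, is what lets a report of this shape point at a specific stage (for instance knowledge utilization) instead of only an overall score.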

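FAC$^2$E is a capability evaluation framework for large language models (LLMs). By extracting the intermediate reasoning of LLMs, it breaks down the process of applying a specific capability into three sub-steps and evaluates each sub-step of each fine-grained capability, thereby comprehensively distinguishing LLMs' language-related capabilities from their cognition-related ones. Using FAC$^2$E, we find a common shortfall in knowledge utilization among models and propose a simple, knowledge-enhanced method to mitigate this issue. Our study not only shows promising performance improvements but also offers insight into the direction of future LLM development.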