We present metrics for evaluating dialog systems through a
psychologically grounded "human" lens, in which conversational agents express a
diversity of both states (short-term factors like emotions) and traits
(longer-term factors like personality) just as people do. These interpretable
metrics consist of five measures derived from established psychological constructs
that can be applied both across dialogs and to turns within dialogs: emotional
entropy, linguistic style matching, emotion matching, agreeableness, and
empathy. We compare these human metrics against six state-of-the-art automatic
metrics (e.g., BARTScore and BLEURT) on seven standard dialog system data sets. We
also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which
consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We
demonstrate that the proposed human metrics offer novel information, are
uncorrelated with automatic metrics, and improve accuracy over existing
automatic metrics in predicting crowd-sourced dialog judgements. The
interpretability and unique signal of our proposed human-centered framework
make it a valuable tool for evaluating and improving dialog systems.

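This work proposes psychologically grounded metrics for evaluating dialog systems, comprising five measures: emotional entropy, linguistic style matching, emotion matching, agreeableness, and empathy. These measures are compared against six state-of-the-art automatic evaluation metrics, with experiments on dialog data from three different models (ChatGPT, GPT-3, and BlenderBot). The results show that the proposed human metrics provide novel information, are uncorrelated with automatic metrics, and predict crowd-sourced dialog judgements more accurately than existing automatic metrics. The proposed human-centered framework is interpretable and offers a unique signal, making it a valuable tool for evaluating and improving dialog systems.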
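
The text above names emotional entropy as one of the five measures but does not define it. One plausible reading, assumed here and not confirmed by the source, is the Shannon entropy of the distribution of emotion labels an agent expresses across its turns. The sketch below illustrates that reading only; the function name `emotional_entropy` and the example labels are hypothetical, and the per-turn emotion labels are assumed to come from some external classifier.

```python
import math
from collections import Counter


def emotional_entropy(emotion_labels):
    """Shannon entropy (in bits) of the empirical emotion-label distribution.

    Assumption: each dialog turn has already been assigned one emotion label.
    This is an illustrative definition, not necessarily the one used in the paper.
    """
    if not emotion_labels:
        return 0.0
    counts = Counter(emotion_labels)
    total = len(emotion_labels)
    # Sum of p * log2(1/p) over observed labels, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical per-turn labels for two agents.
varied_agent = ["joy", "sadness", "anger", "joy"]
flat_agent = ["joy", "joy", "joy", "joy"]

print(emotional_entropy(varied_agent))  # 1.5
print(emotional_entropy(flat_agent))    # 0.0
```

Under this reading, an agent that always expresses the same emotion scores zero, while one that spreads its turns across several emotions scores higher, which matches the stated goal of measuring the diversity of expressed states.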