Existing reference-free turn-level evaluation metrics for chatbots
inadequately capture the interaction between the user and the system.
Consequently, they often correlate poorly with human evaluations. To address
this issue, we propose a novel model-agnostic approach that leverages
Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level
interaction between the system and the user based on a given evaluation
dimension. Experimental results on the widely used FED dialogue evaluation
dataset demonstrate that our approach significantly improves the correlation
with human judgment compared with existing evaluation systems. By replacing the
negative log-likelihood-based scorer with our proposed C-PMI scorer, we achieve
a 60.5% relative improvement in Spearman correlation on average for the FED
evaluation metric. Our code is publicly available.
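
As a rough illustration of the quantity involved, conditional pointwise mutual information between two events x and y given z can be written as log p(x | y, z) - log p(x | z). The sketch below computes that difference from two log-probabilities. The function name, the mapping of x, y and z onto dialogue turns, and the probability values are all hypothetical; the abstract does not specify how the scorer is implemented, so this is not the authors' code.

```python
import math


def conditional_pmi(logp_x_given_yz: float, logp_x_given_z: float) -> float:
    """Return C-PMI(x; y | z) = log p(x | y, z) - log p(x | z).

    Both arguments are natural-log probabilities, e.g. as produced by a
    language model scoring the same text x under two different contexts.
    """
    return logp_x_given_yz - logp_x_given_z


# Hypothetical numbers, chosen only for illustration:
# probability of a user turn x given the context z AND the system response y,
# versus the probability of x given the context z alone.
p_x_given_yz = 0.20
p_x_given_z = 0.05

score = conditional_pmi(math.log(p_x_given_yz), math.log(p_x_given_z))
print(f"C-PMI = {score:.3f}")  # positive: y makes x more likely given z
```

A positive value means that knowing y raises the probability of x beyond what z alone predicts; a value near zero means y adds little information about x.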

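Proposes a novel model-agnostic method that uses conditional pointwise mutual information to measure the dialogue interaction between the system and the user under a given evaluation dimension. Experimental results show that, compared with existing evaluation systems, the method significantly improves correlation with human judgment on the widely used FED dialogue evaluation dataset, with an average relative improvement in correlation of 60.5% on the FED evaluation metric.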

C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation

It is important to define meaningful and interpretable automatic evaluation
metrics for open-domain dialog research. Standard language generation metrics
have been shown to be ineffective for dialog. This paper introduces the FED
metric (fine-grained evaluation of dialog), an automatic evaluation metric
that uses DialoGPT without any fine-tuning or supervision. It also introduces
the FED dataset, which is constructed by annotating a set of human-system and
human-human conversations with eighteen fine-grained dialog qualities. The FED
metric (1) does not rely on a ground-truth response, (2) does not require
training data and (3) measures fine-grained dialog qualities at both the turn
and whole dialog levels. FED attains moderate to strong correlation with human
judgement at both levels.
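
Agreement between an automatic metric and human judgement is often reported as a rank correlation such as Spearman's. The sketch below shows one way to compute it with the standard library only, using average ranks for ties. The abstract above does not state which correlation coefficient it uses, and the scores and ratings here are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical metric scores and human ratings for four responses.
metric_scores = [0.2, 0.9, 0.5, 0.7]
human_ratings = [2, 5, 1, 4]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

Because only the ordering matters, the metric and the human ratings do not need to share a scale.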

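This paper introduces the FED metric (fine-grained evaluation of dialog), which uses DialoGPT without any fine-tuning or supervision, and the FED dataset, which is built by annotating human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric does not rely on a ground-truth response, does not require training data, and measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgement at both levels.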