We investigate evaluation metrics for end-to-end dialogue systems where supervised labels, such as task completion, are not available. Recent works in end-to-end dialogue systems have adopted metrics from machine translation and text summarization to compare a model's generated response to a single target response. We show that these metrics correlate very weakly or not at all with human judgements of the response quality in both technical and non-technical domains. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

本文研究对话响应生成系统的评估指标，其中没有可用的监督标签。最近，对话响应生成的研究采用了机器翻译的指标来比较模型生成的响应和单个目标响应。我们展示了这些指标与非技术Twitter领域中的人类判断之间的关系非常弱，而在技术Ubuntu领域中根本没有。我们提供了定量和定性结果，突出了现有指标的特定弱点，并提供了未来开发更好的自动评估指标的建议。

如何不评估您的对话系统：对话响应生成任务无监督评估指标的实证研究