We propose a new benchmark, ComperDial, which facilitates the training and
evaluation of evaluation metrics for open-domain dialogue systems. ComperDial
consists of human-scored responses for 10,395 dialogue turns in 1,485
conversations collected from 99 dialogue agents submitted to the Commonsense
Persona-grounded Dialogue (CPD) challenge. As a result, for any dialogue, our
benchmark includes multiple diverse responses with a variety of characteristics
to ensure more robust evaluation of learned dialogue metrics. In addition to
single-turn response scores, ComperDial also contains dialogue-level
human-annotated scores, enabling joint assessment of multi-turn model responses
throughout a dialogue. Finally, building on ComperDial, we devise a new
automatic evaluation metric to measure the general similarity of
model-generated dialogues to human conversations. Our experimental results
demonstrate that our novel metric, CPDScore, correlates more strongly with human
judgments than existing metrics. We release both ComperDial and CPDScore to the
community to accelerate development of automatic evaluation metrics for
open-domain dialogue systems.

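In summary: we propose a new benchmark, ComperDial, to support the training and
evaluation of evaluation metrics for open-domain dialogue systems. ComperDial
comprises human-scored responses for 10,395 dialogue turns in 1,485
conversations from 99 dialogue agents. Beyond single-turn scores, it also
includes human-annotated scores for entire dialogues. Using ComperDial, we
develop a new automatic evaluation metric, CPDScore, which our experiments show
correlates more strongly with human judgments. We release ComperDial and
CPDScore to the community to accelerate the development of automatic evaluation
metrics for open-domain dialogue systems.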