Knowledge-grounded dialogue systems powered by large language models often
generate responses that, while fluent, are not attributable to a relevant
source of information. Progress towards models that do not exhibit this issue
requires evaluation metrics that can quantify its prevalence. To this end, we
introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN),
comprising 12k dialogue turns generated by neural dialogue systems trained on
three knowledge-grounded dialogue corpora. We collect human annotations
assessing the extent to which the models' responses can be attributed to the
given background information. We then use BEGIN to analyze eight evaluation
metrics. We find that these metrics rely on spurious correlations, do not
reliably distinguish attributable abstractive responses from unattributable
ones, and perform substantially worse when the knowledge source is longer. Our
findings underscore the need for more sophisticated and robust evaluation
metrics for knowledge-grounded dialogue. We make BEGIN publicly available at
this https URL

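This study introduces the BEGIN benchmark for evaluating knowledge-grounded
dialogue systems. The benchmark consists of 12k dialogue turns and is used to
assess eight evaluation metrics. The results show that these metrics rely too
heavily on spurious correlations, are unreliable, and perform worse on longer
texts, indicating a need for more sophisticated and robust evaluation metrics.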