While dialogue remains an important end-goal of natural language research,
the difficulty of evaluating it is an oft-cited reason why real progress
towards that goal remains hard to make. The difficulty is two-fold: not only
do automatic metrics correlate poorly with human judgments, but human
judgments themselves are difficult to measure. The two most widely used human
judgment tests, single-turn pairwise evaluation and multi-turn Likert scores,
both have serious flaws, as we discuss in this work.
We instead introduce a novel procedure that compares two full dialogues: a
human judge is asked to pay attention to only one speaker within each and to
make a pairwise judgment. The questions themselves are optimized to maximize
the robustness of judgments across different annotators, resulting in better
tests. We also show how these tests work in self-play model chat setups, which
makes them faster and cheaper to run. We hope these tests become the de facto
standard, and will release open-source code to that end.
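
As a rough illustration of how pairwise judgments of this kind might be
aggregated, the sketch below computes a win rate for one model and an exact
two-sided binomial test against the null hypothesis that both models are
preferred equally often. The function names, the "A"/"B" labels, and the choice
of a binomial test are assumptions made for this example; they are not taken
from the released code.

```python
from math import comb


def win_rate(judgments, model="A"):
    """Fraction of pairwise judgments in which `model` was preferred.

    `judgments` is a list of labels ("A" or "B" here, a hypothetical
    encoding), one per annotator decision.
    """
    wins = sum(1 for j in judgments if j == model)
    return wins / len(judgments)


def binomial_two_sided_p(wins, n):
    """Exact two-sided binomial test against p = 0.5.

    Sums the probability of every outcome that is at most as likely
    as the observed number of wins under the null hypothesis.
    """
    probs = [comb(n, k) / 2**n for k in range(n + 1)]
    observed = probs[wins]
    # Small tolerance so that symmetric outcomes are not lost to rounding.
    return min(1.0, sum(p for p in probs if p <= observed + 1e-12))


if __name__ == "__main__":
    # Hypothetical data: 8 of 10 annotators preferred model A.
    judgments = ["A"] * 8 + ["B"] * 2
    print(win_rate(judgments))
    print(binomial_two_sided_p(judgments.count("A"), len(judgments)))
```

With only ten judgments, an 8-to-2 split is not significant at the 0.05 level
under this test, which hints at why the number of annotations per model pair
matters for the cost of an evaluation.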

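This study proposes an evaluation procedure based on self-chat models, aiming
to find a rating test that is more robust across different annotators.
Experiments show that, under this scheme, a new test standard can be rolled
out faster and more cheaply, and open-source code can be released.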