At the heart of improving conversational AI is the open problem of how to
evaluate conversations. Issues with automatic metrics are well known (Liu et
al., 2016, arXiv:1603.08023), with human evaluations still considered the gold
standard. Unfortunately, how to perform human evaluations is also an open
problem: differing data collection methods have varying levels of human
agreement and statistical sensitivity, resulting in differing amounts of human
annotation hours and labor costs. In this work we compare five different
crowdworker-based human evaluation methods and find that different methods are
best depending on the types of models compared, with no clear winner across the
board. While this highlights the open problems in the area, our analysis leads
to advice of when to use which one, and possible future directions.

本文研究了如何有效地评估对话系统的性能，发现人工评估是最好的方法，但人工评估方法的不同会导致不同的数量的人工注释和劳动成本，因此我们比较了五种不同的众包工人评估方法，发现不同的方法适用于不同类型的模型比较，建议在何时采用哪种方法，以及未来的研究方向。