Recent text generation research has increasingly focused on open-ended
domains such as story and poetry generation. Because models built for such
tasks are difficult to evaluate automatically, most researchers in the space
justify their modeling choices by collecting crowdsourced human judgments of
text quality (e.g., Likert scores of coherence or grammaticality) from Amazon
Mechanical Turk (AMT). In this paper, we first conduct a survey of 45
open-ended text generation papers and find that the vast majority of them fail
to report crucial details about their AMT tasks, hindering reproducibility. We
then run a series of story evaluation experiments with both AMT workers and
English teachers and discover that even with strict qualification filters, AMT
workers (unlike teachers) fail to distinguish between model-generated text and
human-generated references. We show that AMT workers' judgments improve when the
workers are shown model-generated output alongside human-generated references,
which helps them better calibrate their ratings. Finally, interviews with
the English teachers provide deeper insights into the challenges of the
evaluation process, particularly when rating model-generated text.

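This paper surveys 45 recent papers on open-ended text generation and finds that the vast majority of them do not report key details about their Amazon Mechanical Turk tasks, which hinders reproducibility. It also runs story evaluation experiments and finds that, even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish model-generated text from human-generated reference text. The study shows that AMT workers' judgments improve when they are shown model-generated output alongside human-generated references, and interviews with the English teachers provide deeper insight into the evaluation process.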