As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is equally as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.

机器翻译中，自动评估指标在评分更长的翻译文本方面的有效性仍不清楚。本文提出了一种通过现有句子级数据创建段落级数据用于训练和元评估指标的方法，并利用这些新数据集对现有句子级指标进行基准测试，以及在段落级训练学习指标。有趣的是，我们的实验结果表明，使用句子级指标评分整个段落与使用专为段落级工作的指标同样有效。我们推测这一结果可能归因于基于参考的评估任务的特性以及数据集在捕捉段落级翻译中发生的各种现象方面的局限性。

在段落级别上训练和元评估机器翻译评估指标