Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.

研究机器翻译质量评估的难点在于缺乏标准程序及评估方法的计量问题。本研究提出一套基于明示错误分析及MQM框架的评估方法，并应用于WMT 2020挑战赛的两个语言对中来自高水平机器翻译模型的输出进行评估。评估结果与WMT众包评估结果不同，人工翻译的结果被明显偏爱，但自动评估指标基于预训练嵌入的表现也足以胜过人工众包评估，为今后的研究提供公共语料库。

专家、误差与上下文: 人工评估机器翻译的大规模研究