Recent studies emphasize the need for document context in human evaluation of
machine translations, but little research has been done on the impact of user
interfaces on annotator productivity and the reliability of assessments. In
this work, we compare human assessment data from the last two WMT evaluation
campaigns collected via two different methods for document-level evaluation.
Our analysis shows that a document-centric approach to evaluation, where the
annotator is presented with the entire document context on a screen, leads to
higher-quality segment- and document-level assessments. It improves the
correlation between segment and document scores and increases inter-annotator
agreement for document scores, but is considerably more time-consuming for
annotators.
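
One way to picture the "correlation between segment and document scores" mentioned above is to average each document's segment scores and correlate those averages with the document-level scores. The sketch below assumes Pearson's r, which the abstract does not specify, and uses made-up scores; the function names and the data layout are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def segment_document_correlation(annotations):
    """Correlate mean segment score with document score.

    annotations: list of (segment_scores, document_score) pairs,
    one pair per document (a hypothetical layout chosen for this sketch).
    """
    seg_means = [mean(segments) for segments, _ in annotations]
    doc_scores = [doc for _, doc in annotations]
    return pearson_r(seg_means, doc_scores)


# Made-up scores on a 0-100 scale, for illustration only.
toy_annotations = [
    ([80, 90, 70], 82),
    ([40, 55, 60], 50),
    ([95, 85, 90], 93),
    ([20, 35, 30], 25),
]
r = segment_document_correlation(toy_annotations)
print(f"segment/document correlation: {r:.3f}")
```

Averaging segment scores per document is only one possible aggregation; a length-weighted mean would be an equally plausible choice.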

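Studies have found that human evaluation of machine translation needs to take document context into account, yet the effect of user interfaces on annotator productivity and assessment reliability has rarely been studied. By comparing human assessment data obtained with two different methods, this paper shows that a document-centric evaluation approach improves data quality but requires a greater time investment.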

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs

Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT
evaluation. They can neither distinguish document-level improvements in
translation quality from sentence-level ones, nor identify the discourse
phenomena that cause context-agnostic translations. This paper introduces a
novel automatic metric, BlonDe, to widen the scope of automatic MT evaluation
from sentence to document level. BlonDe takes discourse coherence into
consideration by categorizing discourse-related spans and calculating the
similarity-based F1 measure of categorized spans. We conduct extensive
comparisons on a newly constructed dataset, BWB. The experimental results show
that BlonDe possesses better selectivity and interpretability at the
document level, and is more sensitive to document-level nuances. In a
large-scale human study, BlonDe also achieves significantly higher Pearson's r
correlation with human judgments compared to previous metrics.
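
To make the idea of an F1 measure over categorized spans concrete, here is a minimal sketch. It is not the BlonDe implementation: the category names, the exact-match comparison standing in for the similarity function, and the unweighted average across categories are all assumptions made for illustration.

```python
from collections import Counter


def category_f1(hyp_spans, ref_spans):
    """F1 for one category, from clipped counts of exactly matching spans."""
    hyp, ref = Counter(hyp_spans), Counter(ref_spans)
    # Counter intersection keeps the minimum count of each shared span.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def span_f1(hyp_by_category, ref_by_category):
    """Unweighted mean of per-category F1 over all categories seen."""
    categories = sorted(set(hyp_by_category) | set(ref_by_category))
    if not categories:
        return 0.0
    scores = [
        category_f1(hyp_by_category.get(c, []), ref_by_category.get(c, []))
        for c in categories
    ]
    return sum(scores) / len(scores)


# Hypothetical spans already extracted and categorized for one document.
hypothesis = {"pronoun": ["he", "he", "it"], "entity": ["Alice"]}
reference = {"pronoun": ["she", "he", "it"], "entity": ["Alice", "Bob"]}
score = span_f1(hypothesis, reference)
print(f"span F1: {score:.3f}")
```

In this toy case both categories score 2/3: the pronoun category matches two of three spans on each side, and the entity category has perfect precision but only half the recall.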

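This paper proposes a new automatic evaluation metric, BlonDe, which widens the scope of automatic MT evaluation from the sentence level to the document level by taking discourse coherence into account. The metric better distinguishes document-level improvements in translation quality from sentence-level ones, and offers better selectivity, interpretability, and sensitivity. In a large-scale human study, BlonDe also achieves a higher Pearson's r correlation than previous metrics.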

BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation

Recent research suggests that neural machine translation achieves parity with
professional human translation on the WMT Chinese–English news translation
task. We empirically test this claim with alternative evaluation protocols,
contrasting the evaluation of single sentences and entire documents. In a
pairwise ranking experiment, human raters assessing adequacy and fluency show a
stronger preference for human over machine translation when evaluating
documents as compared to isolated sentences. Our findings emphasise the need to
shift towards document-level evaluation as machine translation improves to the
degree that errors which are hard or impossible to spot at the sentence level
become decisive in discriminating the quality of different translation outputs.

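Human raters show a stronger preference for human over machine translation when evaluating entire documents than when evaluating isolated sentences, which underscores the need for machine translation evaluation to move to the document level.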