This work presents our efforts to reproduce the human evaluation experiment reported by Vamvas and Sennrich (2022), which evaluated an automatic system for detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we encountered several problems in reproducing the exact experimental setup; we discuss them here and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases we observed statistically significant differences, suggesting high variability in human annotation.

With a Little Help from the Authors: Reproducing Human Evaluation of a Machine Translation Error Detector