Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

Jan, 2024

Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

Nikita Moghe, Arnisa Fazla, Chantal Amrhein, Tom Kocmi, Mark Steedman...

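TL;DR: This work introduces ACES, a contrastive challenge set spanning 146 language pairs, designed to test whether metrics can identify 68 translation accuracy errors. The authors benchmark 50 metrics from the WMT 2022 and 2023 metrics shared tasks on it, assessing their incremental performance and their sensitivity to a range of linguistic phenomena. The results show that different metric families struggle with different phenomena and that methods based on large language models do not perform reliably. ACES is also extended with error span annotations, yielding SPAN-ACES, which is used to evaluate span-based error metrics; these metrics likewise need considerable improvement. Finally, the work offers recommendations for building better machine translation metrics: focus on error labels rather than scores, ensemble multiple metrics, design strategies that explicitly attend to the source sentence, focus on semantic content, and choose a suitable base model for representations.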
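To make the idea of a contrastive challenge set concrete, here is a minimal sketch of how a metric might be scored on one. It is not the authors' evaluation code. It assumes each example pairs a good translation with an incorrect one, and that the metric is rewarded when it scores the good translation strictly higher. The `token_f1` metric, the `tau_like` helper, the choice to count ties as failures, and the three examples are all made up for illustration.

```python
from typing import Callable, List, Tuple

# (source, reference, good translation, incorrect translation).
# Hypothetical examples, not taken from ACES.
Example = Tuple[str, str, str, str]

EXAMPLES: List[Example] = [
    ("ich habe drei katzen", "i have three cats",
     "i own three cats", "i have two cats"),
    ("die besprechung beginnt mittags", "the meeting starts at noon",
     "the meeting begins at noon", "the train leaves at midnight"),
    ("sie hat den vertrag nicht unterschrieben", "she did not sign the contract",
     "she did not sign the contract", "she did sign the contract"),
]


def token_f1(candidate: str, reference: str) -> float:
    """Toy surface-overlap metric: F1 over unique lowercase tokens."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def tau_like(examples: List[Example],
             metric: Callable[[str, str], float]) -> float:
    """(concordant - discordant) / (concordant + discordant).

    An example is concordant when the metric scores the good translation
    strictly higher than the incorrect one. Ties count as discordant,
    which is an assumption of this sketch.
    """
    concordant = discordant = 0
    for _source, reference, good, incorrect in examples:
        if metric(good, reference) > metric(incorrect, reference):
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    print(f"tau-like score for token_f1: {tau_like(EXAMPLES, token_f1):.3f}")
```

On the first example the toy metric gives both translations the same score, because swapping "three" for "two" changes the token overlap exactly as much as the harmless paraphrase "own" does. This is the kind of surface-overlap blind spot such a challenge set is meant to expose. Note that this toy metric never reads the source sentence.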

Abstract

Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited