We investigate how machine translation (MT) evaluation metrics perform on adversarially synthesized texts, to shed light on metric robustness. We experiment with word- and character-level attacks on three popular MT metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially degraded translations. We also identify inconsistencies in BERTScore ratings: it judges the original translation and its adversarially degraded version as similar, yet rates the degraded translation as notably worse than the original with respect to the reference. Finally, we identify patterns of brittleness that motivate more robust metric development.
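As a rough illustration of the two ideas in the abstract, the sketch below shows one possible character-level perturbation (swapping adjacent inner characters) and a check for the kind of inconsistency described for BERTScore. Everything here is an assumption made for illustration: the specific attack, the function names, the `metric(candidate, reference)` callable, and both thresholds are hypothetical and are not taken from the paper.

```python
import random


def swap_adjacent_chars(word, rng):
    """Swap one random pair of adjacent inner characters.

    Words shorter than 4 characters are returned unchanged. The first and
    last characters are never moved. (Hypothetical attack for illustration.)
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # i and i + 1 are both inner positions
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def char_level_attack(sentence, rate=0.3, seed=0):
    """Perturb each whitespace-separated word with probability `rate`."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(
        swap_adjacent_chars(w, rng) if rng.random() < rate else w for w in words
    )


def is_inconsistent(metric, original, degraded, reference,
                    sim_threshold, drop_threshold):
    """Flag the pattern described in the abstract.

    `metric(candidate, reference)` is an assumed callable returning a score
    where higher means more similar. The case is flagged when the metric
    rates the degraded text as similar to the original, yet scores it much
    lower than the original against the reference. Both thresholds are
    assumptions that would need tuning for a real metric.
    """
    similar = metric(degraded, original) >= sim_threshold
    drop = metric(original, reference) - metric(degraded, reference)
    return similar and drop >= drop_threshold
```

In practice, `metric` would wrap a real scorer such as BERTScore, BLEURT, or COMET; the wrapper is left out here because its exact form depends on the library version used.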

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks