Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
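As a rough illustration of the idea described above, the following minimal sketch shows how a prompt asking a model to identify and categorize errors might be assembled, and how the returned error list could be reduced to a single score. It is not the paper's implementation: the prompt wording, the `severity | category | span` response format, the helper names (`build_prompt`, `parse_errors`, `mqm_score`), and the severity weights (5 for major, 1 for minor, a common MQM-style convention) are all assumptions made for illustration, and no real LLM API is called.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed severity weights (common MQM-style convention, not taken from the abstract).
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}


@dataclass
class ErrorSpan:
    severity: str  # "major" or "minor"
    category: str  # e.g. "accuracy/mistranslation"
    span: str      # the erroneous text in the translation


def build_prompt(source: str, translation: str,
                 examples: List[Tuple[str, str, str]]) -> str:
    """Assemble a few-shot prompt; each example is (source, translation, errors)."""
    parts = [
        "Identify the errors in each translation, one per line, as "
        "'severity | category | span'. Write 'no-error' if there are none."
    ]
    for ex_src, ex_tgt, ex_errors in examples:
        parts.append(f"Source: {ex_src}\nTranslation: {ex_tgt}\nErrors:\n{ex_errors}")
    # The final block is left open for the model to complete.
    parts.append(f"Source: {source}\nTranslation: {translation}\nErrors:")
    return "\n\n".join(parts)


def parse_errors(response: str) -> List[ErrorSpan]:
    """Parse the assumed 'severity | category | span' lines; skip anything else."""
    errors = []
    for line in response.strip().splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3 or fields[0].lower() not in SEVERITY_WEIGHTS:
            continue  # covers 'no-error' and malformed lines
        errors.append(ErrorSpan(fields[0].lower(), fields[1], fields[2]))
    return errors


def mqm_score(errors: List[ErrorSpan]) -> float:
    """Sum the severity penalties; a higher (less negative) score is better."""
    return -sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
```

In use, the string returned by `build_prompt` would be sent to a model, and the model's completion passed to `parse_errors` and then `mqm_score`. The parsed spans are what would make the resulting score interpretable, since each penalty can be traced back to a specific piece of the translation.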

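Automatic evaluation of machine translation is a key tool driving the rapid iterative development of MT systems. Going beyond existing metrics that produce a single score, this paper proposes AutoMQM, a prompting technique that uses the reasoning and in-context learning capabilities of large language models to identify and categorize translation errors. The authors first evaluate recent large language models, PaLM and PaLM-2, with simple score prediction prompting. They then find that AutoMQM with PaLM-2 models outperforms prompting for scores alone, while providing interpretability through error spans that align with human annotations.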

The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation