Large Language Models (LLMs) have achieved remarkable results in the machine translation evaluation task, yet there remains a gap in knowledge regarding how they utilize the provided data to conduct evaluations. This study aims to explore how LLMs leverage source and reference information in evaluating translations, with the ultimate goal of better understanding the working mechanism of LLMs. To this end, we design the controlled experiments across various input modes and model types, and employ both coarse-grained and fine-grained prompts to discern the utility of source versus reference information. Surprisingly, we find that reference information significantly enhances the evaluation accuracy, while source information sometimes is counterproductive, indicating a lack of cross-lingual capability when using LLMs to evaluate translations. We further conduct a meta-evaluation for translation error detection of LLMs, observing a similar phenomenon. These findings also suggest a potential research direction for LLMs that fully exploits the cross-lingual capability of LLMs to achieve better performance in machine translation evaluation tasks.

大型语言模型在机器翻译评估任务中取得了显著的成果，然而关于它们如何利用提供的数据进行评估仍存在知识空白。本研究旨在探索大型语言模型如何利用源语言和参考信息进行评估，从而更好地理解大型语言模型的工作机制。通过设计不同的输入模式和模型类型进行受控实验，并使用粗粒度和细粒度提示来识别源语言与参考信息的有效性，我们惊讶地发现参考信息显著提高了评估准确性，而源语言信息有时会适得其反，表明在使用大型语言模型评估翻译时缺乏跨语言能力。我们还对大型语言模型的翻译错误检测进行了元评估，观察到类似的现象。这些发现也为充分利用大型语言模型的跨语言能力以在机器翻译评估任务中取得更好性能提供了潜在的研究方向。

在源语言中迷失：大型语言模型如何评估机器翻译的质量