Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering both HRLs and LRLs across diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). For LRLs, however, we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
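
To make the evaluation metric concrete, the following is a minimal sketch of how MCC could be computed for a binary hallucination detector that thresholds the cosine similarity between source and translation embeddings. It is an illustration only, not the procedure used in the study: the function names, the toy two-dimensional vectors, and the threshold value of 0.5 are all hypothetical, and a real setup would obtain the embeddings from a massively multilingual sentence encoder.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = hallucination)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def detect_hallucinations(src_embs, mt_embs, threshold=0.5):
    """Flag a translation as a hallucination (1) when its embedding is
    dissimilar to the source embedding. The threshold is a hypothetical
    value that would normally be tuned on held-out data."""
    return [1 if cosine(s, m) < threshold else 0
            for s, m in zip(src_embs, mt_embs)]


# Toy stand-ins for sentence embeddings (assumed, not real encoder output).
src_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mt_embs = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.9]]
gold = [0, 1, 0]

preds = detect_hallucinations(src_embs, mt_embs)
score = mcc(gold, preds)
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a reasonable choice when hallucinations are rare relative to correct translations.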

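This study addresses the key problem of hallucination detection in machine translation, particularly the significant challenges it poses for low-resource languages. By evaluating large language models and semantic similarity, the study finds that model choice has a significant impact on performance: Llama3-70B outperforms the previous state of the art on high-resource languages, while Claude Sonnet outperforms other models on low-resource languages, offering new insights into the reliability of machine translation.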

Machine Translation Hallucination Detection for Low- and High-Resource Languages Using Large Language Models