This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe imbalanced performance across languages and domains, with GPT-4's translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translations produced by GPT-4 and by human translators, and find that GPT-4 suffers from literal translations, whereas human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and to analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations.


GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels