LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. They are particularly effective at evaluating long-form responses, and they serve a critical role as evaluators for leaderboards and as proxies for aligning LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators and report key findings on their behavior in a non-English environment. First, we discover that English evaluation capabilities significantly influence language-specific capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset, containing 5,012 human annotations in Korean.

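This study examines the effectiveness of LLM-as-a-Judge and reward models in non-English environments, filling a gap in existing research. We find that English evaluation capability often influences language-specific capability more than language proficiency itself does, and that LLMs show significant shortcomings in detecting and penalizing factual errors and cultural misrepresentations. In addition, this paper releases Kudge, a non-English meta-evaluation dataset containing 5,012 human annotations in Korean.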

LLM-as-a-Judge and Reward Models: What They Can and Cannot Do