Large language models (LLMs) are commonly used as evaluators in tasks such as reward modeling and LLM-as-a-judge, where they act as proxies for human preferences or judgments. This creates a need for meta-evaluation: assessing how reliable LLMs are as evaluators. However, existing benchmarks focus primarily on English and offer limited insight into how well LLMs evaluate in non-English contexts. To address this, we introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories. MM-Eval evaluates several dimensions, including language-specific challenges such as linguistics and language hallucinations. Evaluation results show that both proprietary and open-source language models have considerable room for improvement. Further analysis reveals a tendency for these models to assign middle-ground scores to low-resource languages. We publicly release our benchmark and code.

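This study addresses the limited effectiveness of large language models as evaluators in non-English settings by proposing MM-Eval, a multilingual evaluation benchmark that covers 18 languages and six categories. It finds that existing language models have considerable room for improvement in non-English evaluation and that they tend to assign middle-ground scores to low-resource languages.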

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models