As Large Language Models (LLMs) approach human-like precision on numerous
tasks, they are increasingly used in a variety of real-world applications.
Several studies have shown that LLMs excel on many
standard NLP benchmarks. However, it is challenging to evaluate LLMs due to
test dataset contamination and the limitations of traditional metrics. Since
human evaluations are difficult to collect, there is growing interest in the
community in using LLMs themselves as reference-free evaluators for subjective
metrics. However, past work has shown that LLM-based evaluators can exhibit
bias and have poor alignment with human judgments. In this study, we propose a
framework for an end-to-end assessment of LLMs as evaluators in multilingual
scenarios. We create a carefully curated dataset covering 10 languages, with
native speaker judgments for the task of summarization. This dataset is
created specifically to evaluate LLM-based evaluators, a process we refer to
as meta-evaluation (METAL). We compare the performance of LLM-based evaluators
created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that
LLM-based evaluators based on GPT-4 perform the best across languages, while
GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the
reasoning provided by LLM-based evaluators and find that it often does not
match the reasoning provided by human judges.
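
One simple way to quantify how well an LLM-based evaluator aligns with native speaker judgments is per-language agreement with the human majority label. The sketch below is illustrative only and is not taken from the paper: the record layout, the 0-2 label scale, the tie-breaking rule, and the placeholder language names `lang_a` and `lang_b` are all assumptions, and the data is made up.

```python
from collections import Counter, defaultdict


def majority(labels):
    """Return the most common label; ties resolve to the smaller label (an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def agreement_by_language(records):
    """Fraction of items per language where the LLM label equals the human majority.

    records: iterable of (language, human_labels, llm_label) tuples (hypothetical layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, human_labels, llm_label in records:
        totals[language] += 1
        if majority(human_labels) == llm_label:
            hits[language] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up records: three human annotators and one LLM label per summary.
records = [
    ("lang_a", [2, 2, 1], 2),  # LLM matches the human majority
    ("lang_a", [0, 1, 1], 2),  # LLM scores higher than the human majority
    ("lang_b", [1, 1, 1], 1),
    ("lang_b", [2, 2, 2], 2),
]

print(agreement_by_language(records))  # {'lang_a': 0.5, 'lang_b': 1.0}
```

Raw agreement ignores chance agreement and label ordering, so a chance-corrected or rank-based statistic may be a better fit depending on the metric being judged.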

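We propose an end-to-end framework for assessing LLMs as evaluators in multilingual scenarios, and we create a carefully curated dataset for evaluating LLM-based evaluators. The dataset covers 10 languages and contains native speaker judgments for the summarization task. We compare LLM-based evaluators built on GPT-3.5-Turbo, GPT-4, and PaLM2; the results show that GPT-4-based evaluators perform best across languages, while GPT-3.5-Turbo performs poorly. In addition, we analyze the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning given by human judges.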

METAL: Towards Multilingual Meta-Evaluation

Large Language Models (LLMs) have demonstrated impressive performance on
Natural Language Processing (NLP) tasks, such as Question Answering,
Summarization, and Classification. The use of LLMs as evaluators that can rank
or score the output of other models (usually LLMs) has become increasingly
popular, due to the limitations of current evaluation techniques, including
the lack of appropriate benchmarks and metrics, cost, and limited access to
human annotators.
While LLMs are capable of handling approximately 100 languages, the majority of
languages beyond the top 20 lack systematic evaluation across various tasks,
metrics, and benchmarks. This creates an urgent need to scale up multilingual
evaluation to ensure a precise understanding of LLM performance across diverse
languages. LLM-based evaluators seem like the perfect solution to this problem,
as they do not require human annotators, human-created references, or
benchmarks and can theoretically be used to evaluate any language covered by
the LLM. In this paper, we investigate whether LLM-based evaluators can help
scale up multilingual evaluation. Specifically, we calibrate LLM-based
evaluation against 20k human judgments of five metrics across three
text-generation tasks in eight languages. Our findings indicate that LLM-based
evaluators may exhibit bias towards higher scores. They should be used with
caution and always calibrated against a dataset of native speaker judgments,
particularly in low-resource and non-Latin script languages.
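
A bias towards higher scores can be checked by comparing LLM scores with human scores on the same items. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function name `score_bias`, the paired-score layout, and the sample values are hypothetical.

```python
from statistics import mean


def score_bias(pairs):
    """Summarize how LLM scores deviate from human scores on the same items.

    pairs: list of (human_score, llm_score) tuples (hypothetical layout).
    A positive mean_diff suggests the LLM evaluator scores higher than humans.
    """
    diffs = [llm - human for human, llm in pairs]
    return {
        "mean_diff": mean(diffs),
        "llm_higher": sum(d > 0 for d in diffs) / len(diffs),
        "llm_lower": sum(d < 0 for d in diffs) / len(diffs),
    }


# Made-up paired scores on a 0-2 scale.
pairs = [(1, 2), (2, 2), (0, 2), (2, 1)]
result = score_bias(pairs)
print(result)
```

Running this per language, rather than over the pooled data, would show whether the skew is larger in low-resource or non-Latin script languages.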

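Through an evaluation of large language models, this paper finds that LLM-based evaluators may exhibit bias in multilingual evaluation and need to be calibrated with native-language datasets.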