Evaluation of multilingual Large Language Models (LLMs) is challenging due to
a variety of factors: the lack of benchmarks with sufficient linguistic
diversity, contamination of LLM pre-training data by popular benchmarks, and
the lack of local, cultural nuances in translated benchmarks. In this work, we
study human and LLM-based evaluation in a multilingual, multi-cultural setting.
We evaluate 30 models across 10 Indic languages by conducting 90K human
evaluations and 30K LLM-based evaluations and find that models such as GPT-4o
and Llama-3 70B consistently perform best for most Indic languages. We build
leaderboards for two evaluation settings, pairwise comparison and direct
assessment, and analyse the agreement between humans and LLMs. We find that
humans and LLMs agree fairly well in the pairwise setting but the agreement
drops for direct assessment, especially for languages such as Bengali
and Odia. We also check for various biases in human and LLM-based evaluation
and find evidence of self-bias in the GPT-based evaluator. Our work presents a
significant step towards scaling up multilingual evaluation of LLMs.

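This study evaluates multilingual large language models and finds that GPT-4o and Llama-3 70B perform best for most Indic languages. It builds leaderboards for two evaluation settings and analyses the agreement between human and LLM-based evaluation: agreement is fairly high in the pairwise comparison setting but drops for direct assessment, especially for languages such as Bengali and Odia. It also checks for various biases in human and LLM-based evaluation and finds self-bias in the GPT-based evaluator. The work is a significant step for the evaluation of multilingual LLMs.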

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
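
As a rough illustration of how human-LLM agreement in the pairwise setting might be quantified, the Python sketch below computes percent agreement and Cohen's kappa over paired verdicts. The abstract does not state which agreement statistic was used, so the choice of metric, the function names, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two evaluators give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two evaluators (Cohen's kappa)."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both evaluators labelled independently
    # according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both evaluators always use the same single label
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical pairwise verdicts: which response won, or a tie.
human = ["A", "A", "B", "tie", "B", "A"]
llm = ["A", "B", "B", "tie", "B", "A"]

print(percent_agreement(human, llm))  # 5/6, about 0.833
print(cohens_kappa(human, llm))       # 17/23, about 0.739
```

Kappa is lower than raw agreement here because part of the observed agreement would be expected by chance given how often each label is used.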

Proprietary LMs such as GPT-4 are often employed to assess the quality of
responses from various LMs. However, concerns including transparency,
controllability, and affordability strongly motivate the development of
open-source LMs specialized in evaluations. On the other hand, existing open
evaluator LMs exhibit critical shortcomings: 1) they issue scores that
significantly diverge from those assigned by humans, and 2) they lack the
flexibility to perform both direct assessment and pairwise ranking, the two
most prevalent forms of assessment. Additionally, they do not possess the
ability to evaluate based on custom evaluation criteria, focusing instead on
general attributes like helpfulness and harmlessness. To address these issues,
we introduce Prometheus 2, a more powerful evaluator LM than its predecessor
that closely mirrors human and GPT-4 judgements. Moreover, it is capable of
processing both direct assessment and pairwise ranking formats paired with
user-defined evaluation criteria. On four direct assessment benchmarks and four
pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and
agreement with humans and proprietary LM judges among all tested open evaluator
LMs. Our models, code, and data are all publicly available at
this https URL

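By introducing Prometheus 2, a more powerful evaluator language model, the authors address the shortcomings of existing open-source evaluator LMs and achieve the highest correlation and agreement with human and proprietary LM judgements among the open evaluator LMs tested.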

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

In MT evaluation, pairwise comparisons are conducted to identify the better
system. In conducting the comparison, the experimenter must allocate a budget
to collect Direct Assessment (DA) judgments. We provide a cost-effective way to
spend the budget, but show that typical budget sizes often do not allow for
solid comparison. Taking the perspective that the basis of solid comparison is
in achieving statistical significance, we study the power (rate of achieving
significance) on a large collection of pairwise DA comparisons. Due to the
nature of statistical estimation, power is low for differences of less than
1-2 DA points, and achieving a notable increase in power requires at least
2-3x more samples. Applying variance reduction alone will not yield these
gains, so we must face the reality of undetectable differences and spending
increases. In this context, we propose interim testing, an "early stopping"
collection procedure that yields more power per judgment collected, which
adaptively focuses the budget on pairs that are borderline significant. Interim
testing can achieve up to a 27% efficiency gain when spending 3x the current
budget, or 18% savings at the current evaluation power.

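This paper proposes interim testing, an early-stopping procedure for collecting judgments that spends a limited budget more effectively and yields more statistical power per judgment collected, which is useful for machine translation evaluation.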

Searching for a higher power in the human evaluation of MT
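
To make the link between sample size and power concrete, the sketch below uses the textbook normal approximation for a two-sided test on the mean score difference between two systems. It is not the authors' interim testing procedure; the function name, the standard deviation of 25 DA points, and the sample sizes are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sd, n, alpha=0.05):
    """Approximate power of a two-sided z-test on n score differences.

    delta: true mean difference between the two systems (DA points)
    sd:    standard deviation of the per-judgment differences (assumed known)
    n:     number of judgments collected
    """
    if sd <= 0 or n <= 0:
        raise ValueError("sd and n must be positive")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) * sqrt(n) / sd
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical setting: a 1-point true difference with sd = 25.
for n in (1500, 3000, 4500):
    print(n, round(approx_power(delta=1.0, sd=25.0, n=n), 3))
```

Under these assumed numbers, power grows only with the square root of the sample size, which is consistent with the abstract's point that small differences need several times more judgments to detect reliably.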

We describe a novel method for efficiently eliciting scalar annotations for
dataset construction and system quality estimation by human judgments. We
contrast direct assessment (annotators assign scores to items directly), online
pairwise ranking aggregation (scores derive from annotator comparison of
items), and a hybrid approach (EASL: Efficient Annotation of Scalar Labels)
proposed here. Our proposal leads to increased correlation with ground truth,
at far greater annotator efficiency, suggesting this strategy as an improved
mechanism for dataset creation and manual system evaluation.

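This paper proposes an efficient method for eliciting scalar annotations from human judgments for dataset construction and system quality estimation. It contrasts direct assessment, online pairwise ranking aggregation, and a hybrid approach (EASL), and reports that the hybrid approach improves correlation with ground truth while requiring far less annotator effort.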
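
The two baseline protocols contrasted in the EASL abstract can be sketched in a few lines of Python. This is only a simplified illustration: the pairwise side uses a plain win rate rather than any online ranking model, EASL itself is not implemented because the abstract does not describe its mechanics, and all function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def direct_assessment_scores(ratings):
    """ratings: iterable of (item, score). Returns the mean score per item."""
    by_item = defaultdict(list)
    for item, score in ratings:
        by_item[item].append(score)
    return {item: mean(scores) for item, scores in by_item.items()}


def pairwise_win_rates(comparisons):
    """comparisons: iterable of (winner, loser).

    Returns, for each item, the fraction of its comparisons that it won.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {item: wins[item] / total[item] for item in total}


# Direct assessment: annotators score items on an absolute scale.
print(direct_assessment_scores([("a", 80), ("a", 60), ("b", 40)]))

# Pairwise ranking: annotators only say which of two items is better.
print(pairwise_win_rates([("a", "b"), ("a", "c"), ("b", "c")]))
```

Direct assessment yields a scalar per item from few judgments but depends on annotators using the scale consistently, whereas pairwise comparison yields only relative standing and needs more judgments to cover all items.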