Recent years have witnessed a proliferation of large language models
(LLMs). Yet, automated and unbiased evaluation of LLMs is challenging due to
the inaccuracy of standard metrics in reflecting human preferences and the
inefficiency in sampling informative and diverse test examples. While human
evaluation remains the gold standard, it is expensive and time-consuming,
especially when dealing with a large number of testing samples. To address this
problem, we propose a sample-efficient human evaluation method based on MAximum
Discrepancy (MAD) competition. MAD automatically selects a small set of
informative and diverse instructions, each adapted to two LLMs, whose responses
are subject to three-alternative forced choice by human subjects. The pairwise
comparison results are then aggregated into a global ranking using the Elo
rating system. We select eight representative LLMs and compare them in terms of
four skills: knowledge understanding, mathematical reasoning, writing, and
coding. Experimental results show that the proposed method achieves a reliable
and sensible ranking of LLMs' capabilities, identifies their relative strengths
and weaknesses, and offers valuable insights for further LLM advancement.
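
The pipeline described above (select the instructions on which two models disagree most, collect three-way human judgments, aggregate them with Elo) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The token-overlap discrepancy measure, the function names, the K-factor of 16, the initial rating of 1000, and the scoring of a tie as 0.5 are all hypothetical choices made for the example. The sketch also ignores the diversity constraint mentioned in the abstract.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def jaccard_discrepancy(resp_a: str, resp_b: str) -> float:
    """Placeholder discrepancy (an assumption): 1 - Jaccard similarity of token sets."""
    a, b = set(resp_a.lower().split()), set(resp_b.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def select_top_k(
    responses: Dict[str, Tuple[str, str]],
    k: int,
    discrepancy: Callable[[str, str], float] = jaccard_discrepancy,
) -> List[str]:
    """Return the k instructions whose two responses differ the most."""
    ranked = sorted(
        responses, key=lambda ins: discrepancy(*responses[ins]), reverse=True
    )
    return ranked[:k]


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(
    comparisons: Iterable[Tuple[str, str, float]],
    k: float = 16.0,        # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Aggregate (model_a, model_b, score_a) judgments into Elo ratings.

    score_a is 1.0 if humans preferred model_a, 0.0 if they preferred
    model_b, and 0.5 for a tie (the third alternative).
    """
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in comparisons:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[model_a] = r_a + delta
        ratings[model_b] = r_b - delta  # zero-sum update
    return ratings
```

For example, `select_top_k({"q1": ("the cat sat", "the cat sat"), "q2": ("yes", "no way")}, k=1)` keeps only `"q2"`, because identical responses carry no information about which model is better. The selected pairs would then go to human raters, and their choices feed `elo_ratings`.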

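We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition for assessing the capabilities and the relative strengths and weaknesses of large language models across four skills (knowledge understanding, mathematical reasoning, writing, and coding), offering valuable insights for further research and development.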

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

In Natural Language Processing (NLP), the Elo rating system, originally
designed for ranking players in dynamic games such as chess, is increasingly
being used to evaluate Large Language Models (LLMs) through "A vs B" paired
comparisons. However, while popular, the system's suitability for assessing
entities with constant skill levels, such as LLMs, remains relatively
unexplored. We study two fundamental axioms that evaluation methods should
adhere to: reliability and transitivity. We conduct an extensive evaluation of
Elo behaviour, showing that individual Elo computations are volatile and
examining the impact of varying the Elo rating system's hyperparameters. We
show that these axioms are not always satisfied, raising questions about the
reliability of current comparative evaluations of LLMs. If the current use of
Elo scores is intended to substitute for the costly head-to-head comparison of
LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by
the axioms, our findings offer concrete guidelines for enhancing the
reliability of LLM evaluation methods, suggesting a need for reassessment of
existing comparative approaches.
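
The volatility of individual Elo computations can be illustrated with a small sketch. Because each Elo update depends on the current ratings, the same set of match outcomes processed in a different order generally produces different final ratings. The code below replays a fixed set of outcomes under many random orderings and reports the mean and spread of each model's rating. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the K-factor of 16, the initial rating of 1000, and the number of permutations are hypothetical choices.

```python
import random
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

Match = Tuple[str, str, float]  # (model_a, model_b, score_a in {0, 0.5, 1})


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(
    matches: Sequence[Match], k: float = 16.0, initial: float = 1000.0
) -> Dict[str, float]:
    """Process the matches once, in the given order, and return final ratings."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in matches:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[model_a] = r_a + delta
        ratings[model_b] = r_b - delta
    return ratings


def elo_over_permutations(
    matches: Sequence[Match], n_perm: int = 100, k: float = 16.0, seed: int = 0
) -> Dict[str, Tuple[float, float]]:
    """Return {model: (mean rating, std of rating)} over shuffled match orders."""
    rng = random.Random(seed)
    samples: Dict[str, List[float]] = {}
    for _ in range(n_perm):
        order = list(matches)
        rng.shuffle(order)  # same outcomes, different sequence
        for model, rating in run_elo(order, k=k).items():
            samples.setdefault(model, []).append(rating)
    return {m: (mean(v), pstdev(v)) for m, v in samples.items()}
```

A nonzero standard deviation here comes purely from match ordering, since the outcomes themselves are held fixed. Rerunning with a different `k` gives a rough sense of how the K-factor hyperparameter affects that spread.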

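In natural language processing (NLP), the Elo rating system is used to evaluate large language models (LLMs), yet its suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. This paper studies two fundamental axioms that evaluation methods should follow, reliability and transitivity. Through an extensive evaluation of Elo behaviour, it illustrates the volatility of individual Elo computations and examines the impact of varying the Elo rating system's hyperparameters. We find that these axioms are not always satisfied, which raises questions about the reliability of current comparative evaluations of LLMs. If Elo scores are used to replace costly comparisons of LLMs, it is crucial to ensure that the ranking is as robust as possible. Guided by these axioms, our findings offer concrete guidance for improving LLM evaluation methods and suggest that existing comparative approaches need to be reassessed.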