Recent years have witnessed a proliferation of large language models
(LLMs). Yet automated and unbiased evaluation of LLMs remains challenging:
standard metrics reflect human preferences inaccurately, and informative,
diverse test examples are difficult to sample efficiently. While human
evaluation remains the gold standard, it is expensive and time-consuming,
especially when dealing with a large number of test samples. To address this
problem, we propose a sample-efficient human evaluation method based on MAximum
Discrepancy (MAD) competition. MAD automatically selects a small set of
informative and diverse instructions, each adapted to a pair of LLMs, whose responses
are then judged by human subjects in a three-alternative forced choice. The pairwise
comparison results are then aggregated into a global ranking using the Elo
rating system. We select eight representative LLMs and compare them in terms of
four skills: knowledge understanding, mathematical reasoning, writing, and
coding. Experimental results show that the proposed method achieves a reliable
and sensible ranking of LLMs' capabilities, identifies their relative strengths
and weaknesses, and offers valuable insights for further LLM advancement.
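
The aggregation of pairwise comparison results into a global ranking can be illustrated with a minimal sketch of the standard Elo update. This is not the authors' implementation. The K-factor of 32, the initial rating of 1000, the scoring of the third alternative as a tie worth 0.5, and all function and model names are assumptions made for illustration only.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (logistic curve, base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update two ratings in place after one human judgment.

    outcome is "a" (A preferred), "b" (B preferred), or "tie".
    Treating the third alternative as a tie worth 0.5 is an assumption.
    """
    score_a = {"a": 1.0, "tie": 0.5, "b": 0.0}[outcome]
    delta = k * (score_a - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses exactly what A gains


def rank_models(comparisons, initial=1000.0, k=32.0):
    """Aggregate (model_a, model_b, outcome) triples into a ranking, best first."""
    ratings = {}
    for model_a, model_b, outcome in comparisons:
        ratings.setdefault(model_a, initial)
        ratings.setdefault(model_b, initial)
        update_elo(ratings, model_a, model_b, outcome, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judgments on three placeholder models.
comparisons = [("m1", "m2", "a"), ("m2", "m3", "tie"), ("m1", "m3", "a")]
ranking = rank_models(comparisons)
```

Because sequential Elo updates depend on the order of comparisons, a practical setup might average ratings over several random orderings of the same judgments; that is a common practice rather than something stated in the abstract.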

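Summary: We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition for assessing the capabilities and relative strengths and weaknesses of large language models across four skills (knowledge understanding, mathematical reasoning, writing, and coding), offering valuable insights for further research and development.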