Existing evaluation methods for large language models (LLMs) typically test
performance on closed-environment, domain-specific benchmarks that rely on
human annotations. In this paper, we explore a novel
unsupervised evaluation direction, utilizing peer-review mechanisms to measure
LLMs automatically. In this setting, open-source and closed-source LLMs are
placed in the same environment, where they answer unlabeled questions and
evaluate each other, and each LLM's response score is jointly determined by
the other, anonymous LLMs. To obtain the ability hierarchy among these models, we
assign each LLM a learnable capability parameter to adjust the final ranking.
We formalize this as a constrained optimization problem that maximizes the
consistency between each LLM's capability and its score. The key assumption
is that a higher-level LLM evaluates others' answers more accurately than a
lower-level one, and also achieves higher response scores.
Moreover, we propose three metrics, PEN, CIN, and LIS, to measure how far the
resulting ranking deviates from human rankings. Experiments on multiple
datasets with these metrics validate the effectiveness of the proposed approach.
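
To make the consistency idea concrete, the following is a minimal sketch, not the optimizer used in the paper. It assumes a hypothetical matrix `reviews[i][j]` holding the average score reviewer `i` gives to model `j`, and it swaps the constrained optimization for a simple fixed-point iteration: each model's capability weight is repeatedly set to its normalized response score, which pushes the weights and the scores toward agreement. All names and the toy data are illustrative assumptions.

```python
from typing import List, Tuple


def weighted_scores(weights: List[float], reviews: List[List[float]]) -> List[float]:
    """Score of model j: capability-weighted average of the other reviewers' scores."""
    n = len(weights)
    scores = []
    for j in range(n):
        num = sum(weights[i] * reviews[i][j] for i in range(n) if i != j)
        den = sum(weights[i] for i in range(n) if i != j)
        scores.append(num / den)
    return scores


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation, used here as the consistency measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def fit_capabilities(reviews: List[List[float]], iters: int = 50) -> Tuple[List[float], List[float]]:
    """Fixed-point iteration (an assumption of this sketch, not the paper's method).

    Weights stay non-negative and sum to 1, mirroring a simple constraint set.
    """
    n = len(reviews)
    weights = [1.0 / n] * n  # start with every reviewer equally trusted
    for _ in range(iters):
        scores = weighted_scores(weights, reviews)
        total = sum(scores)
        weights = [s / total for s in scores]  # capability follows response score
    return weights, weighted_scores(weights, reviews)


# Toy data (hypothetical): three models; diagonal entries are ignored (no self-review).
reviews = [
    [0.0, 0.6, 0.2],
    [0.9, 0.0, 0.3],
    [0.7, 0.5, 0.0],
]
weights, scores = fit_capabilities(reviews)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
consistency = pearson(weights, scores)
```

Here `ranking` orders the models by their final weighted score and `consistency` reports how closely the learned capabilities track those scores; a real implementation would replace the iteration with the paper's constrained optimization.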

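By using a peer-review mechanism to automatically measure the capabilities of large language models and evaluate their performance, we propose a novel unsupervised evaluation method. It assigns each language model a learnable capability parameter to adjust the final ranking, maximizing the consistency between each model's capability and its score, and uses three metrics, PEN, CIN, and LIS, to evaluate the gap in agreement with human ratings. Experiments demonstrate the effectiveness of the method.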