The impressive performance of large language models (LLMs) has attracted
considerable attention from the academic and industrial communities. Beyond
constructing and training LLMs, evaluating and comparing their capabilities
effectively is widely recognized as an important yet difficult problem.
Existing paradigms rely on either human annotators or model-based evaluators
to assess LLM performance on different tasks. However, these paradigms often
suffer from high cost, low generalizability, and inherited biases in practice,
which makes them incapable of supporting the sustainable development of LLMs
in the long term. To address these issues,
inspired by the peer review systems widely used in academic publishing,
we propose a novel framework that can automatically evaluate LLMs
through a peer-review process. Specifically, for the evaluation of a specific
task, we first construct a small qualification exam to select "reviewers" from
a pool of powerful LLMs. Then, to evaluate the "submissions" written
by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to
rate or compare the submissions. The final ranking of evaluatee LLMs is
generated based on the results provided by all reviewers. We conducted
extensive experiments on text summarization tasks with eleven LLMs including
GPT-4. The results demonstrate that bias exists when a single LLM is used as
the evaluator. Moreover, our PRE model outperforms all the baselines, illustrating
the effectiveness of the peer review mechanism.

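Using a peer-review mechanism, we propose a new framework that automatically
evaluates large language models, addressing the high cost, low
generalizability, and bias of existing evaluation approaches. We conducted
extensive experiments on text summarization tasks; the results show that
evaluation with a single language model is biased and demonstrate the
effectiveness of our peer-review mechanism.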
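
The three stages described above (qualification exam, reviewing, and
aggregation) can be illustrated with a minimal Python sketch. It is not the
PRE implementation. The function names, the 0.6 pass threshold, the use of
exam accuracy as a reviewer weight, and all scores below are hypothetical
assumptions made for illustration; the abstract only states that the final
ranking is based on the results from all reviewers.

```python
# Illustrative sketch of a peer-review style evaluation loop.
# All names, thresholds, and numbers are hypothetical assumptions.

def select_reviewers(exam_accuracy, threshold=0.6):
    """Keep candidate reviewers whose qualification-exam accuracy passes.

    exam_accuracy maps reviewer name -> fraction of exam items on which the
    reviewer agreed with the reference labels (assumed to be precomputed).
    """
    return {name: acc for name, acc in exam_accuracy.items() if acc >= threshold}


def rank_evaluatees(ratings, reviewer_weights):
    """Aggregate reviewer ratings into one ranking (best first).

    ratings[reviewer][evaluatee] is the score a reviewer gave a submission.
    Scores are combined as a weighted mean over qualified reviewers only;
    weighting by exam accuracy is an assumption of this sketch.
    """
    total_weight = sum(reviewer_weights.values())
    evaluatees = set()
    for reviewer in reviewer_weights:
        evaluatees.update(ratings[reviewer])
    scores = {
        e: sum(w * ratings[r][e] for r, w in reviewer_weights.items()) / total_weight
        for e in evaluatees
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical exam results: reviewer_c fails the qualification exam.
exam_accuracy = {"reviewer_a": 0.9, "reviewer_b": 0.7, "reviewer_c": 0.4}

# Hypothetical ratings of three candidate models' summaries on a 1-5 scale.
ratings = {
    "reviewer_a": {"model_x": 4.0, "model_y": 3.0, "model_z": 2.0},
    "reviewer_b": {"model_x": 3.0, "model_y": 5.0, "model_z": 1.0},
    "reviewer_c": {"model_x": 1.0, "model_y": 1.0, "model_z": 5.0},
}

qualified = select_reviewers(exam_accuracy)
ranking = rank_evaluatees(ratings, qualified)
print(ranking)
```

In this toy example the ratings from the unqualified reviewer are ignored,
so its outlying preference for one model does not affect the final ranking.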