With the rapid development of large language models (LLMs), evaluating them efficiently has become an important research question. Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases. To address these limitations, we introduce Auto-PRE, an automatic LLM evaluation framework based on peer review. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluator LLMs automatically based on their inherent traits, including consistency, self-confidence, and pertinence. We conduct extensive experiments on three tasks: summary generation, non-factoid question answering, and dialogue generation. Experimental results indicate that Auto-PRE achieves state-of-the-art performance at a lower cost. Moreover, our study highlights the impact of prompt strategies and evaluation formats on evaluation performance, offering guidance for future method optimization.

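This study addresses the high cost and systematic bias of evaluating large language models (LLMs) by proposing Auto-PRE, an automatic evaluation framework based on peer review. Experiments on three tasks show that Auto-PRE achieves state-of-the-art evaluation performance at a lower cost. The results also highlight the impact of prompt strategies and evaluation formats on evaluation performance, offering guidance for future method optimization.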

An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation