Since the natural language processing (NLP) community started to make large language models (LLMs), such as GPT-4, act as a critic to evaluate the quality of generated texts, most of them only train a critique generation model of a specific scale on specific datasets. We argue that a comprehensive investigation on the key factor of LLM-based evaluation models, such as scaling properties, is lacking, so that it is still inconclusive whether these models have potential to replace GPT-4's evaluation in practical scenarios. In this paper, we propose a new critique generation model called CritiqueLLM, which includes a dialogue-based prompting method for high-quality referenced / reference-free evaluation data. Experimental results show that our model can achieve comparable evaluation performance to GPT-4 especially in system-level correlations, and even outperform GPT-4 in 3 out of 8 tasks in a challenging reference-free setting. We conduct detailed analysis to show promising scaling properties of our model in the quality of generated critiques. We also demonstrate that our generated critiques can act as scalable feedback to directly improve the generation quality of LLMs.

自然语言处理社区开始让大规模语言模型（如GPT-4）扮演批评家以评估生成文本质量，大部分仅在特定数据集上训练特定规模的批判生成模型，我们认为缺乏对于基于语言模型评估模型的关键因素（如可扩展性特性）的全面调查，因此目前是否有潜力在实际场景中取代GPT-4的评估仍然没有结论；在本文中，我们提出了一种名为CritiqueLLM的新型批判生成模型，采用基于对话的提示方法用于高质量的参考/无参考评估数据，实验结果表明，我们的模型在评估性能上可以与GPT-4相媲美，尤其在系统级相关性上，甚至在具有挑战性的无参考环境中，在8个任务中有3个胜过GPT-4；我们进行详细分析以展示我们模型在生成批评质量方面的可扩展性特性，同时证明我们生成的批评可以作为可扩展反馈，直接提高LLM的生成质量。

CritiqueLLM: 扩展LLM-as-Critic以有效且可解释地评估大型语言模型生成