Using large language models (LLMs) to evaluate text quality has recently gained popularity. Several prior works explore this idea, but they differ in the details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details affect how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Finally, we show that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT and human ratings and advances the state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
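To make the notion of "correlation with human ratings" concrete, the following is a minimal sketch of how such agreement could be measured. The choice of Pearson's r and Kendall's tau-b is an assumption, since the abstract does not name a coefficient, and the rating lists and function names are hypothetical illustrations rather than data or code from the paper.

```python
import math


def pearson(x, y):
    """Pearson's r between two equal-length lists of numeric ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for tied ratings (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Hypothetical ratings on a 1-5 scale for five generated texts.
    human_ratings = [1, 2, 3, 4, 5]
    llm_ratings = [2, 2, 3, 5, 4]
    print("Pearson r:", pearson(human_ratings, llm_ratings))
    print("Kendall tau-b:", kendall_tau_b(human_ratings, llm_ratings))
```

Under this sketch, a prompting variant (for example, asking the LLM to explain its rating) would count as better aligned with humans if it yields higher coefficients on the same set of texts.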


An In-Depth Study of Large Language Models for Automatic Evaluation