Using large language models (LLMs) to evaluate text quality has recently
gained popularity. Several prior works explore using LLMs for
evaluation, but they differ in some details of the evaluation process. In
this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et
al., 2023), and we discuss how those details in the evaluation process change
how well the ratings given by LLMs correlate with human ratings. We find that
the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more
aligned with human ratings. We also show that forcing the LLM to output only a
numeric rating, as in G-Eval, is suboptimal. Finally, we show that asking the
LLM to explain its own ratings consistently improves the correlation between
ChatGPT ratings and human ratings and pushes the state-of-the-art (SoTA)
correlations on two meta-evaluation datasets.
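
As an illustration of what "correlation between LLM ratings and human ratings" can mean in practice, the following is a minimal sketch that computes Pearson's r and Kendall's tau-b between two lists of ratings. The function names and the example ratings are hypothetical and are not taken from the paper; the specific correlation coefficients and aggregation used in the analysis may differ.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b, which accounts for ties (common with 1-5 ratings)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Made-up ratings for six samples, for illustration only.
    human_ratings = [1, 2, 2, 4, 5, 3]
    llm_ratings = [2, 2, 3, 4, 4, 3]
    print("Pearson r:", pearson_r(human_ratings, llm_ratings))
    print("Kendall tau-b:", kendall_tau_b(human_ratings, llm_ratings))
```

A higher coefficient means the LLM's ratings track the human ratings more closely; tau-b only uses the ordering of the ratings, while Pearson's r is sensitive to their numeric spacing.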
