Recently, large language model (LLM)-based preference evaluation has been
widely adopted to compare pairs of model responses. However, a severe bias
towards lengthy responses has been observed, raising concerns about the
reliability of this evaluation method. In this work, we design a series of
controlled experiments to study the major factors that affect win rate, the
metric used in LLM-based preference evaluation. We conclude that win rate
is affected by two axes of model response: desirability and information mass,
where the former is length-independent and related to trustworthiness, and the
latter is length-dependent and can be represented by conditional entropy. We
find that length impacts the existing evaluations by influencing information
mass. However, a reliable evaluation metric should not only assess content
quality but also ensure that the assessment is not confounded by extraneous
factors such as response length. Therefore, we propose a simple yet effective
adjustment, AdapAlpaca, to the existing practice of win rate measurement.
Specifically, by adjusting the lengths of reference answers to match the test
model's answers within the same interval, we debias information mass relative
to length, ensuring a fair model evaluation.
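
The length-matching idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the interval edges, the word-count length measure, the function names, and the `judge` callable (a stand-in for an LLM judge) are all assumptions made for illustration, since the abstract does not specify them.

```python
from bisect import bisect_right

# Hypothetical word-count interval edges; the abstract does not specify them.
INTERVAL_EDGES = [200, 400, 600, 800]


def length_interval(text, edges=INTERVAL_EDGES):
    """Return the index of the length interval containing the word count of text."""
    return bisect_right(edges, len(text.split()))


def pick_reference(test_answer, references_by_interval):
    """Pick the reference answer stored under the test answer's length interval.

    references_by_interval maps an interval index to a reference answer
    of that length (an assumed data layout).
    """
    return references_by_interval[length_interval(test_answer)]


def length_matched_win_rate(test_answers, reference_pools, judge):
    """Win rate of the test answers against length-matched references.

    judge(test_answer, reference) is a hypothetical callable that returns
    True when the test answer is preferred.
    """
    wins = 0
    for test_answer, pool in zip(test_answers, reference_pools):
        reference = pick_reference(test_answer, pool)
        wins += int(bool(judge(test_answer, reference)))
    return wins / len(test_answers)
```

Because each comparison pairs a test answer with a reference from the same length interval, the judge cannot prefer one side simply because it is much longer.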
