Human evaluation of generated language through pairwise preference judgments is pervasive. However, in common scenarios, such as when the generations from a model pair are very similar or when stochastic decoding produces large variation across generations, these judgments yield inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability offers insight into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into Elo ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference evaluation of LLMs with both human- and auto-raters.
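
As a rough illustration of the idea, the sketch below scores one test instance by comparing how similar each model's sampled generations are to one another against how similar they are across the two models, and then scales an Elo update by that score. This is a minimal sketch under stated assumptions, not the definition used in this work: the token-level Jaccard similarity, the `max(self) - cross` combination, and the multiplicative weighting of the Elo K-factor are all illustrative choices, and the function names are hypothetical.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (an assumed, deliberately simple similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def separability(gens_a, gens_b, sim=jaccard) -> float:
    """Assumed score: within-model similarity minus cross-model similarity.

    High values mean the two sets of generations are easy to tell apart;
    values near zero mean the models look alike on this instance.
    """
    self_a = mean(sim(x, y) for x, y in combinations(gens_a, 2))
    self_b = mean(sim(x, y) for x, y in combinations(gens_b, 2))
    cross = mean(sim(x, y) for x, y in product(gens_a, gens_b))
    return max(self_a, self_b) - cross


def weighted_elo_update(r_a, r_b, score_a, weight, k=32.0):
    """Standard Elo update with the step size scaled by an instance weight.

    Scaling K by separability is an assumption made for illustration.
    score_a is 1.0 if model A is preferred, 0.0 if B is, 0.5 for a tie.
    """
    weight = max(0.0, weight)  # ignore instances with non-positive separability
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * weight * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

With this weighting, a preference collected on an instance where the two models produce near-identical outputs moves the ratings very little, while a preference on a clearly separable instance counts almost as a full Elo update.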

Compare without Despair: Reliable Preference Evaluation with Generation Separability