Feb, 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk
TL;DR
This study investigates non-transitivity in automatic evaluation methods based on large language models (LLMs), particularly within the AlpacaEval framework. We find that LLM judges exhibit non-transitive preferences, which makes model rankings sensitive to the choice of baseline model. To address this, we propose combining round-robin tournaments with the Bradley-Terry preference model, which improves the reliability of rankings, and introduce Swiss-Wise Iterative Matchmaking (Swim) tournaments to retain efficiency.
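To make the proposed aggregation concrete, the sketch below (not the authors' code) fits Bradley-Terry strengths to a round-robin win matrix using the standard MM iteration; the toy win counts, function names, and iteration settings are illustrative assumptions.

```python
import numpy as np

def bradley_terry(wins, n_iter=1000, tol=1e-8):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j in the round-robin.
    Returns a strength vector p (normalized to sum to 1); higher is better.
    """
    m = wins.shape[0]
    games = wins + wins.T            # total comparisons between each pair
    total_wins = wins.sum(axis=1)    # total wins per model
    p = np.ones(m) / m               # uniform initial strengths
    for _ in range(n_iter):
        denom = np.zeros(m)
        for i in range(m):
            for j in range(m):
                if i != j and games[i, j] > 0:
                    denom[i] += games[i, j] / (p[i] + p[j])
        p_new = total_wins / denom   # MM update (Hunter, 2004)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Toy example: a non-transitive judge where A tends to beat B,
# B tends to beat C, and C tends to beat A in head-to-head play.
wins = np.array([
    [0, 7, 3],
    [3, 0, 7],
    [7, 3, 0],
])
strengths = bradley_terry(wins)
print(strengths, np.argsort(-strengths))
```

On this cyclic toy data the fitted strengths come out nearly equal, which is the point of aggregating over a full round-robin: a single baseline-relative comparison could rank any of the three models on top depending on which baseline is chosen.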
Abstract
Automatic evaluation methods based on Large Language Models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, …