BriefGPT.xyz
Apr, 2024
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Kehua Feng, Keyan Ding, Kede Ma, Zhihua Wang, Qiang Zhang...
TL;DR
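Proposes a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition for assessing the capabilities and relative strengths of large language models, and offers insights for further research and development across four skills: knowledge understanding, mathematical reasoning, writing, and coding.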
Abstract
The past years have witnessed a proliferation of large language models (LLMs). Yet, automated and unbiased evaluation of LLMs is challenging due to the inaccuracy of standard metrics in reflecting human preferences…