Feb, 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
Chenyang Lyu, Minghao Wu, Alham Fikri Aji
TL;DR
An empirical study of large language models (LLMs) on multiple-choice questions (MCQs) shows that probability-based evaluation methods have inherent limitations in capturing what the models actually generate. Current evaluation frameworks typically score output probabilities rather than directly generated responses, largely for reasons of computational cost; the results call the validity of such LLM evaluation methods into question and carry implications for future research.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation framewo…