BriefGPT.xyz
Jun, 2024
LLMs' Classification Performance is Overclaimed
Hanzi Xu, Renze Lou, Jiangshu Du, Vahid Mahzoon, Elmira Talebianaraki...
TL;DR
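This study evaluates closed-source and open-source large language models on typical classification tasks, examines whether LLMs can understand the nature of a task when the correct label is absent, and proposes a new benchmark and evaluation metric.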
Abstract
In many classification tasks designed for AI or human to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has tr