BriefGPT.xyz
May, 2023
Evaluating Open-Domain Question Answering in the Era of Large Language Models
Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
TL;DR
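Through human evaluation, we find that InstructGPT achieves a new state of the art on NQ-open and that the true performance of all models is significantly underestimated. More than 50% of lexical matching failures are attributable to semantically equivalent answers, and the ranking produced by regex matching is consistent with human judgments.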
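To make concrete what a lexical matching failure on a semantically equivalent answer looks like, here is a minimal sketch in Python. It assumes the commonly used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the function names `normalize`, `exact_match`, and `regex_match` are hypothetical and are not taken from the paper, whose exact evaluation scripts may differ.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def regex_match(prediction: str, patterns: Sequence[str]) -> bool:
    """True if any gold pattern is found anywhere in the prediction."""
    return any(re.search(p, prediction, flags=re.IGNORECASE) for p in patterns)


if __name__ == "__main__":
    # Surface-form differences are absorbed by normalization.
    print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
    # A semantically equivalent answer still fails lexical matching.
    print(exact_match("NYC", ["New York City"]))
    # Regex matching tolerates extra words around the answer.
    print(regex_match("It was released in 1994.", [r"\b1994\b"]))
```

In this sketch, "NYC" is rejected against the gold answer "New York City" even though a human judge would likely accept it, which is the kind of failure the TL;DR attributes to semantically equivalent answers.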
Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails com
→