In the automatic evaluation of generative question answering (GenQA) systems,
it is difficult to assess the correctness of generated answers because the
answers are free-form. In particular, widely used n-gram similarity metrics
often fail to identify incorrect answers because they weight all tokens
equally. To alleviate this problem, we propose KPQA-metric, a new
metric for evaluating the correctness of GenQA. Specifically, our metric
assigns a different weight to each token via keyphrase prediction, thereby
judging whether a generated answer sentence captures the key meaning of the
reference answer. To evaluate our metric, we collect high-quality human
judgments of correctness on two GenQA datasets. Using our human-evaluation
datasets, we show that our proposed metric has a significantly higher
correlation with human judgments than existing metrics. The code is available
at this https URL
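
The token-weighting idea can be illustrated with a minimal sketch. It assumes a unigram F1 as the underlying n-gram metric and uses hand-picked weights in place of a trained keyphrase predictor. The function name `weighted_unigram_f1`, the example sentences, and the weight values are illustrative assumptions, not the released implementation.

```python
from collections import Counter


def weighted_unigram_f1(candidate, reference, cand_weights, ref_weights):
    """Unigram F1 in which each token contributes its weight instead of 1.

    The weights stand in for keyphrase-prediction scores. Here they are
    supplied by hand (hypothetical values), not produced by a trained model.
    """
    if len(candidate) != len(cand_weights) or len(reference) != len(ref_weights):
        raise ValueError("each token needs exactly one weight")

    def matched_weight(tokens, weights, other_tokens):
        # Sum the weights of tokens that also occur in `other_tokens`,
        # matching each token at most as many times as it occurs there.
        remaining = Counter(other_tokens)
        total = 0.0
        for tok, w in zip(tokens, weights):
            if remaining[tok] > 0:
                remaining[tok] -= 1
                total += w
        return total

    cand_total = sum(cand_weights)
    ref_total = sum(ref_weights)
    if cand_total == 0 or ref_total == 0:
        return 0.0

    precision = matched_weight(candidate, cand_weights, reference) / cand_total
    recall = matched_weight(reference, ref_weights, candidate) / ref_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example for the question "Where is the Eiffel Tower?".
reference = "the eiffel tower is in paris".split()
candidate = "the eiffel tower is in london".split()  # incorrect answer

# Uniform weights reproduce the ordinary unigram F1.
uniform = [1.0] * 6
plain_f1 = weighted_unigram_f1(candidate, reference, uniform, uniform)

# Assumed keyphrase weights: only the location token carries high weight.
key_weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
kp_f1 = weighted_unigram_f1(candidate, reference, key_weights, key_weights)

print(f"uniform F1: {plain_f1:.3f}, keyphrase-weighted F1: {kp_f1:.3f}")
```

With uniform weights the incorrect answer still scores highly because five of its six tokens overlap with the reference. Once the answer-bearing token is weighted more heavily, the mismatch on that token dominates and the score drops.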

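This work proposes KPQA-metric, a new metric for evaluating the correctness of generative question answering systems. When scoring a generated answer, it assigns different weights to tokens via keyphrase prediction. Experiments on human-evaluation datasets show that KPQA-metric correlates more strongly with human judgments than existing metrics do.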