In the automatic evaluation of generative question answering (GenQA) systems,
it is difficult to assess the correctness of generated answers because the
answers are free-form. In particular, widely used n-gram similarity metrics
often fail to identify incorrect answers because they weight all tokens
equally. To alleviate this problem, we propose KPQA-metric, a new
metric for evaluating the correctness of GenQA. Specifically, our new metric
assigns a different weight to each token via keyphrase prediction, thereby
judging whether a generated answer sentence captures the key meaning of the
reference answer. To evaluate our metric, we create high-quality human
judgments of correctness on two GenQA datasets. Using our human-evaluation
datasets, we show that our proposed metric has a significantly higher
correlation with human judgments than existing metrics. The code is available
at this https URL
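
As a rough illustration of the idea described above (not the authors' implementation), the sketch below computes a token-overlap F1 in which each token contributes a weight instead of a count of one. The function name `weighted_unigram_f1`, the example sentences, and the hand-assigned weights are all assumptions made for illustration; in KPQA-metric the weights would come from a trained keyphrase predictor, which is not reproduced here.

```python
from collections import Counter


def weighted_unigram_f1(candidate, reference, weight):
    """Token-overlap F1 where each token counts weight(token) instead of 1."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = cand & ref  # multiset intersection: clipped token matches
    matched = sum(weight(t) * n for t, n in overlap.items())
    cand_total = sum(weight(t) * n for t, n in cand.items())
    ref_total = sum(weight(t) * n for t, n in ref.items())
    if matched == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = matched / cand_total
    recall = matched / ref_total
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one reference, one wrong answer, one short right answer.
reference = "the capital of france is paris".split()
wrong = "the capital of france is london".split()
right = ["paris"]


def uniform(token):
    """Every token counts equally, as in a plain n-gram overlap metric."""
    return 1.0


# Hand-assigned stand-ins for the output of a keyphrase predictor (assumption).
KEY_WEIGHTS = {"paris": 0.9, "london": 0.9}


def keyphrase_weight(token):
    """High weight for answer-bearing tokens, low weight for everything else."""
    return KEY_WEIGHTS.get(token, 0.1)


for name, weight in [("uniform", uniform), ("keyphrase", keyphrase_weight)]:
    print(name,
          "wrong:", round(weighted_unigram_f1(wrong, reference, weight), 3),
          "right:", round(weighted_unigram_f1(right, reference, weight), 3))
```

With uniform weights the wrong answer scores higher because it shares five of six tokens with the reference; with the keyphrase-style weights the ordering flips, which is the behavior the abstract attributes to weighting tokens unequally.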

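This work proposes KPQA-metric, a new metric for evaluating the correctness of generative question answering systems. It scores generated answers by assigning different weights to tokens via keyphrase prediction, and experiments on human-evaluation datasets show that KPQA-metric correlates more strongly with human judgments than existing metrics do.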

KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

There has long been criticism of using $n$-gram based similarity metrics,
such as BLEU, NIST, etc., for evaluating the performance of NLG systems.
However, these metrics remain popular and have recently been used
for evaluating the performance of systems which automatically generate
questions from documents, knowledge graphs, images, etc. Given the rising
interest in such automatic question generation (AQG) systems, it is important
to objectively examine whether these metrics are suitable for this task. In
particular, it is important to verify whether such metrics used for evaluating
AQG systems focus on the answerability of the generated question by preferring
questions which contain all relevant information such as question type
(Wh-types), entities, relations, etc. In this work, we show that current
automatic evaluation metrics based on $n$-gram similarity do not always
correlate well with human judgments about the answerability of a question. To
alleviate this problem and as a first step towards better evaluation metrics
for AQG, we introduce a scoring function to capture answerability and show that
when this scoring function is integrated with existing metrics, they correlate
significantly better with human judgments. The scripts and data developed as a
part of this work are made publicly available at
this https URL
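
One way to picture "integrating a scoring function with an existing metric" is the minimal sketch below. It is an assumption-laden illustration, not the scoring function from the paper: the category recall, the equal category weights, the linear interpolation parameter `delta`, and all function names are hypothetical, and the base metric score (for example a BLEU value) is simply passed in as a number.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "which", "why", "how"}


def category_recall(candidate, reference, in_category):
    """Fraction of the reference tokens in a category that the candidate keeps."""
    ref_items = [t for t in reference if in_category(t)]
    if not ref_items:
        return 1.0  # nothing of this category to recover
    cand = set(candidate)
    return sum(t in cand for t in ref_items) / len(ref_items)


def answerability(candidate, reference, entities, w_qtype=0.5, w_entity=0.5):
    """Weighted recall of question words and entities (weights are assumptions)."""
    qtype = category_recall(candidate, reference, lambda t: t in QUESTION_WORDS)
    entity = category_recall(candidate, reference, lambda t: t in entities)
    return w_qtype * qtype + w_entity * entity


def combined_score(base_metric, answerability_score, delta=0.5):
    """Linear interpolation between an existing metric and answerability."""
    return delta * answerability_score + (1 - delta) * base_metric


# Hypothetical example: the entity list is supplied by hand for illustration.
reference = "who directed the movie titanic".split()
entities = {"titanic"}
keeps_entity = "who directed titanic".split()
drops_entity = "who directed the movie".split()

print(answerability(keeps_entity, reference, entities))
print(answerability(drops_entity, reference, entities))
```

The second candidate shares more $n$-grams with the reference but drops the entity, so it cannot be answered; the answerability term penalizes it even when a base $n$-gram score would not.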

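This paper examines the use of $n$-gram similarity metrics (such as BLEU and NIST) for evaluating natural language generation (NLG) systems, in particular systems that automatically generate questions from documents, knowledge graphs, images, and so on. It finds that current automatic evaluation metrics do not always reliably assess the answerability of generated questions. To address this, the paper proposes a scoring function and integrates it with existing metrics, which then correlate significantly better with human judgments.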