This study focuses on the evaluation of Open Question Answering (Open-QA)
tasks, which have become vital in the realm of artificial intelligence. Current
automatic evaluation methods have shown limitations, indicating that human
evaluation remains the most reliable approach. We introduce a new task,
QA Evaluation (QA-Eval), designed to assess the accuracy of AI-generated
answers in relation to standard answers within Open-QA. We evaluate automatic
evaluation methods against human-annotated results, using accuracy and F1 score
to measure their performance. Specifically, the work investigates methods that
show high correlation with human evaluations, deeming them more reliable. We
also discuss the pitfalls of current methods, such as their inability to
accurately judge responses that contain excessive information. The dataset
generated from this work is expected to facilitate the development of more
effective automatic evaluation tools. We believe this new QA-Eval task and
corresponding dataset will prove valuable for future research in this area.
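
As a rough illustration of how an automatic evaluator could be scored against
human annotations with accuracy and F1, here is a minimal sketch. It assumes
each judgment is a binary correct/incorrect label and that F1 is computed with
"correct" as the positive class; the abstract does not specify either detail,
and the function name `accuracy_and_f1` is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(human: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Score an automatic evaluator's judgments against human judgments.

    Assumption (not stated in the abstract): labels are binary, and F1 treats
    "answer judged correct" (True) as the positive class.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted must have the same length")
    if not human:
        raise ValueError("at least one judgment is required")

    pairs = list(zip(human, predicted))
    true_pos = sum(1 for h, p in pairs if h and p)
    false_pos = sum(1 for h, p in pairs if not h and p)
    false_neg = sum(1 for h, p in pairs if h and not p)
    agreements = sum(1 for h, p in pairs if h == p)

    accuracy = agreements / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denominator = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denominator if denominator else 0.0
    return accuracy, f1


# Illustrative labels only: True means "the AI-generated answer is correct".
human_labels = [True, True, False, False, True]
evaluator_labels = [True, False, False, True, True]
acc, f1 = accuracy_and_f1(human_labels, evaluator_labels)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

An evaluator that agrees more often with the human labels scores higher on both
measures, which is the sense in which the abstract treats it as more reliable.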

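This study evaluates Open Question Answering tasks in artificial intelligence
and proposes the QA Evaluation task together with a corresponding dataset.
Given the limitations of automatic evaluation methods, it takes human
annotations as the reference and measures the methods by accuracy and F1 score.
It examines which methods correlate highly with human evaluation and are
therefore more reliable, as well as the shortcomings of current methods. The
resulting dataset is expected to support the development of more effective
automatic evaluation tools.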

Evaluating the Evaluation of Open Question Answering Systems

Evaluating Open Question Answering Evaluation

A flaw in QA evaluation is that annotations often only provide one gold
answer. Thus, model predictions semantically equivalent to the answer but
superficially different are considered incorrect. This work explores mining
alias entities from knowledge bases and using them as additional gold answers
(i.e., equivalent answers). We incorporate answers for two settings: evaluation
with additional answers and model training with equivalent answers. We analyse
three QA benchmarks: Natural Questions, TriviaQA, and SQuAD. Answer expansion
increases the exact match score on all datasets for evaluation, while
incorporating it helps model training over real-world datasets. We ensure the
additional answers are valid through a human post hoc evaluation.
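
To make the idea of answer expansion concrete, the sketch below scores exact
match against a gold answer set that has been extended with aliases. It is an
illustration under stated assumptions, not the paper's implementation: the
normalization follows the common SQuAD-style convention (lowercase, strip
punctuation and articles), and `alias_table` is a hypothetical hand-written
stand-in for aliases mined from a knowledge base.

```python
import re
import string
from typing import Dict, List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


def expand_gold_answers(gold_answers: Sequence[str],
                        alias_table: Dict[str, List[str]]) -> List[str]:
    """Return the gold answers plus any aliases listed for them.

    alias_table is hypothetical: it maps a normalized answer string to a list
    of alias strings, standing in for a knowledge-base lookup.
    """
    expanded = list(gold_answers)
    for gold in gold_answers:
        for alias in alias_table.get(normalize_answer(gold), []):
            if alias not in expanded:
                expanded.append(alias)
    return expanded


# Illustrative data only.
alias_table = {"united states": ["USA", "United States of America"]}
gold = ["The United States"]

print(exact_match("USA", gold))                                    # False
print(exact_match("USA", expand_gold_answers(gold, alias_table)))  # True
```

With only the single gold answer, the semantically equivalent prediction "USA"
counts as wrong; once the aliases are added it counts as correct, which is why
expansion can only raise or preserve the exact match score at evaluation time.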

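This paper explores using alias entities mined from knowledge bases as
additional gold answers to improve both the evaluation and the training of QA
systems, and validates the approach on three QA benchmark datasets.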

The Answer Equivalence Problem in Open-Domain Question Answering

What's in a Name? Answer Equivalence For Open-Domain Question Answering

When Question-Answering (QA) systems are deployed in the real world, users
query them through a variety of interfaces, such as speaking to voice
assistants, typing questions into a search engine, or even translating
questions to languages supported by the QA system. While there has been
significant community attention devoted to identifying correct answers in
passages assuming a perfectly formed question, we show that components in the
pipeline that precede an answering engine can introduce varied and considerable
sources of error, and performance can degrade substantially based on these
upstream noise sources even for powerful pre-trained QA models. We conclude
that there is substantial room for progress before QA systems can be
effectively deployed, highlight the need for QA evaluation to expand to
consider real-world use, and hope that our findings will spur greater community
interest in the issues that arise when our systems actually need to be of
utility to humans.

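This paper studies the problems that Question-Answering systems face in
real-world deployment. It finds that pipeline components preceding the
answering engine can introduce varied and considerable errors, and that
performance degrades substantially under these upstream noise sources even for
powerful pre-trained QA models. The authors argue that there is still
substantial room for improvement before QA systems can be deployed effectively.
They therefore stress that QA evaluation needs to expand to consider real-world
use, and hope that their findings will draw wider attention.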