This study focuses on the evaluation of Open Question Answering (Open-QA) tasks, which have become central to artificial intelligence research. Current automatic evaluation methods have clear limitations, and human evaluation remains the most reliable approach. We introduce a new task, QA Evaluation (QA-Eval), which asks an evaluator to judge whether an AI-generated answer is correct with respect to the standard answers in Open-QA. We assess automatic evaluation methods against human-annotated results, using accuracy and F1 score to measure their performance. Methods that correlate more strongly with human judgments are deemed more reliable. We also discuss the pitfalls of current methods, such as their inability to accurately judge responses that contain excessive information. The resulting dataset is expected to facilitate the development of more effective automatic evaluation tools, and we believe the QA-Eval task and its dataset will prove valuable for future research in this area.
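As a rough illustration of how an automatic evaluator could be scored against human annotations with accuracy and F1, the sketch below treats each judgment as a boolean ("the AI-generated answer is correct"). The function name `accuracy_and_f1` and the sample labels are hypothetical and are not taken from the QA-Eval dataset. Treating "correct" as the positive class for F1 is also an assumption.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(human: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Score an automatic evaluator's verdicts against human annotations.

    human[i]     -- True if annotators judged answer i correct (reference label)
    predicted[i] -- True if the automatic evaluator judged answer i correct
    Returns (accuracy, F1), with "correct" as the positive class (an assumption).
    """
    if len(human) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("at least one labelled example is required")

    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h and p)        # both say "correct"
    fp = sum(1 for h, p in pairs if not h and p)    # evaluator too lenient
    fn = sum(1 for h, p in pairs if h and not p)    # evaluator too strict
    agree = sum(1 for h, p in pairs if h == p)

    accuracy = agree / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0           # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Hypothetical labels for five question-answer pairs.
human_labels = [True, True, False, True, False]
evaluator_labels = [True, False, False, True, True]
acc, f1 = accuracy_and_f1(human_labels, evaluator_labels)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

An evaluator whose verdicts track the human labels more closely scores higher on both metrics, which is the sense in which such a method would be considered more reliable.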

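This study addresses the evaluation of Open Question Answering tasks in the field of cognitive intelligence and proposes the QA Evaluation task together with a corresponding dataset. Given the limitations of automatic evaluation methods, it uses human annotation as the reference for judging the correctness of AI-generated answers and measures evaluation methods by accuracy and F1 score. It also examines which methods correlate most closely with human judgment, and are therefore more reliable, as well as the shortcomings of current methods. The resulting dataset is expected to promote the development of more effective automatic evaluation tools.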

Evaluating the Evaluation of Open Question Answering Systems