We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). RoMQA contains clusters of questions that are derived from related constraints mined from the Wikidata knowledge graph. RoMQA evaluates robustness of QA models to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and have, on average, many more correct answers. In addition, human annotators rate RoMQA questions as more natural or more likely to be asked by people. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging: zero-shot and few-shot models perform similarly to naive baselines, while supervised retrieval methods perform well below gold evidence upper bounds. Moreover, existing models are not robust to variations in question constraints, but can be made more robust by tuning on clusters of related questions. Our results show that RoMQA is a challenging benchmark for large language models and that it provides a quantifiable test for building more robust QA methods.
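
The worst-case evaluation within each question cluster can be illustrated with a minimal sketch. The abstract does not state which per-question metric is used, so the set-based F1 below is an assumption, as are the field names `cluster_id`, `predicted`, and `gold` and the choice to average worst-case scores across clusters.

```python
from collections import defaultdict


def set_f1(predicted, gold):
    """F1 between a predicted answer set and a gold answer set (assumed metric)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def worst_case_by_cluster(examples):
    """Map each cluster id to the lowest per-question score in that cluster."""
    scores = defaultdict(list)
    for ex in examples:
        # Field names here are hypothetical, chosen only for illustration.
        scores[ex["cluster_id"]].append(set_f1(ex["predicted"], ex["gold"]))
    return {cid: min(vals) for cid, vals in scores.items()}


def robustness_score(examples):
    """Average the worst-case scores over clusters (an assumed aggregation)."""
    worst = worst_case_by_cluster(examples)
    return sum(worst.values()) / len(worst) if worst else 0.0


examples = [
    {"cluster_id": "c1", "predicted": ["a", "b"], "gold": ["a", "b"]},
    {"cluster_id": "c1", "predicted": ["a"], "gold": ["a", "b"]},
    {"cluster_id": "c2", "predicted": ["x"], "gold": ["y"]},
]
print(worst_case_by_cluster(examples))
print(robustness_score(examples))
```

Taking the minimum rather than the mean within a cluster means a model is credited only if it answers every related variant of a question well, which is the sense of robustness the abstract describes.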

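RoMQA is the first benchmark for robust, multi-evidence, multi-answer question answering. It builds clusters of questions from related constraints in the Wikidata knowledge graph and evaluates how robust QA models are to varying constraints by measuring worst-case performance within each question cluster. Compared to prior QA datasets, RoMQA has more human-written questions that require reasoning over more evidence text and that have, on average, more correct answers. In addition, human annotators rate RoMQA questions as more natural or more likely to be asked by people.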

RoMQA: A Benchmark for Robust, Multi-Evidence, Multi-Answer Question Answering