Automatic question generation (AQG) has broad applicability in domains such as tutoring systems, conversational agents, healthcare literacy, and information retrieval. Existing efforts at AQG have been limited to short answers of up to two or three sentences. However, several real-world applications require question generation from answers that span several sentences. Therefore, we propose a novel evaluation benchmark to assess the performance of existing AQG systems on long-text answers. We leverage the large-scale open-source Google Natural Questions dataset to create this long-answer AQG benchmark. We empirically demonstrate that the performance of existing AQG methods degrades significantly as the length of the answer increases. Transformer-based methods outperform other existing AQG methods on long answers in terms of both automatic and human evaluation. However, we still observe degradation in the performance of our best-performing models with increasing sentence length, suggesting that long-answer QA is a challenging benchmark task for future research.

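This work proposes a new evaluation benchmark for assessing existing automatic question generation (AQG) systems, particularly on long-text answers. The study shows that the performance of existing AQG methods degrades significantly as answer length increases. Transformer-based models outperform other AQG methods on long answers but still show degraded performance, which indicates that long-answer QA is a challenging benchmark task for future research.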

A study of automatic question generation from long answers