In a number of question answering (QA) benchmarks, pretrained models have reached human parity through fine-tuning on the order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available. We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering. To address this, we propose a new pretraining scheme that is more suitable for extractive question answering. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks, e.g., 72.7 F1 with only 128 examples on SQuAD, while maintaining competitive (and sometimes better) performance in the high-resource setting. Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
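
The masking step described above can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not the actual pretraining pipeline: the token name `[QUESTION]`, the function names, the whitespace tokenization, the `max_len` limit on span length, and the choice to always keep the first occurrence unmasked are all hypothetical, and no filtering of trivial repeats (such as function words) is attempted.

```python
from collections import defaultdict

QUESTION = "[QUESTION]"  # hypothetical name for the special mask token


def find_recurring_spans(tokens, max_len=3):
    """Group non-overlapping n-grams that occur at least twice.

    Longer n-grams are claimed first, so "Nobel Prize" is preferred
    over "Nobel" and "Prize" taken separately.
    """
    taken = [False] * len(tokens)
    clusters = []
    for n in range(max_len, 0, -1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for starts in positions.values():
            kept = []
            for s in starts:
                free = not any(taken[s:s + n])
                # Skip occurrences overlapping an already claimed span
                # or the previous occurrence of the same n-gram.
                if free and (not kept or s >= kept[-1] + n):
                    kept.append(s)
            if len(kept) > 1:
                for s in kept:
                    taken[s:s + n] = [True] * n
                clusters.append([(s, s + n) for s in kept])
    return clusters


def mask_recurring_spans(tokens, max_len=3):
    """Return (masked_tokens, targets).

    targets maps the index of each [QUESTION] token in masked_tokens to
    the (start, end) range, end exclusive, of the unmasked occurrence.
    """
    # Assumption: keep the first occurrence of each set, mask the rest.
    masked_starts = {}
    for cluster in find_recurring_spans(tokens, max_len):
        answer = cluster[0]
        for start, end in cluster[1:]:
            masked_starts[start] = (end, answer)

    masked, new_index, pending = [], {}, []
    i = 0
    while i < len(tokens):
        if i in masked_starts:
            end, answer = masked_starts[i]
            pending.append((len(masked), answer))
            masked.append(QUESTION)  # whole span collapses to one token
            i = end
        else:
            new_index[i] = len(masked)
            masked.append(tokens[i])
            i += 1

    # Re-express answer spans in the coordinates of the masked sequence.
    targets = {q: (new_index[a], new_index[b - 1] + 1)
               for q, (a, b) in pending}
    return masked, targets


tokens = ("Marie Curie won the Nobel Prize in 1903 . "
          "In 1911 Marie Curie received a second Nobel Prize").split()
masked, targets = mask_recurring_spans(tokens)
print(" ".join(masked))
# Marie Curie won the Nobel Prize in 1903 . In 1911 [QUESTION] received a second [QUESTION]
print(targets)
# {11: (0, 2), 15: (4, 6)}
```

Each entry in `targets` pairs one `[QUESTION]` position with the span the model would be asked to select, which mirrors how a question representation points to an answer span during fine-tuning.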

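On several question answering benchmarks, pretrained models have reached human-level performance after fine-tuning. We instead study the more realistic few-shot setting and find that standard models perform poorly, which highlights the gap between current pretraining objectives and question answering. To address this, we propose a new pretraining scheme tailored to question answering, Recurring Span Selection. The scheme is well suited to passages that contain multiple recurring spans, and it achieves surprisingly strong results on SQuAD with very little data (for example, 72.7 F1 with only 128 training examples) while maintaining comparable performance in the high-resource setting.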

Few-Shot Question Answering by Pretraining Span Selection