Dense retrievers for open-domain question answering (ODQA) have been shown to achieve impressive performance by training on large datasets of question-passage pairs. We investigate whether dense retrievers can be learned in a self-supervised fashion, and applied effectively without any annotations. We observe that existing pretrained models for retrieval struggle in this scenario, and propose a new pretraining scheme designed for retrieval: recurring span retrieval. We use recurring spans across passages in a document to create pseudo examples for contrastive learning. The resulting model -- Spider -- performs surprisingly well without any examples on a wide range of ODQA datasets, and is competitive with BM25, a strong sparse baseline. In addition, Spider often outperforms strong baselines like DPR trained on Natural Questions, when evaluated on questions from other datasets. Our hybrid retriever, which combines Spider with BM25, improves over its components across all datasets, and is often competitive with in-domain DPR models, which are trained on tens of thousands of examples.

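This paper presents an ODQA approach based on unsupervised pretraining. It uses recurring span retrieval to create pseudo examples from documents for contrastive learning. By controlling the term overlap between the pseudo query and its relevant passage, the pretraining can model both lexical and semantic relations between them. The resulting model, named "Spider", performs strongly without requiring any labeled training data.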
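
To make the idea of recurring span retrieval more concrete, here is a minimal sketch of how pseudo examples might be built from one document. It is an illustration under stated assumptions, not the authors' implementation: the function names (`find_recurring_spans`, `remove_span`, `make_pseudo_example`), the whitespace tokenization, the n-gram length limits, and the `keep_span` switch are all hypothetical choices made for this example. A span that occurs in two passages yields a query passage and a positive passage; a passage without the span serves as a negative. Dropping the span from the query is one simple way to reduce term overlap between query and positive.

```python
import random
from collections import defaultdict


def find_recurring_spans(passages, min_len=2, max_len=4):
    """Map each n-gram to the passages containing it, keeping only
    n-grams that occur in at least two different passages."""
    occurrences = defaultdict(set)
    for idx, tokens in enumerate(passages):
        for n in range(min_len, max_len + 1):
            for start in range(len(tokens) - n + 1):
                occurrences[tuple(tokens[start:start + n])].add(idx)
    return {span: sorted(ids) for span, ids in occurrences.items() if len(ids) >= 2}


def remove_span(tokens, span):
    """Return a copy of tokens with every occurrence of span removed."""
    out, i, n = [], 0, len(span)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == span:
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def make_pseudo_example(passages, span, passage_ids, keep_span, rng):
    """Build one (query, positive, negative) triple from a recurring span.

    keep_span is a hypothetical switch: True keeps the shared span in the
    query (high lexical overlap), False removes it (lower overlap).
    """
    query_id, positive_id = rng.sample(passage_ids, 2)
    negative_ids = [i for i in range(len(passages)) if i not in passage_ids]
    query = list(passages[query_id])
    if not keep_span:
        query = remove_span(query, span)
    return {
        "query": query,
        "positive": passages[positive_id],
        "negative": passages[rng.choice(negative_ids)] if negative_ids else None,
    }


# Toy document split into three passages (made-up text for illustration).
document = [
    "the river thames flows through london".split(),
    "london bridge crosses the river thames".split(),
    "paris is the capital of france".split(),
]
spans = find_recurring_spans(document)
span = ("the", "river", "thames")
example = make_pseudo_example(
    document, span, spans[span], keep_span=False, rng=random.Random(0)
)
print(example)
```

In a contrastive setup, triples like `example` would be fed to the retriever so that the query is scored closer to the positive passage than to the negative one. Mixing examples with and without the span in the query is one way to expose the model to both lexical and semantic matching.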

Unsupervised learning for passage retrieval