Domain-matched Pre-training Tasks for Dense Retrieval

July 2021

Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin...

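TL;DR: With appropriate pre-training of large bi-encoder models on 65 million synthetic questions and 200 million post-comment pairs from Reddit conversations, it is possible to achieve substantially better performance than supervised baselines on information retrieval and dialogue retrieval benchmarks.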

Abstract

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional