Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. Because of latency constraints, this has made distilled and dense models the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of parameters and early query-document interaction play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size yields only marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that rerankers largely outperform dense models of similar size in several tasks. Our largest reranker reaches the state of the art in 12 of the 18 datasets of the Benchmark-IR (BEIR) and surpasses the previous state of the art by 3 points on average. Finally, we confirm that in-domain effectiveness is not a good indicator of zero-shot effectiveness. Code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git

No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval