We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
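
The build-with-proxy, query-with-ground-truth idea can be illustrated with a minimal sketch. It is not the DiskANN or Cover Tree construction analyzed in the paper: a brute-force k-nearest-neighbor graph stands in for the index, and the names `build_knn_graph`, `bi_metric_search`, `proxy`, `truth`, and `budget` are all hypothetical. The assumed proxy distorts the ground-truth distance by a factor between 1 and 2, and the query stops after a fixed number of expensive calls.

```python
import heapq
import math
import random


def build_knn_graph(points, proxy, k):
    """Brute-force k-NN graph built with the cheap proxy metric only."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: proxy(p, points[j]))
        graph[i] = others[:k]
    return graph


def bi_metric_search(points, graph, query, truth, budget, start=0):
    """Best-first search over the proxy-built graph, steered by the
    expensive metric and stopped after `budget` calls to it (budget >= 1)."""
    calls = 0
    dist = {}  # cache of expensive distances from the query

    def expensive(i):
        nonlocal calls
        if i not in dist:
            calls += 1
            dist[i] = truth(query, points[i])
        return dist[i]

    frontier = [(expensive(start), start)]
    expanded = set()
    while frontier and calls < budget:
        _, u = heapq.heappop(frontier)
        if u in expanded:
            continue
        expanded.add(u)
        for v in graph[u]:
            if v not in dist and calls < budget:
                heapq.heappush(frontier, (expensive(v), v))
    # Return the closest point among those actually evaluated.
    best = min(dist, key=dist.get)
    return best, calls


def truth(a, b):
    # Stand-in for the accurate, expensive metric (assumption: Euclidean).
    return math.dist(a, b)


def proxy(a, b):
    # Stand-in for the cheap metric: the true distance distorted by a
    # deterministic, symmetric factor in [1, 2] (assumption).
    return math.dist(a, b) * (1.0 + abs(math.sin(1000.0 * (a[0] + b[0]))))


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
graph = build_knn_graph(points, proxy, k=8)   # index uses the proxy only
query = (0.5, 0.5)
best, calls = bi_metric_search(points, graph, query, truth, budget=40)
print(best, calls, truth(query, points[best]))
```

In this sketch the expensive metric is never touched at build time, and at query time its use is capped by `budget`. A re-ranking baseline would instead retrieve candidates with the proxy and spend the same budget scoring only that fixed candidate list.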

A Bi-metric Framework for Fast Similarity Search