Retrieval is a core component of open-domain NLP tasks. In these tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name, along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets act as a challenging test of entity disambiguation. We create AmbER sets for three popular open-domain tasks (fact checking, slot filling, and question answering) and evaluate a diverse set of retrievers. We find that the retrievers exhibit popularity bias, significantly under-performing on rarer entities that share a name; for example, they are twice as likely to retrieve erroneous documents on queries for the less popular entity under the same name. These experiments on AmbER sets show their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.
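
The definition above (entities that share a name, plus queries about each of them) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the `popularity` field, the `popular`/`rare` split, the `retrieve` callable, and the toy "Mercury" data are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entity:
    entity_id: str            # hypothetical identifier of the entity's gold document
    popularity: int           # hypothetical popularity signal (e.g. page views)
    queries: List[str] = field(default_factory=list)


@dataclass
class AmberSet:
    name: str                 # the shared, ambiguous name
    entities: List[Entity]    # all entities that go by that name

    def most_popular(self) -> Entity:
        return max(self.entities, key=lambda e: e.popularity)


def error_rates(sets: List[AmberSet],
                retrieve: Callable[[str], str]) -> Dict[str, float]:
    """Fraction of queries whose top retrieved document is wrong, split by
    whether the queried entity is the most popular one under its name.

    `retrieve(query)` is an assumed interface returning the entity_id of
    the top-ranked document.
    """
    counts = {"popular": [0, 0], "rare": [0, 0]}  # bucket -> [errors, total]
    for amber_set in sets:
        top = amber_set.most_popular()
        for entity in amber_set.entities:
            bucket = "popular" if entity is top else "rare"
            for query in entity.queries:
                counts[bucket][1] += 1
                if retrieve(query) != entity.entity_id:
                    counts[bucket][0] += 1
    return {b: (err / n if n else 0.0) for b, (err, n) in counts.items()}


# Toy data, invented for this sketch (not taken from the benchmark).
toy_set = AmberSet(
    name="Mercury",
    entities=[
        Entity("E_planet", popularity=1000,
               queries=["Which star does Mercury orbit?"]),
        Entity("E_band", popularity=10,
               queries=["What genre of music does Mercury play?"]),
    ],
)


def always_popular(query: str) -> str:
    # A deliberately popularity-biased retriever: ignores the query context.
    return toy_set.most_popular().entity_id


rates = error_rates([toy_set], always_popular)
print(rates)
```

A retriever that ignores query context and always returns the most popular entity makes no errors on the popular bucket and fails every query in the rare bucket. A gap between the two error rates is the kind of popularity bias the abstract describes.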

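To evaluate the entity disambiguation capabilities of retrievers, we propose an evaluation benchmark, Ambiguous Entity Retrieval (AmbER) sets. In this study, we create AmbER sets for three popular open-domain tasks and use them to evaluate retrievers. We find that the retrievers exhibit popularity bias: retrieval performance drops markedly for the less popular entities that share a name. AmbER sets demonstrate their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.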

Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP