With the explosion of multimedia content in recent years, natural language video localization (NLVL), which focuses on detecting the video moment that matches a given natural language query, has become a critical problem. However, no previous research has explored localizing a moment within a large corpus in which multiple positive and negative videos exist. In this paper, we propose the MVMR (Massive Videos Moment Retrieval) task, which aims to localize video frames from a massive set of videos given a text query. For this task, we suggest methods for constructing datasets by applying similarity filtering to existing video localization datasets, and we introduce three MVMR datasets. Specifically, we employ embedding-based text similarity matching and video-language grounding techniques to calculate the relevance score between a target query and each video, and we use these scores to define the positive and negative sets. For the proposed MVMR task, we further develop a strong model, the Reliable Mutual Matching Network (RMMN), which employs a contrastive learning scheme that selectively filters reliable and informative negatives, making the model more robust on the MVMR task. Experimental results on the introduced datasets reveal that existing NLVL models are easily distracted by negative video frames, whereas our model performs significantly better.
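The similarity filtering step described above can be illustrated with a minimal sketch. It covers only the embedding-based text similarity part, not the video-language grounding part, and it is not the authors' implementation. The function names, the two thresholds, the choice to discard videos with in-between scores, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_pool(query_vec, video_query_vecs, pos_thresh=0.9, neg_thresh=0.5):
    """Split a video pool into positives and negatives for one target query.

    video_query_vecs maps a video id to the embeddings of the queries that
    are annotated on that video. Both thresholds are hypothetical values.
    """
    positives, negatives = [], []
    for video_id, vecs in video_query_vecs.items():
        # Relevance score: best match between the target query and any
        # query already annotated on this video.
        score = max((cosine(query_vec, v) for v in vecs), default=0.0)
        if score >= pos_thresh:
            positives.append(video_id)
        elif score <= neg_thresh:
            negatives.append(video_id)
        # Assumption: videos with in-between scores are ambiguous and
        # are left out of both sets.
    return positives, negatives


# Toy embeddings standing in for real sentence embeddings.
pool = {
    "vid_a": [[1.0, 0.0], [0.0, 1.0]],
    "vid_b": [[0.0, 1.0]],
    "vid_c": [[0.7, 0.7]],
}
positives, negatives = split_pool([1.0, 0.0], pool)
```

In this toy pool, `vid_a` contains a query identical to the target and becomes a positive, `vid_b` is orthogonal to the target and becomes a negative, and `vid_c` falls between the two thresholds and is excluded from both sets.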

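In recent years, the explosive growth of multimedia content has made natural language video localization a critical problem. This paper introduces the Massive Videos Moment Retrieval (MVMR) task, which localizes video frames within a large collection of videos. We propose a method for constructing datasets and introduce three MVMR datasets. For this task, we also develop a strong model, the Reliable Mutual Matching Network (RMMN), which improves robustness by applying contrastive learning to reliable and informative negative samples. Experimental results show that our model has a significant performance advantage over existing NLVL models on the MVMR task.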

MVMR: Evaluating Natural Language Video Localization Bias over Multiple Reliable Videos Pool