In this paper, we address a novel task: weakly-supervised spatio-temporal grounding of natural sentences in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the sentence, without relying on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, is extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor, which exploits their fine-grained relationships to characterize their matching behaviors. In addition to a ranking loss, we introduce a novel diversity loss to train the attentive interactor; it strengthens the matching behaviors of reliable instance-sentence pairs and penalizes those of unreliable ones. We also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches.
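As a rough illustration of how a ranking loss and a diversity loss could act on instance-sentence matching scores, the following minimal sketch may help. It is not the formulation used in the paper: the function names, the max-over-instances video score, the hinge margin, and the choice of softmax entropy as the diversity term are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Normalize raw matching scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_score(instance_scores):
    """Assumed aggregation: a video matches a sentence as well as its best tube."""
    return max(instance_scores)


def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss pushing a matched video-sentence pair above a mismatched one.

    pos_scores: instance scores for the matched video-sentence pair.
    neg_scores: instance scores for a mismatched pair.
    """
    return max(0.0, margin - video_score(pos_scores) + video_score(neg_scores))


def diversity_loss(instance_scores):
    """Assumed diversity term: entropy of the softmax-normalized scores.

    Minimizing it sharpens the distribution, so a few confident
    instance-sentence pairs are strengthened and the rest are suppressed.
    """
    probs = softmax(instance_scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(pos_scores, neg_scores, margin=1.0, weight=1.0):
    """Weighted sum of the two terms; the weight is a hypothetical hyperparameter."""
    return ranking_loss(pos_scores, neg_scores, margin) + weight * diversity_loss(pos_scores)
```

Under these assumptions, uniform scores over the instances give the largest diversity penalty, while a single dominant instance gives a penalty close to zero, which matches the intuition of rewarding reliable pairs and penalizing unreliable ones.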

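We propose a new task, weakly-supervised spatio-temporal grounding of natural sentences in video, and use an attention mechanism to localize the spatio-temporal region of a video that semantically matches a given sentence. A diversity loss is introduced to strengthen the matching behaviors of reliable instance-sentence pairs and penalize unreliable ones. We also provide a new benchmark dataset, VID-sentence, built on the ImageNet video object detection dataset, and extensive experiments show that our model outperforms the baseline approaches.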

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video