In this paper, we show that recent advances in video representation learning
and pre-trained vision-language models allow for substantial improvements in
self-supervised video object localization. We propose a method that first
localizes objects in videos via a slot attention approach and then assigns text
to the obtained slots. The latter is achieved by an unsupervised way to read
localized semantic information from the pre-trained CLIP model. The resulting
video object localization is entirely unsupervised apart from the implicit
annotation contained in CLIP, and it is effectively the first unsupervised
approach that yields good results on regular video benchmarks.

通过在视频中定位对象的插槽注意力方法以及利用预训练的 CLIP 模型实现无监督视频对象定位，我们展示了近期视频表征学习和预训练视觉语言模型的重要进展，取得了显著的提升，并成为首个在常规视频基准数据集上具有良好结果的无监督方法。