Multi-modal Transformer for Video Retrieval

Jul, 2020

Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

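TL;DR: This paper proposes a video retrieval method built on a multi-modal transformer architecture that exploits the cross-modal cues present in video and also incorporates temporal information. It further investigates best practices for jointly optimizing the language embedding and the multi-modal transformer. The method achieves state-of-the-art video retrieval results on three datasets.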

Abstract

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval pr