Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.

本文提出一种基于多模态、模态无关的融合变压器方法，通过交换多个模态之间的信息并将其整合成一个联合的多模态表示，从而获得聚合多模态时态信息的嵌入，可用于零-shot检索和分类。我们在HowTo100M数据集上训练模型，并在四个具有挑战性的基准数据集上评估结果，取得了零-shot视频检索和零-shot视频行动定位的最新成果。

一次搞定——用于视频检索的多模态融合Transformer