We present a framework for learning multimodal representations from unlabeled
data using convolution-free Transformer architectures. Specifically, our
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts
multimodal representations that are rich enough to benefit a variety of
downstream tasks. We train VATT end-to-end from scratch using multimodal
contrastive losses and evaluate its performance on the downstream tasks of
video action recognition, audio event classification, image classification, and
text-to-video retrieval. Furthermore, we study a modality-agnostic,
single-backbone Transformer by sharing weights among the three modalities. We
show that the convolution-free VATT outperforms state-of-the-art ConvNet-based
architectures in the downstream tasks. In particular, VATT's vision Transformer
achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600,
72.7% on Kinetics-700, and 41.1% on Moments in Time, setting new records while
avoiding supervised pre-training. Transferring to image classification leads to 78.7%
top-1 accuracy on ImageNet compared to 64.7% by training the same Transformer
from scratch, showing the generalizability of our model despite the domain gap
between videos and images. VATT's audio Transformer also sets a new record on
waveform-based audio event recognition by achieving an mAP of 39.4% on
AudioSet without any supervised pre-training. VATT's source code is publicly
available.
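
The abstract names multimodal contrastive losses without spelling out their form. As a rough illustration only, the sketch below implements a symmetric InfoNCE-style loss between two batches of paired embeddings (for example, video and audio). The function names, the temperature of 0.07, the embedding size, and the toy data are all assumptions made for this example; they are not taken from VATT, whose actual objective may differ.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed stand-in, not VATT's exact loss).

    Row i of emb_a and row i of emb_b form a positive pair; every other
    row in the batch acts as a negative.
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    n = len(a)
    logits = [[dot(a[i], b[j]) / temperature for j in range(n)] for i in range(n)]
    # Direction a -> b: each row of the logit matrix is one softmax problem.
    loss_ab = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n)) / n
    # Direction b -> a: each column of the logit matrix is one softmax problem.
    loss_ba = sum(
        log_sum_exp([logits[i][j] for i in range(n)]) - logits[j][j]
        for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy data: "audio" embeddings are noisy copies of the "video" embeddings.
random.seed(0)
video = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
audio_aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in video]
audio_shuffled = audio_aligned[::-1]  # break the pairing

loss_aligned = contrastive_loss(video, audio_aligned)
loss_shuffled = contrastive_loss(video, audio_shuffled)
```

With correctly paired embeddings the loss is close to zero, and it grows sharply once the pairing is broken, which is the signal a contrastive objective uses to align modalities without labels.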

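We propose a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. By training the Video-Audio-Text Transformer (VATT) with multimodal contrastive losses, we extract rich multimodal representations from the three modalities and evaluate them on downstream tasks including video action recognition, audio event classification, image classification, and text-to-video retrieval. Without supervised pre-training, VATT's vision Transformer achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time; transferring VATT to image classification yields 78.7% top-1 accuracy on ImageNet. VATT's audio Transformer achieves an mAP of 39.4% on AudioSet, also without supervised pre-training, demonstrating the model's generalizability.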

VATT: A Transformer Model for Multimodal Self-Supervised Learning from Raw Video, Audio, and Text

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

We propose a new deep network for audio event recognition, called AENet. In
contrast to speech, sounds coming from audio events may be produced by a wide
variety of sources. Furthermore, distinguishing them often requires analyzing
an extended time period due to the lack of clear sub-word units that are
present in speech. In order to incorporate this long-time frequency structure
of audio events, we introduce a convolutional neural network (CNN) operating on
a large temporal input. In contrast to previous work, this allows us to train
an audio event detection system end-to-end. The combination of our network
architecture and a novel data augmentation outperforms previous methods for
audio event detection by 16%. Furthermore, we perform transfer learning and
show that our model learnt generic audio features, similar to the way CNNs
learn generic features on vision tasks. In video analysis, combining visual
features and traditional audio features such as MFCC typically only leads to
marginal improvements. Instead, combining visual features with our AENet
features, which can be computed efficiently on a GPU, leads to significant
performance improvements on action recognition and video highlight detection.
In video highlight detection, our audio features improve the performance by
more than 8% over visual features alone.

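We propose AENet, a new deep network for audio event recognition. It uses a convolutional neural network to model the long-time frequency structure of audio events along the temporal dimension, which allows an audio event detection system to be trained end-to-end. Combining AENet features with visual features yields significant gains on audio-visual tasks such as event recognition, action recognition, and video highlight detection.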

AENet: Learning Deep Audio Features for Video Analysis

AENet: Learning Deep Audio Features for Video Analysis

We present a simple yet efficient convolutional neural network
(CNN) architecture for robust audio event recognition. In contrast to deep CNN
architectures with multiple convolutional and pooling layers topped with
multiple fully connected layers, the proposed network consists of only three
layers: convolutional, pooling, and softmax layer. Two further features
distinguish it from the deep architectures that have been proposed for the
task: varying-size convolutional filters at the convolutional layer and a 1-max
pooling scheme at the pooling layer. Intuitively, the network tends to select
the most discriminative features from the whole audio signal for recognition.
Our proposed CNN not only shows state-of-the-art performance on the standard
task of robust audio event recognition but also outperforms other deep
architectures by up to 4.5% in terms of recognition accuracy, which is equivalent
to a 76.3% relative error reduction.
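
To make the two distinguishing features concrete, here is a minimal sketch of varying-size convolutional filters followed by 1-max pooling. The filter widths (3, 5, 7), the ReLU nonlinearity, the four frequency bins per frame, and the random weights are all assumptions chosen for illustration; the abstract does not give the actual settings, and the final softmax layer is omitted.

```python
import random


def conv_over_time(frames, kernel):
    """Valid 1-D convolution over time, followed by ReLU.

    frames: list of T feature vectors, one per time frame
    kernel: list of w weight vectors, so the filter spans w frames
    """
    width = len(kernel)
    responses = []
    for t in range(len(frames) - width + 1):
        total = 0.0
        for k in range(width):
            total += sum(x * w for x, w in zip(frames[t + k], kernel[k]))
        responses.append(max(0.0, total))  # ReLU (assumed nonlinearity)
    return responses


def one_max_pool(responses):
    """1-max pooling: keep only the strongest response over the whole signal."""
    return max(responses)


def extract_features(frames, kernels):
    """One pooled value per filter, regardless of how long the signal is."""
    return [one_max_pool(conv_over_time(frames, kernel)) for kernel in kernels]


random.seed(0)
NUM_BINS = 4               # hypothetical number of frequency bins per frame
FILTER_WIDTHS = [3, 5, 7]  # hypothetical varying filter sizes, in frames


def random_frames(num_frames):
    return [[random.gauss(0, 1) for _ in range(NUM_BINS)] for _ in range(num_frames)]


kernels = [random_frames(width) for width in FILTER_WIDTHS]
short_clip = random_frames(50)
long_clip = random_frames(80)

short_features = extract_features(short_clip, kernels)
long_features = extract_features(long_clip, kernels)
```

Because each filter contributes exactly one pooled value, clips of different lengths map to feature vectors of the same size, and each value reflects the single most salient match anywhere in the signal.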

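This work proposes a simple yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. With novel features such as varying-size convolutional filters and a 1-max pooling scheme, it not only achieves state-of-the-art performance on the standard robust audio event recognition task but also outperforms other deep network architectures by up to 4.5% in recognition accuracy, equivalent to a 76.3% relative error reduction.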
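
The absolute gain and the relative error reduction are linked by simple arithmetic. The sketch below works through it under one assumption that the text does not state explicitly: that both figures describe the same comparison against a single baseline. The variable names are illustrative.

```python
# Figures quoted in the text.
absolute_gain = 4.5         # accuracy improvement, in percentage points
relative_reduction = 0.763  # 76.3% relative error reduction

# relative_reduction = absolute_gain / baseline_error, so:
baseline_error = absolute_gain / relative_reduction  # implied baseline error (%)
new_error = baseline_error - absolute_gain           # implied error after the gain (%)

print(f"implied baseline error: {baseline_error:.1f}%")
print(f"implied new error:      {new_error:.1f}%")
```

Under that assumption, the baseline error would be roughly 5.9% and the proposed network's error roughly 1.4%, which shows how a small absolute gain can correspond to a large relative reduction when the baseline is already accurate.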