In this paper, we first extend the recent Masked Auto-Encoder (MAE) model
from a single modality to audio-visual multi-modalities. Subsequently, we
propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining
contrastive learning and masked data modeling, two major self-supervised
learning frameworks, to learn a joint and coordinated audio-visual
representation. Our experiments show that the contrastive audio-visual
correspondence learning objective not only enables the model to perform
audio-visual retrieval tasks, but also helps the model learn a better joint
representation. As a result, our fully self-supervised pretrained CAV-MAE
achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the
previous best supervised pretrained model on AudioSet in the audio-visual event
classification task. Code and pretrained models are at
this https URL

本文提出了 CAV-MAE 模型，它将 Masked Auto-Encoder (MAE) 模型从单模态扩展到音频 - 视觉多模态，并结合自监督学习框架中的对比学习和蒙版数据建模两种方法，学习联合和协调的音频 - 视觉表示，并在 VGGSound 数据集中取得了新的 SOTA 准确性，达到了 65.9%。