We present a lifelong audio-video masked autoencoder that continually learns
multimodal representations from a video stream of audio-video pairs whose
distribution shifts over time. Specifically, we
propose two novel ideas to tackle the problem: (1) Localized Alignment: We
introduce a small trainable multimodal encoder that predicts the audio and
video tokens that are well-aligned with each other. This allows the model to
learn only the highly correlated audiovisual patches with accurate multimodal
relationships. (2) Forget-robust Multimodal Patch Selection: We compare the
relative importance of each audio-video patch between current and past data
pairs to mitigate unintended drift of previously learned audio-video
representations. Our proposed method, FLAVA (Forget-robust Localized
Audio-Video Alignment), thus captures the complex relationships between the
audio and video modalities while training on a sequence of pre-training tasks
and alleviates the forgetting of learned audiovisual correlations. Our
experiments show that FLAVA outperforms state-of-the-art continual learning
methods on several benchmark datasets under continual audio-video
representation learning scenarios.
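
The two ideas above can be illustrated with a minimal sketch. It is not the FLAVA implementation: the abstract does not specify how alignment or importance is computed, so the sketch assumes cosine similarity as a stand-in for the trainable multimodal encoder and a simple "current score minus weighted past score" rule for forget-robust selection. The names `alignment_scores`, `select_patches`, `keep_ratio`, and `penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_scores(video_tokens, audio_tokens):
    """Score each video patch by its best cosine match among the audio tokens.

    Assumption: this replaces the small trainable multimodal encoder with a
    fixed similarity measure, purely for illustration.
    """
    return [max(cosine(v, a) for a in audio_tokens) for v in video_tokens]


def select_patches(current_scores, past_scores, keep_ratio=0.5, penalty=1.0):
    """Keep the patches that matter for current data but little for past data.

    Assumption: the relative importance is modeled as
    current_score - penalty * past_score; the top keep_ratio fraction is kept.
    Returns the kept patch indices in ascending order.
    """
    adjusted = [c - penalty * p for c, p in zip(current_scores, past_scores)]
    n_keep = max(1, math.ceil(keep_ratio * len(adjusted)))
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy 2-dimensional tokens; real tokens would come from the encoders.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    video = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
    current = alignment_scores(video, audio)
    past = [0.2, 0.9, 0.0]  # hypothetical importance of each patch for past data
    print(select_patches(current, past, keep_ratio=0.5))
```

Under these assumptions, a patch that is strongly aligned in the current pair but also important for past data is ranked lower, so training on it is less likely to overwrite earlier audiovisual correlations.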
