There is a natural correlation between the visual and auditory elements of a
video. In this work we leverage this connection to learn general and effective
models for both audio and video analysis from self-supervised temporal
synchronization. We demonstrate that a calibrated curriculum learning scheme, a
careful choice of negative examples, and the use of a contrastive loss are
critical ingredients to obtain powerful multi-sensory representations from
models optimized to discern temporal synchronization of audio-video pairs.
Without further finetuning, the resulting audio features achieve performance
superior or comparable to the state-of-the-art on established audio
classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual
subnet provides a very effective initialization to improve the accuracy of
video-based action recognition models: compared to learning from scratch, our
self-supervised pretraining yields a remarkable gain of +19.9% in action
recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
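
As an illustration of the contrastive objective mentioned above, the following is a minimal sketch assuming the standard margin-based formulation: in-sync audio-video pairs are pulled together in embedding space, while out-of-sync pairs are pushed apart up to a margin. The abstract does not state the exact formulation, so the function names, the `margin` parameter, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for a single audio-video pair.

    Hypothetical helper: the names and the default margin are assumptions,
    not taken from the paper.
    """
    # Euclidean distance between the video and audio embeddings.
    dist = math.sqrt(sum((v - a) ** 2 for v, a in zip(video_emb, audio_emb)))
    if in_sync:
        # Positive pair: penalize any distance between the embeddings.
        return dist ** 2
    # Negative pair: penalize only when the embeddings are closer than the margin.
    return max(margin - dist, 0.0) ** 2


def batch_loss(pairs, margin=1.0):
    """Average the loss over (video_emb, audio_emb, in_sync) triples."""
    return sum(contrastive_loss(v, a, y, margin) for v, a, y in pairs) / len(pairs)
```

In this sketch, a negative pair would be an audio clip that is not synchronized with the video clip it is paired with. How such negatives are chosen, and in what order they are presented during training, is what the abstract refers to as the choice of negative examples and the curriculum; neither is modeled here.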

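This study learns models for audio and video analysis through self-supervised temporal synchronization. Without fine-tuning, the model can effectively identify temporally synchronized audio-video pairs, and it provides a very effective initialization for improving the accuracy of video-based action recognition models.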