The natural correspondence between the audio and visual modalities is valuable
for cross-modal self-supervised learning. This has been demonstrated
on generic audiovisual tasks such as video action recognition and acoustic scene
classification. However, self-supervision remains un