We propose a self-supervised learning approach for videos that learns
representations of both the RGB frames and the accompanying audio without human
supervision. In contrast to images that capture the static scene appearance,
videos also contain sound and temporal scene dynamics. To l