In this paper, we consider the problem of audio-visual synchronisation
applied to videos `in-the-wild' (i.e., of general classes beyond speech). As a new
task, we identify and curate a test set with high audio-visual correlation,
namely VGG-Sound Sync. We compare a number of transformer-b