In this paper we introduce a new synchronisation task, Gesture-Sync:
determining whether a person's gestures are correlated with their speech.
Compared with Lip-Sync, Gesture-Sync is far more challenging, as the
relationship between voice and body movement is much looser than that between
voice and lip motion. We introduce a dual-encoder model for this task, and
compare a number of input representations including RGB frames, keypoint
images, and keypoint vectors, assessing their performance and advantages. We
show that the model can be trained using self-supervised learning alone, and
evaluate its performance on the LRS3 dataset. Finally, we demonstrate
applications of Gesture-Sync to audio-visual synchronisation and to
determining who is speaking in a crowd without seeing their faces. The
code, datasets and pre-trained models can be found at:
https://www.robots.ox.ac.uk/~vgg/research/gestsync.
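
As a rough illustration of how a synchronisation model can be trained without labels, the sketch below samples one in-sync and one out-of-sync (video, audio) window pair from a single clip. It is a minimal sketch under stated assumptions, not the paper's training procedure: the function name `make_sync_pairs`, the frame-index representation of windows, and the `min_shift` parameter are all hypothetical.

```python
import random


def make_sync_pairs(num_frames, window, min_shift, rng):
    """Sample (video_start, audio_start, label) triples from one clip.

    label 1: audio window aligned with the video window (in sync).
    label 0: audio window taken from the same clip but shifted by at
             least `min_shift` frames (out of sync).
    All names and parameters here are illustrative assumptions.
    """
    if min_shift < 1:
        raise ValueError("min_shift must be at least 1")
    # This bound guarantees a valid shifted window exists for any video start.
    if num_frames < window + 2 * min_shift:
        raise ValueError("clip too short for the requested window and shift")

    max_start = num_frames - window
    video_start = rng.randrange(0, max_start + 1)

    # Every audio start that is far enough from the video start.
    candidates = [
        start
        for start in range(0, max_start + 1)
        if abs(start - video_start) >= min_shift
    ]
    shifted_start = rng.choice(candidates)

    return [
        (video_start, video_start, 1),    # positive pair
        (video_start, shifted_start, 0),  # negative pair
    ]


if __name__ == "__main__":
    pairs = make_sync_pairs(num_frames=100, window=25, min_shift=5,
                            rng=random.Random(0))
    for video_start, audio_start, label in pairs:
        print(video_start, audio_start, label)
```

The labels come for free from the known timing of the clip, which is what makes the task self-supervised. A dual-encoder model would then embed each video and audio window and be trained to score the positive pair higher than the negative one.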

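This paper introduces a new synchronisation task, Gesture-Sync: determining whether a person's gestures are correlated with their speech. We introduce a dual-encoder model for this task and compare several input representations, including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained with self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync to audio-visual synchronisation and to identifying the speaker in a crowd. The code, datasets and pre-trained models can be found at: https://www.robots.ox.ac.uk/~vgg/research/gestsync.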

GestSync: Identifying the speaker without a talking head

GestSync: Determining who is speaking without a talking head

The objective of this paper is audio-visual synchronisation of general videos
'in the wild'. In such videos, the events that provide synchronisation cues
may be spatially small and may occur only infrequently during a video clip
many seconds long, i.e. the synchronisation signal is
'sparse in space and time'. This contrasts with the case of synchronising
videos of talking heads, where audio-visual correspondence is dense in both
time and space.
We make four contributions: (i) to handle the longer temporal sequences
required for sparse synchronisation signals, we design a multi-modal
transformer model that employs 'selectors' to distil the long audio and visual
streams into small sequences that are then used to predict the temporal offset
between streams; (ii) we identify artefacts that arise from the compression
codecs used for audio and video and that audio-visual models can exploit
during training to solve the synchronisation task artificially; (iii) we
curate a dataset whose synchronisation signals are sparse in both time and
space; and (iv) we demonstrate the effectiveness of the proposed model on both
dense and sparse datasets, quantitatively and qualitatively.
Project page: v-iashin.github.io/SparseSync
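
To make the idea of distilling a long stream into a short sequence concrete, the sketch below cross-attends a small set of query vectors over a long token sequence and returns one summary vector per query. This is only one plausible reading of 'selectors', written as an assumption: the abstract does not specify their internals, the names `softmax` and `select` are hypothetical, and learned projections and multiple heads are omitted so that keys and values are the tokens themselves.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select(queries, tokens):
    """Summarise N tokens into K vectors using scaled dot-product attention.

    queries: K vectors of size d (would be trainable in a real model).
    tokens:  N vectors of size d (a long audio or visual feature stream).
    Returns K vectors of size d, each a weighted average of the tokens.
    Simplified assumption: no key/value/query projection matrices.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    summaries = []
    for query in queries:
        # Similarity of this query to every token in the stream.
        scores = [
            scale * sum(q * t for q, t in zip(query, token))
            for token in tokens
        ]
        weights = softmax(scores)
        # Weighted average of the tokens, one value per feature dimension.
        summaries.append([
            sum(w * token[j] for w, token in zip(weights, tokens))
            for j in range(dim)
        ])
    return summaries


if __name__ == "__main__":
    stream = [[float(i), 1.0] for i in range(6)]  # N = 6 tokens, d = 2
    queries = [[0.0, 0.0], [1.0, 0.0]]            # K = 2 selectors
    for summary in select(queries, stream):
        print(summary)
```

The output length depends only on the number of queries, not on the length of the stream, so a downstream module that predicts the temporal offset would see a fixed, small number of vectors per modality regardless of clip duration.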

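This work studies audio-visual synchronisation of videos with a multi-modal transformer model, proposing selectors that distil long audio and visual streams into short sequences, which are then used to predict the temporal offset between the two streams. By curating a dataset and addressing the artefacts introduced by compression codecs, it verifies the effectiveness of the method on both sparse and dense synchronisation datasets.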

Trainable selectors: audio-visual synchronisation that is sparse in space and time

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

In this paper, we consider the problem of audio-visual synchronisation
applied to videos 'in the wild' (i.e. of general classes beyond speech). As a new
task, we identify and curate a test set with high audio-visual correlation,
namely VGG-Sound Sync. We compare a number of transformer-based architectural
variants specifically designed to model audio and visual signals of arbitrary
length, while significantly reducing memory requirements during training. We
further conduct an in-depth analysis on the curated dataset and define an
evaluation metric for open-domain audio-visual synchronisation. We apply our
method to standard lip-reading speech benchmarks, LRS2 and LRS3, with ablations
on various aspects. Finally, we set the first benchmark for general
audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound
Sync video dataset. In all cases, our proposed model outperforms the previous
state-of-the-art by a significant margin.

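This work proposes transformer-based architectures and an evaluation metric for audio-visual synchronisation across diverse classes, and tests them on the new VGG-Sound Sync dataset. The results show that our model outperforms the previous state of the art.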