In this paper we introduce a new synchronisation task, Gesture-Sync: determining whether a person's gestures are correlated with their speech. Compared with Lip-Sync, Gesture-Sync is far more challenging, as the relationship between voice and body movement is much looser than that between voice and lip motion. We introduce a dual-encoder model for this task and compare a number of input representations, including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync to audio-visual synchronisation and to determining who is speaking in a crowd without seeing their faces. The code, datasets and pre-trained models can be found at: https://www.robots.ox.ac.uk/~vgg/research/gestsync.

GestSync: Determining who is speaking without a talking head