The intuitive interaction between the audio and visual modalities is valuable
for cross-modal self-supervised learning. This concept has been demonstrated
for generic audiovisual tasks like video action recognition and acoustic scene
classification. However, self-supervision remains under-explored for
audiovisual speech. We propose a method to learn self-supervised speech
representations from the raw audio waveform. We train a raw audio encoder by
combining audio-only self-supervision (by predicting informative audio
attributes) with visual self-supervision (by generating talking faces from
audio). The visual pretext task drives the audio representations to capture
information related to lip movements. This enriches the audio encoder with
visual information, and the encoder can then be used at evaluation time without
the visual modality. Our method attains competitive performance with respect to
existing self-supervised audio features on established isolated word
classification benchmarks, and significantly outperforms other methods at
learning from fewer labels. Notably, our method also outperforms fully
supervised training, thus providing a strong initialization for speech-related
tasks. Our results demonstrate the potential of multimodal self-supervision in
audiovisual speech for learning good audio representations.

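Summary: This work proposes a method for learning self-supervised speech representations in which a raw audio encoder is trained by combining audio-only self-supervision with visual self-supervision (generating talking faces from audio), demonstrating the potential of self-supervised learning for audiovisual speech.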