In this paper, we present a multimodal \textit{and} dynamical VAE (MDVAE)
applied to unsupervised audiovisual speech representation learning. The latent
space is structured to dissociate the latent dynamical factors that are shared
between the modalities from those that are specific to each modality. A static
latent variable is also introduced to encode the information that is constant
over time within an audiovisual speech sequence. The model is trained in an
unsupervised manner on an audiovisual emotional speech dataset, in two stages.
In the first stage, a vector-quantized VAE (VQ-VAE) is trained independently
for each modality, without temporal modeling. The second stage consists of
training the MDVAE model on the intermediate representations of the VQ-VAEs
before quantization. The disentanglement of static versus dynamical and
modality-specific versus modality-common information occurs during this second
training stage. Extensive experiments are conducted to investigate how
audiovisual speech latent factors are encoded in the latent space of MDVAE.
These experiments include audiovisual speech manipulation, audiovisual facial
image denoising, and audiovisual speech emotion recognition. The results show
that MDVAE effectively combines the audio and visual information in its latent
space. They also show that the learned static representation of audiovisual
speech can be used for emotion recognition with little labeled data, achieving
better accuracy than unimodal baselines and a state-of-the-art
supervised model based on an audiovisual transformer architecture.

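Summary: This paper presents a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to separate the dynamical latent factors shared between the two modalities from the dynamical and static information specific to each modality. The model is trained in two stages, and experiments on an audiovisual dataset show that it performs well at unsupervised learning of audiovisual information.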