Audiovisual representation learning typically relies on the correspondence
between sight and sound. However, there are often multiple audio tracks that
can correspond with a visual scene. Consider, for example, different
conversations on the same crowded street. The effect of such counterfactual
pairs on audiovisual representation learning has not been previously explored.
To investigate this, we use dubbed versions of movies to augment cross-modal
contrastive learning. Our approach learns to represent alternate audio tracks,
differing only in speech content, similarly to the same video. Our results show
that dub-augmented training improves performance on a range of auditory and
audiovisual tasks, without significantly affecting linguistic task performance
overall. We additionally compare this approach to a strong baseline where we
remove speech before pretraining, and find that dub-augmented training is more
effective, including for paralinguistic and audiovisual tasks where speech
removal leads to worse performance. These findings highlight the importance of
considering speech variation when learning scene-level audiovisual
correspondences and suggest that dubbed audio can be a useful augmentation
technique for training audiovisual models toward more robust performance.
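
As a rough illustration of the idea, the sketch below treats the original and the dubbed audio track of a clip as two positives for the same video inside a symmetric InfoNCE-style contrastive loss. The abstract does not specify the loss, encoders, or weighting, so everything here is an assumption made for illustration: the function names (`info_nce`, `dub_augmented_loss`), the temperature of 0.1, and the equal weighting of the two terms are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(video, audio, temperature=0.1):
    """Symmetric InfoNCE: row i of `video` is the positive for row i of `audio`."""
    v = l2_normalize(video)
    a = l2_normalize(audio)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def dub_augmented_loss(video, audio_original, audio_dubbed, temperature=0.1):
    """Hypothetical dub-augmented objective: both tracks are pulled toward the video.

    Equal weighting of the two terms is an assumption, not taken from the paper.
    """
    return 0.5 * (info_nce(video, audio_original, temperature)
                  + info_nce(video, audio_dubbed, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))  # stand-in for video encoder outputs
    audio_en = video + 0.01 * rng.normal(size=(8, 16))   # original-language track
    audio_dub = video + 0.01 * rng.normal(size=(8, 16))  # dubbed track
    print(dub_augmented_loss(video, audio_en, audio_dub))
```

In this toy setup the embeddings are random vectors rather than encoder outputs; the point is only that two audio tracks differing in speech content share one video positive.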

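This work studies how audiovisual representation learning is affected when multiple audio tracks can correspond to the same visual scene, explores using dubbed versions of movies to augment cross-modal contrastive learning, highlights the importance of considering speech variation when learning scene-level audiovisual correspondences, and shows that dubbed audio can serve as a useful augmentation technique for training audiovisual models.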

Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

In this paper we propose Flowtron: an autoregressive flow-based generative
network for text-to-speech synthesis with control over speech variation and
style transfer. Flowtron borrows insights from inverse autoregressive flows
(IAF) and revamps Tacotron in order to provide high-quality and expressive
mel-spectrogram synthesis.
Flowtron is optimized by maximizing the likelihood of the training data, which
makes training simple and stable. Flowtron learns an invertible mapping of data
to a latent space that can be manipulated to control many aspects of speech
synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores
(MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech
quality. In addition, we provide results on control of speech variation,
interpolation between samples and style transfer between speakers seen and
unseen during training. Code and pre-trained models will be made publicly
available.
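
To make the likelihood-based training and the invertible mapping concrete, here is a minimal sketch of a scalar affine autoregressive flow. It is not Flowtron itself: the real model operates on mel-spectrogram frames and conditions on text and speaker through neural networks, whereas the `conditioner` below is a hypothetical fixed function of the previous sample, and the parameters `a` and `b` are illustrative assumptions. The sketch only shows how the change-of-variables formula yields an exact log-likelihood and how the same transform can be inverted to generate data from a latent.

```python
import numpy as np


def conditioner(prev, a=0.5, b=0.3):
    """Toy stand-in for a network: shift and log-scale from the previous sample."""
    mu = a * prev
    log_sigma = b * np.tanh(prev)
    return mu, log_sigma


def forward(x, a=0.5, b=0.3):
    """Map data x (shape (T,)) to latent z and return the exact log-likelihood.

    log p(x) = log N(z; 0, I) + log|det dz/dx|, where dz_t/dx_t = exp(-log_sigma_t).
    """
    z = np.empty_like(x)
    log_det = 0.0
    prev = 0.0  # assumed start-of-sequence value
    for t in range(len(x)):
        mu, log_sigma = conditioner(prev, a, b)
        z[t] = (x[t] - mu) * np.exp(-log_sigma)
        log_det -= log_sigma
        prev = x[t]
    log_prior = -0.5 * np.sum(z ** 2) - 0.5 * len(x) * np.log(2.0 * np.pi)
    return z, log_prior + log_det


def inverse(z, a=0.5, b=0.3):
    """Generate data from a latent z by running the flow in reverse, step by step."""
    x = np.empty_like(z)
    prev = 0.0
    for t in range(len(z)):
        mu, log_sigma = conditioner(prev, a, b)
        x[t] = z[t] * np.exp(log_sigma) + mu
        prev = x[t]
    return x


if __name__ == "__main__":
    x = np.array([0.2, -1.0, 0.7, 1.5])
    z, log_likelihood = forward(x)
    print(log_likelihood, np.allclose(inverse(z), x))
```

Training would maximize the returned log-likelihood with respect to the conditioner's parameters. Because the mapping is invertible, editing or interpolating `z` before calling `inverse` is one way a latent space of this kind can be manipulated.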

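This paper proposes Flowtron, an autoregressive flow-based generative network for speech synthesis that offers control over speech variation and style transfer. Flowtron is optimized by maximizing the likelihood of the training data and learns a mapping from data to a latent space that can be manipulated to control many aspects of speech synthesis. In comparison with existing models, Flowtron matches state-of-the-art TTS models in speech quality.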