Generative models for multimodal data permit the identification of latent
factors that may be associated with important determinants of observed data
heterogeneity. Common or shared factors could be important for explaining
variation across modalities whereas other factors may be private and important
only for the explanation of a single modality. Multimodal Variational
Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those
underlying latent factors and separating shared variation from private. In this
work, we investigate their capability to reliably perform this disentanglement.
In particular, we highlight a challenging problem setting where
modality-specific variation dominates the shared signal. Taking a cross-modal
prediction perspective, we demonstrate limitations of existing models and
propose a modification that makes them more robust to modality-specific
variation. Our findings are supported by experiments on synthetic as well as
various real-world multi-omics data sets.
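
As a minimal sketch of the problem setting described above (not the paper's model or data), the following Python snippet simulates two modalities from one shared and two private Gaussian latent factors, then scores cross-modal prediction with ordinary least squares. The linear generative model, the function names, and the `private_scale` and `noise_std` parameters are assumptions made purely for illustration.

```python
import numpy as np


def make_two_modalities(n=2000, d_shared=2, d_private=4, d_obs=10,
                        private_scale=1.0, noise_std=0.1, seed=0):
    """Simulate two views that share one latent factor and each have a private one.

    Hypothetical linear model: x_m = z_shared W_m + private_scale * z_m V_m + noise.
    """
    rng = np.random.default_rng(seed)
    z_shared = rng.standard_normal((n, d_shared))
    views = []
    for _ in range(2):
        z_private = rng.standard_normal((n, d_private))       # modality-specific factor
        w_shared = rng.standard_normal((d_shared, d_obs))     # loadings of the shared factor
        w_private = rng.standard_normal((d_private, d_obs))   # loadings of the private factor
        noise = noise_std * rng.standard_normal((n, d_obs))
        views.append(z_shared @ w_shared
                     + private_scale * (z_private @ w_private)
                     + noise)
    return views[0], views[1]


def cross_modal_r2(x1, x2):
    """Fit least squares x1 -> x2 on the first half, return R^2 on the second half."""
    half = len(x1) // 2
    # No intercept term: the simulated data are zero-mean by construction.
    coef, *_ = np.linalg.lstsq(x1[:half], x2[:half], rcond=None)
    pred = x1[half:] @ coef
    resid = ((x2[half:] - pred) ** 2).sum()
    total = ((x2[half:] - x2[half:].mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)


if __name__ == "__main__":
    for scale in (0.1, 1.0, 5.0):
        x1, x2 = make_two_modalities(private_scale=scale)
        print(f"private_scale={scale}: cross-modal R^2 = {cross_modal_r2(x1, x2):.3f}")
```

Comparing `cross_modal_r2` at a small and a large `private_scale` gives a rough feel for how dominant private variation limits what one modality can reveal about the other, since only the shared factor is predictable across modalities.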

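Generative models for multimodal data can identify latent factors associated with important determinants of heterogeneity in the observed data. Some factors are private to a single modality, whereas shared factors are important for explaining variation across modalities. This work examines how reliably multimodal Variational Autoencoders achieve this disentanglement in a challenging setting where modality-specific variation dominates, and proposes a modification that makes them more robust to such variation. The findings are supported by experiments on synthetic data and several real-world multi-omics data sets.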

Disentangling shared and private latent factors in multimodal Variational Autoencoders

Visual and audio modalities are highly correlated, yet they contain different
information. Their strong correlation makes it possible to predict the
semantics of one from the other with good accuracy. Their intrinsic differences
make cross-modal prediction a potentially more rewarding pretext task for
self-supervised learning of video and audio representations compared to
within-modality learning. Based on this intuition, we propose Cross-Modal Deep
Clustering (XDC), a novel self-supervised method that leverages unsupervised
clustering in one modality (e.g., audio) as a supervisory signal for the other
modality (e.g., video). This cross-modal supervision helps XDC utilize the
semantic correlation and the differences between the two modalities. Our
experiments show that XDC outperforms single-modality clustering and other
multi-modal variants. XDC achieves state-of-the-art accuracy among
self-supervised methods on multiple video and audio benchmarks. Most
importantly, our video model pretrained on large-scale unlabeled data
significantly outperforms the same model pretrained with full supervision on
ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best
of our knowledge, XDC is the first self-supervised learning method that
outperforms large-scale fully supervised pretraining for action recognition on
the same architecture.
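
The core idea of using clusters from one modality as labels for the other can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the XDC implementation: it uses fixed toy feature vectors instead of learned deep encoders, plain k-means, a linear softmax classifier, and a single round in one direction (audio clusters supervising video). All function names and the toy data are hypothetical.

```python
import numpy as np


def toy_paired_features(n=600, k=3, seed=0):
    """Toy paired audio/video features driven by a hidden semantic class (assumption)."""
    rng = np.random.default_rng(seed)
    classes = rng.integers(0, k, size=n)
    audio_means = 3.0 * rng.standard_normal((k, 8))
    video_means = 3.0 * rng.standard_normal((k, 16))
    audio = audio_means[classes] + rng.standard_normal((n, 8))
    video = video_means[classes] + rng.standard_normal((n, 16))
    return audio, video


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the assignments from the last iteration."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def train_softmax(x, labels, k, lr=0.1, n_steps=300):
    """Linear softmax classifier trained with full-batch gradient descent."""
    n, d = x.shape
    w, b = np.zeros((d, k)), np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(n_steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # gradient of mean cross-entropy
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


def cross_modal_agreement(seed=0, k=3):
    """Cluster audio, train a video classifier on the cluster ids, score on held-out pairs."""
    audio, video = toy_paired_features(k=k, seed=seed)
    video = (video - video.mean(axis=0)) / video.std(axis=0)   # standardize features
    pseudo_labels = kmeans(audio, k, seed=seed)                # audio-side supervision
    half = len(video) // 2
    w, b = train_softmax(video[:half], pseudo_labels[:half], k)
    pred = (video[half:] @ w + b).argmax(axis=1)
    return float((pred == pseudo_labels[half:]).mean())


if __name__ == "__main__":
    print(f"held-out agreement with audio clusters: {cross_modal_agreement():.3f}")
```

The video classifier never sees a human label: its only training target is the audio cluster id of the paired clip, so any agreement on held-out pairs comes from the semantic correlation between the two modalities.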

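Implements a method based on cross-modal prediction, self-supervised learning, and deep clustering that uses unsupervised clustering in one modality as the supervisory signal for the other, exploiting both the semantic correlation and the differences between vision and audio. The resulting pretrained models outperform other methods on multiple video and audio data sets; in particular, a video model pretrained only on large-scale unlabeled data achieves significantly higher action recognition accuracy on HMDB51 and UCF101 than the same architecture pretrained with full supervision on ImageNet and Kinetics.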