Multimodal learning seeks to combine data from multiple input sources to improve performance on downstream tasks. In real-world scenarios, performance can degrade substantially when some input modalities are missing. Existing methods that handle missing modalities require custom training or adaptation steps for each combination of input modalities. These approaches are either tied to specific modalities or become computationally expensive as the number of input modalities increases. In this paper, we propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario. We achieve this by randomly masking a subset of modalities during training and learning to project the available input modalities to estimate the tokens of the masked modalities. This approach enables the model to leverage information from the available modalities to compensate for the missing ones. We conduct experiments with several baseline models and datasets to assess the effectiveness of this strategy. The experiments demonstrate that our approach improves robustness across missing modality scenarios, outperforming existing methods designed for missing modalities or for specific modality combinations.

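This work addresses the performance degradation caused by missing modalities in multimodal learning and proposes a new method, Masked Modality Projection (MMP), which aims to train a single model that is robust to any missing modality scenario. By randomly masking a subset of modalities during training, the method learns to use information from the available modalities to compensate for the missing ones, significantly improving the model's robustness across different missing modality scenarios.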
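The two steps described above, random modality masking and projection from the available modalities, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the dictionary layout, and the scalar `weights` are hypothetical: in MMP the projection is learned during training, whereas here a fixed weighted sum of the available token sequences stands in for it. The sketch also assumes every modality has the same number of tokens and the same token dimension.

```python
import random


def sample_masked_modalities(modalities, rng):
    """Pick a random subset of modalities to mask, leaving at least one available."""
    # randint is inclusive on both ends, so at most len(modalities) - 1 are masked.
    num_masked = rng.randint(0, len(modalities) - 1)
    return set(rng.sample(modalities, num_masked))


def project_tokens(tokens, masked, weights):
    """Replace the tokens of masked modalities with estimates from available ones.

    tokens:  dict mapping modality name -> list of token vectors
             (assumed here to share one shape across modalities)
    masked:  set of modality names treated as missing
    weights: dict mapping (target, source) -> scalar; a hypothetical stand-in
             for the learned projection parameters
    """
    available = [m for m in tokens if m not in masked]
    if not available:
        raise ValueError("at least one modality must be available")

    # Available modalities pass through unchanged.
    out = {m: tokens[m] for m in available}

    for target in masked:
        reference = tokens[available[0]]
        estimate = [[0.0] * len(vec) for vec in reference]
        # Weighted sum over the token sequences of the available modalities.
        for source in available:
            w = weights[(target, source)]
            for i, vec in enumerate(tokens[source]):
                for j, x in enumerate(vec):
                    estimate[i][j] += w * x
        out[target] = estimate
    return out


# Toy example: three modalities, one 2-dimensional token each.
tokens = {"rgb": [[1.0, 2.0]], "depth": [[3.0, 4.0]], "audio": [[5.0, 6.0]]}
weights = {(t, s): 0.5 for t in tokens for s in tokens if t != s}

rng = random.Random(0)
masked = sample_masked_modalities(list(tokens), rng)
estimated = project_tokens(tokens, masked, weights)
```

In a training loop, a new mask would be drawn for each batch, so a single model sees many missing modality patterns rather than being trained separately for each combination.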

Masked Modality Projection for Robust Multimodal Learning