Open-set domain generalization (OSDG) requires recognizing novel classes in unseen domains, a task that becomes more challenging when the input spans multiple modalities. Existing works have addressed only unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, using self-supervision. To this end, we introduce two multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of representative multimodal features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, in which unlabeled data from the target domain is available. Extensive experiments under the MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at https://github.com/donghao51/MOOSA.

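This study proposes a new approach that uses self-supervision to address Multimodal Open-Set Domain Generalization (MM-OSDG), introducing two novel multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks help the model learn representative multimodal features and improve generalization and open-class detection, and a novel entropy weighting mechanism is proposed to balance the loss across modalities. The approach is also extended to the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem. Experiments demonstrate the effectiveness and versatility of the approach on multiple datasets.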

A Self-Supervised Approach to Multimodal Open-Set Domain Generalization and Adaptation