In this paper, we present an analytical framework and a novel metric to shed
light on interpretability for the multi-modal vision community. Our approach
involves measuring the proposed semantic variance and feature similarity across
modalities and levels, and conducting semantic and quantitative analyses
through comprehensive experiments. Specifically, we investigate the consistency
and speciality of representations across modalities, evolution rules within
each modality, and the collaboration logic used when optimizing a
multi-modality model. Our studies reveal several important findings, such as
the discrepancy in cross-modal features and the hybrid multi-modal cooperation
rule, which highlights consistency and speciality simultaneously for
complementary inference. Our dissection of multi-modal fusion and the resulting
findings prompt a rethinking of whether popular multi-modal vision fusion
strategies are reasonable and necessary. Furthermore, our work lays the
foundation for designing a trustworthy and universal multi-modal fusion model
for a variety of tasks in the future.
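
As a rough illustration of what measuring feature similarity across modalities
can look like, the sketch below averages the cosine similarity of paired feature
vectors from two modalities. The abstract does not say which similarity measure
the paper uses, so cosine similarity is an assumption here, and the function
names and toy feature values are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: all-zero features score 0
    return dot / (norm_u * norm_v)


def mean_cross_modal_similarity(feats_a, feats_b):
    """Average cosine similarity over paired samples from two modalities."""
    if len(feats_a) != len(feats_b) or not feats_a:
        raise ValueError("need the same non-zero number of samples per modality")
    sims = [cosine_similarity(u, v) for u, v in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)


# Toy, made-up features for two paired samples (not data from the paper).
modality_a_feats = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
modality_b_feats = [[2.0, 0.0, 4.0], [0.0, 1.0, 1.0]]
score = mean_cross_modal_similarity(modality_a_feats, modality_b_feats)
```

A score near 1 would indicate that the two modalities encode paired samples in
similar directions (consistency), while a lower score would point to
modality-specific information (speciality).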

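Summary: Using an analytical framework and a new metric, we study interpretability for the multi-modal vision community. Through experiments, we investigate the consistency and speciality of representations across modalities, the evolution rules within each modality, and the collaboration logic used when optimizing a multi-modal model. The resulting findings prompt a rethinking of whether popular multi-modal vision fusion strategies are reasonable and necessary, and lay the foundation for designing a trustworthy and universal multi-modal fusion model in the future.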

Interpretation on Multi-modal Visual Fusion

Recent advances in multimodal vision and language modeling have predominantly
focused on the English language, mostly due to the lack of multilingual
multimodal datasets to steer modeling efforts. In this work, we address this
gap and provide xGQA, a new multilingual evaluation benchmark for the visual
question answering task. We extend the established English GQA dataset to 7
typologically diverse languages, enabling us to detect and explore crucial
challenges in cross-lingual visual question answering. We further propose new
adapter-based approaches to adapt multimodal transformer-based models to become
multilingual, and -- vice versa -- multilingual models to become multimodal.
Our proposed methods outperform current state-of-the-art multilingual
multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the
accuracy remains low across the board; a performance drop of around 38 accuracy
points in target languages showcases the difficulty of zero-shot cross-lingual
transfer for this task. Our results suggest that simple cross-lingual transfer
of multimodal models yields latent multilingual multimodal misalignment,
calling for more sophisticated methods for vision and multilingual language
modeling.
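
To make the idea of an adapter concrete, here is a minimal sketch of a generic
residual bottleneck adapter: a small down-projection, a nonlinearity, an
up-projection, and a skip connection around the whole block. The abstract does
not describe the architecture of the proposed adapters, so this generic form,
the ReLU nonlinearity, and all names and weights below are assumptions for
illustration only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vector]


def bottleneck_adapter(hidden, w_down, w_up):
    """Return hidden + W_up * relu(W_down * hidden).

    Only w_down and w_up would be trained; the surrounding model stays frozen.
    """
    bottleneck = relu(matvec(w_down, hidden))
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


# Hypothetical 4-dimensional hidden state with a 2-dimensional bottleneck.
hidden = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.5, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0]]  # shape 2 x 4
w_up_zero = [[0.0, 0.0] for _ in hidden]                # shape 4 x 2, all zeros
w_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # shape 4 x 2

unchanged = bottleneck_adapter(hidden, w_down, w_up_zero)
adapted = bottleneck_adapter(hidden, w_down, w_up)
```

With a zero up-projection the adapter is the identity, so inserting it leaves
the pretrained model's behavior unchanged until the adapter weights are trained.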

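Summary: This paper introduces xGQA, a new multilingual evaluation benchmark for cross-lingual visual question answering, and uses adapter-based methods to extend multimodal transformer models into multilingual ones. The results show that simple cross-lingual transfer of multimodal models leads to multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling.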