Converting different modalities into general text that serves as an input prompt for large language models (LLMs) is a common way to align multimodal models when paired data is limited. This text-centric approach exploits the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation that downstream models can interpret effectively. This study assesses the quality and robustness of such multimodal representations in the presence of missing entries, noise, or absent modalities, and reveals that current text-centric alignment methods compromise downstream robustness. To address this issue, we propose a new text-centric approach that is more robust than previous methods across various modalities and settings. Our findings highlight the potential of this approach to enhance the robustness and adaptability of multimodal representations, offering a promising solution for dynamic, real-world applications.

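This work converts different modalities into general text that serves as input prompts for large language models (LLMs), addressing the alignment of multimodal models when paired data is scarce. It evaluates the quality and robustness of current text-centric alignment methods under missing data, noise, or absent modalities, and proposes a new text-centric approach with strong robustness and adaptability, offering a promising solution for dynamic, real-world applications.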

Enhancing the Robustness of Text-Centric Multimodal Alignment