Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models. While different embeddings exhibit different applicability and performance on downstream tasks, little is known about the systematic representation differences attributed to the visual modality. Our paper compares word embeddings from three vision-and-language models (CLIP, OpenCLIP and Multilingual CLIP) and three text-only models, with static (FastText) as well as contextual representations (multilingual BERT; XLM-RoBERTa). This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters. We identify meaning properties and relations that characterize words whose embeddings are most affected by the inclusion of visual modality in the training data; that is, points where visual grounding turns out most important. We find that the effect of visual modality correlates most with denotational semantic properties related to concreteness, but is also detected for several specific semantic classes, as well as for valence, a sentiment-related connotational property of linguistic expressions.

本研究比较了来自不同模型的单词嵌入，在46个语义参数中识别出最受视觉语境影响的单词，并发现视觉模态效应与涉及具体性的指示语义属性最相关，但也适用于多个特定的语义类别，以及与情感相关的联想语言表达的倾向属性。

模态转移中的杠杆点：比较仅语言和多模态词表示