Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, which allows it to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and an assessment of LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision-language and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.

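By introducing the VISLA benchmark, which evaluates the semantic and lexical understanding of language models, this paper reveals the challenges that state-of-the-art language models face in grasping semantic details. Vision-language models and unimodal language models are evaluated on a 3-way semantic (in)equivalence task over a triplet of sentences associated with an image. The results show difficulty in distinguishing lexical from semantic variations, and the text encoders of vision-language models prove more sensitive to semantic and lexical variations than unimodal text encoders. The paper's contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation that requires no fine-tuning, and an assessment of language models' semantic (in)variance in the presence of lexical alterations.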
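To make the triplet task concrete, the sketch below shows one way such a check could be scored from precomputed embeddings. It is an illustration under stated assumptions, not the benchmark's actual protocol: the abstract does not specify the scoring rule, so the criterion used here (the two semantically equivalent sentences must score higher than the semantically different one) and the function names `text_to_text_correct` and `image_to_text_correct` are hypothetical. The toy vectors stand in for real encoder outputs.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def text_to_text_correct(p1, p2, n):
    """Assumed criterion: the two equivalent sentences (p1, p2) are closer
    to each other than either is to the semantically different sentence n."""
    s_pp = cosine(p1, p2)
    return s_pp > cosine(p1, n) and s_pp > cosine(p2, n)


def image_to_text_correct(img, p1, p2, n):
    """Assumed criterion: the image embedding is closer to both equivalent
    sentences than to the semantically different one."""
    s_n = cosine(img, n)
    return cosine(img, p1) > s_n and cosine(img, p2) > s_n


# Toy embeddings (hypothetical values, not model outputs).
p1 = [1.0, 0.0, 0.0]    # sentence 1
p2 = [0.9, 0.1, 0.0]    # paraphrase of sentence 1 (same meaning, different words)
n = [0.0, 1.0, 0.0]     # similar words, different meaning
img = [1.0, 0.05, 0.0]  # image embedding in the same space

t2t_ok = text_to_text_correct(p1, p2, n)
i2t_ok = image_to_text_correct(img, p1, p2, n)
```

Because both checks operate only on embeddings, the same triplet can be scored for a unimodal text encoder (text-to-text only) and for a vision-language model (both checks), which is the sense in which the two retrieval tasks are unified.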

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations