Recently, Pretrained Language Models (PLMs) have been serving as
general-purpose interfaces, posing a significant demand for comprehensive
visual knowledge. However, it remains unclear how well current PLMs and their
visually augmented counterparts (VaLMs) can master visual commonsense
knowledge. To investigate this, we propose ImageNetVC, a fine-grained,
human-annotated dataset specifically designed for zero-shot visual commonsense
evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve
into the fundamental visual commonsense knowledge of both unimodal PLMs and
VaLMs, uncovering the scaling law and the influence of the backbone model on
VaLMs. Furthermore, we investigate the factors affecting the visual commonsense
knowledge of large-scale models, providing insights into the development of
language models enriched with visual commonsense knowledge. Our code and
dataset are available at this https URL

本文利用人为标注的数据集 ImageNetVC，探究了先前被作为通用接口使用的 预训练语言模型（PLMs）和其带视觉增强的对应模型（VaLMs）的视觉常识知识掌握情况及其影响因素。同时，通过研究大规模模型的因素，提供了对视觉常识知识丰富的自然语言模型发展的启示。

ImageNetVC：1000 个 ImageNet 类别上的零样本视觉常识评估

ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet  Categories

There are limitations in learning language from text alone. Therefore, recent
focus has been on developing multimodal models. However, few benchmarks exist
that can measure what language models learn about language from multimodal
training. We hypothesize that training on a visual modality should improve on
the visual commonsense knowledge in language models. Therefore, we introduce
two evaluation tasks for measuring visual commonsense knowledge in language
models and use them to evaluate different multimodal models and unimodal
baselines. Primarily, we find that the visual commonsense knowledge is not
significantly different between the multimodal models and unimodal baseline
models trained on visual text data.

研究利用多模态模型来学习语言的局限性，提出了两个评估任务来衡量语言模型在视觉常识知识方面的表现。结果发现，基于视觉文本数据的多模态模型和单模态模型在视觉常识知识方面表现不显著不同。