Modern neural language models (LMs) are powerful tools for modeling human
sentence production and comprehension, and their internal representations are
remarkably well-aligned with representations of language in the human brain.
But to achieve these results, LMs must be trained in distinctly un-human-like
ways -- requiring orders of magnitude more language data than children receive
during development, and without any of the accompanying grounding in
perception, action, or social behavior. Do models trained more naturalistically
-- with grounded supervision -- exhibit more human-like language learning? We
investigate this question in the context of word learning, a key sub-task in
language acquisition. We train a diverse set of LM architectures, with and
without auxiliary supervision from image captioning tasks, on datasets of
varying scales. We then evaluate these models on a broad set of benchmarks
characterizing models' learning of syntactic categories, lexical relations,
semantic features, semantic similarity, and alignment with human neural
representations. We find that visual supervision can indeed improve the
efficiency of word learning. However, these improvements are limited: they are
present almost exclusively in the low-data regime, and are sometimes canceled out
by the inclusion of rich distributional signals from text. The information
conveyed by text and images is not redundant -- we find that models mainly
driven by visual information yield qualitatively different results from those mainly
driven by word co-occurrences. However, our results suggest that current
multi-modal modeling approaches fail to effectively leverage visual information
to build more human-like word representations from human-sized datasets.
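
As an illustration of the semantic-similarity style of benchmark mentioned above, the sketch below scores a set of word embeddings by rank-correlating their cosine similarities with human similarity ratings. The abstract does not say which metric or data the authors use, so the choice of Spearman correlation is an assumption here, and the toy embeddings, word pairs, and ratings are all hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def ranks(values):
    """Rank values from 0 to n-1 (this simple version assumes no ties)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])


# Hypothetical 2-D embeddings; a real evaluation would read these from a model.
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
    "truck": np.array([0.2, 0.8]),
}

# Hypothetical word pairs with made-up human similarity ratings.
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "car"), ("dog", "truck")]
human_ratings = [9.0, 8.5, 1.0, 2.0]

model_sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
rho = spearman(model_sims, human_ratings)
```

A higher `rho` means the model orders word pairs more like the human raters do. Comparing `rho` across models trained on different amounts of data is one way to express learning efficiency.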

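Training language models with grounded visual supervision, we find that when language data is limited, visual supervision can improve the efficiency of word learning. However, this improvement is limited, and current multi-modal modeling approaches fail to effectively leverage visual information to build more human-like word representations.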

Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Enabling effective brain-computer interfaces requires understanding how the
human brain encodes stimuli across modalities such as vision and language (or
text). Brain encoding aims at constructing fMRI brain activity given a
stimulus. There exists a plethora of neural encoding models which study brain
encoding for single-mode stimuli: visual (pretrained CNNs) or text (pretrained
language models). A few recent papers have also obtained separate visual and text
representation models and performed late-fusion using simple heuristics.
However, previous work has failed to explore: (a) the effectiveness of image
Transformer models for encoding visual stimuli, and (b) co-attentive
multi-modal modeling for visual and text reasoning. In this paper, we
systematically explore the efficacy of image Transformers (ViT, DeiT, and BEiT)
and multi-modal Transformers (VisualBERT, LXMERT, and CLIP) for brain encoding.
Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide
the following insights. (1) To the best of our knowledge, we are the first to
investigate the effectiveness of image and multi-modal Transformers for brain
encoding. (2) We find that VisualBERT, a multi-modal Transformer, significantly
outperforms previously proposed single-mode CNNs, image Transformers, and
other previously proposed multi-modal models, thereby establishing a new
state of the art. The superiority of visio-linguistic models raises the question
of whether the responses elicited in the visual regions are affected implicitly
by linguistic processing even when passively viewing images. Future fMRI tasks
can verify this computational insight in an appropriate experimental setting.
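
To make the brain encoding setup described above concrete, the sketch below fits a linear map from stimulus features to voxel responses and scores it by per-voxel Pearson correlation on held-out stimuli. The abstract does not state the regression model or metric, so ridge regression and Pearson correlation are assumptions here, as are the helper names `fit_ridge` and `voxelwise_pearson`. The data are synthetic stand-ins, not BOLD5000 or Pereira recordings.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


def demo(seed=0):
    rng = np.random.default_rng(seed)
    n_train, n_test, n_features, n_voxels = 200, 50, 16, 5
    n_total = n_train + n_test
    # Synthetic stand-ins: X plays the role of features from a pretrained
    # model, Y the role of fMRI responses. Nothing here is real data.
    W_true = rng.normal(size=(n_features, n_voxels))
    X = rng.normal(size=(n_total, n_features))
    Y = X @ W_true + 0.1 * rng.normal(size=(n_total, n_voxels))
    # Fit on training stimuli, evaluate on held-out stimuli.
    W = fit_ridge(X[:n_train], Y[:n_train], alpha=1.0)
    return voxelwise_pearson(Y[n_train:], X[n_train:] @ W)


scores = demo()
```

In this framing, comparing encoders such as a CNN, an image Transformer, or a multi-modal Transformer amounts to swapping the source of `X` and comparing the resulting per-voxel scores.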

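This paper systematically explores the effectiveness of image Transformers and multi-modal Transformers for brain encoding. It finds that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs, image Transformers, and other previously proposed multi-modal models. This superiority of visio-linguistic models raises the question of whether responses in the visual regions are affected by linguistic processing even when people passively view images.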