The advanced language processing abilities of large language models (LLMs)
have stimulated debate over their capacity to replicate human-like cognitive
processes. One differentiating factor between language processing in LLMs and
humans is that, for humans, language input is often grounded in more than one
perceptual modality, whereas most LLMs process solely text-based information.
Multimodal grounding allows humans to integrate, e.g., visual context with
linguistic information and thereby place constraints on the space of upcoming
words, reducing cognitive load and improving perception and comprehension. Recent
multimodal LLMs (mLLMs) combine visual and linguistic embedding spaces with a
transformer-type attention mechanism for next-word prediction. To what extent
does predictive language processing based on multimodal input align in mLLMs
and humans? To answer this question, 200 human participants watched short
audio-visual clips and estimated the predictability of an upcoming verb or
noun. The same clips were processed by the mLLM CLIP, with predictability
scores based on a comparison of image and text feature vectors. Eye-tracking
was used to estimate what visual features participants attended to, and CLIP's
visual attention weights were recorded. We find that human estimates of
predictability align significantly with CLIP scores, but not with scores from a
unimodal LLM of comparable parameter size. Further, alignment vanished when
CLIP's visual
attention weights were perturbed, and when the same input was fed to a
multimodal model without attention. Analysing attention patterns, we find a
significant spatial overlap between CLIP's visual attention weights and human
eye-tracking data. Results suggest that comparable processes of integrating
multimodal information, guided by attention to relevant visual features,
support predictive language processing in mLLMs and humans.
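
The comparison of image and text feature vectors mentioned above can be sketched as follows. The abstract does not say which comparison was used; cosine similarity is assumed here because it is the measure CLIP is commonly paired with. The function names and the toy vectors are hypothetical, and real feature vectors would come from CLIP's image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def predictability_scores(image_vec, candidate_text_vecs):
    """Score each candidate word by its similarity to the image features.

    Assumption: a higher similarity is read as a more predictable word.
    `candidate_text_vecs` maps a word to its (hypothetical) text embedding.
    """
    return {
        word: cosine_similarity(image_vec, vec)
        for word, vec in candidate_text_vecs.items()
    }
```

Calling `predictability_scores(frame_vec, {"eat": eat_vec, "run": run_vec})` with one clip's image features would return one score per candidate verb, which could then be correlated with the human ratings.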

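The advanced language processing abilities of large language models (LLMs) have sparked debate over whether they can replicate human-like cognitive processes. By studying visual attention weights in multimodal language models (mLLMs), this paper finds that, as in humans, predictive language processing based on multimodal input in mLLMs is guided by attention to visual features.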

Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing

Current captioning datasets focus on object-centric captions, describing the
visible objects in the image, often ending up stating the obvious (for humans),
e.g. "people eating food in a park". Although these datasets are useful to
evaluate the ability of Vision & Language models to recognize the visual
content, they fall short in expressing trivial abstract concepts, e.g. "people
having a picnic". Such concepts are licensed by human personal experience and
contribute to forming common-sense assumptions. We present the High-Level
Dataset, a dataset extending 14997 images of the COCO dataset with 134973
human-annotated (high-level) abstract captions collected along three axes:
scenes, actions and rationales. We describe and release this dataset and
show how it can be used to assess models' multimodal grounding of abstract
concepts and enrich models' visio-linguistic representations. Moreover, we
describe potential tasks enabled by this dataset involving interactions between
high- and low-level concepts.
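
The figures above imply a fixed number of captions per image, which a small sketch can make explicit. The record layout below is hypothetical and is not the released dataset's schema; only the three axis names and the two counts come from the abstract.

```python
from dataclasses import dataclass

# The three annotation axes named in the abstract.
AXES = ("scene", "action", "rationale")


@dataclass(frozen=True)
class HighLevelCaption:
    """One abstract caption attached to a COCO image (hypothetical layout)."""
    image_id: int
    axis: str
    text: str

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")


def captions_per_image(n_captions, n_images):
    """Average number of captions attached to each image."""
    return n_captions / n_images


# Counts taken from the abstract: 134973 captions over 14997 images.
average = captions_per_image(134973, 14997)
```

Here `average` comes out at exactly 9. That would be consistent with three captions on each of the three axes, although the abstract does not state the split.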

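This paper introduces a new High-Level Dataset that extends the classic COCO dataset, enabling machine learning models to better understand abstract concepts and further improving their multimodal fusion abilities.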

HL Dataset: Grounding High-Level Linguistic Concepts in Vision

We present MUG, a novel interactive task for multimodal grounding where a
user and an agent work collaboratively on an interface screen. Prior work
modeled multimodal UI grounding in one round: the user gives a command and the
agent responds to the command. Yet, in a realistic scenario, a user command can
be ambiguous when the target action is inherently difficult to articulate in
natural language. MUG allows multiple rounds of interactions such that upon
seeing the agent responses, the user can give further commands for the agent to
refine or even correct its actions. Such interaction is critical for improving
grounding performance in real-world use cases. To investigate the problem, we
create a new dataset that consists of 77,820 sequences of human user-agent
interaction on mobile interfaces, of which 20% involve multiple rounds of
interactions. To establish our benchmark, we experiment with a range of
modeling variants and evaluation strategies, including both offline and online
evaluation; the online strategy consists of both human evaluation and
automatic evaluation with simulators. Our experiments show that allowing
iterative interaction
significantly improves the absolute task completion by 18% over the entire test
dataset and 31% over the challenging subset. Our results lay the foundation for
further investigation of the problem.
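
The multi-round protocol described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the MUG implementation: the `agent` and `user` callables, the `max_rounds` limit, and the convention that the user returns `None` to accept an action are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (user command, agent action) pairs


def run_interaction(
    first_command: str,
    agent: Callable[[str, History], str],
    user: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> History:
    """Alternate agent actions and user follow-ups until the user accepts.

    The agent sees the current command plus all earlier rounds; the user
    sees the agent's action and returns a refining command, or None when
    the action is acceptable (an assumed stopping convention).
    """
    history: History = []
    command: Optional[str] = first_command
    for _ in range(max_rounds):
        action = agent(command, history)
        history.append((command, action))
        command = user(action)
        if command is None:
            break
    return history
```

With `max_rounds=1` the loop reduces to the single-round setting of prior work, so the same function could serve both conditions in a comparison.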

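To address linguistic ambiguity in multimodal interface dialogue interaction, this paper proposes a new interactive task, MUG, and builds a dataset of 77,820 interaction sequences between human users and an agent. Evaluation with offline and online strategies shows that allowing iterative interaction significantly improves the task completion rate.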