Commonsense reasoning is fundamentally based on multimodal knowledge.
However, existing large language models (LLMs) are primarily trained using
textual data only, limiting their ability to incorporate essential visual
information. In contrast, Visual Language Models, which excel at
visually-oriented tasks, often fail at non-visual tasks such as basic
commonsense reasoning. This divergence highlights a critical challenge - the
integration of robust visual understanding with foundational text-based
language reasoning. To this end, we introduce a method aimed at enhancing LLMs'
visual commonsense. Specifically, our method generates multiple images based on
the input text prompt and integrates these into the model's decision-making
process by mixing their prediction probabilities. To facilitate multimodal
grounded language modeling, we employ a late-fusion layer that combines the
projected visual features with the output of a pre-trained LLM conditioned on
text only. This late-fusion layer enables predictions based on comprehensive
image-text knowledge as well as text only when this is required. We evaluate
our approach using several visual commonsense reasoning tasks together with
traditional NLP tasks, including common sense reasoning and reading
comprehension. Our experimental results demonstrate significant superiority
over existing baselines. When applied to recent state-of-the-art LLMs (e.g.,
Llama3), we observe improvements not only in visual common sense but also in
traditional NLP benchmarks. Code and models are available under
this https URL

基于多模态知识的常识推理是根本，我们介绍了一种方法来增强大型语言模型的视觉常识能力，该方法通过生成多个图像并将其与模型的决策过程相融合来提供综合的图像和文本知识。这种方法在不仅在视觉常识上，还在传统自然语言处理基准上优于现有基线模型。

通过多图像生成改善语言模型中的视觉常识

Improving Visual Commonsense in Language Models via Multiple Image  Generation

Visual commonsense contains knowledge about object properties, relationships,
and behaviors in visual data. Discovering visual commonsense can provide a more
comprehensive and richer understanding of images, and enhance the reasoning and
decision-making capabilities of computer vision systems. However, the visual
commonsense defined in existing visual commonsense discovery studies is
coarse-grained and incomplete. In this work, we draw inspiration from a
commonsense knowledge base ConceptNet in natural language processing, and
systematically define the types of visual commonsense. Based on this, we
introduce a new task, Visual Commonsense Discovery (VCD), aiming to extract
fine-grained commonsense of different types contained within different objects
in the image. We accordingly construct a dataset (VCDD) from Visual Genome and
ConceptNet for VCD, featuring over 100,000 images and 14 million
object-commonsense pairs. We furthermore propose a generative model (VCDM) that
integrates a vision-language model with instruction tuning to tackle VCD.
Automatic and human evaluations demonstrate VCDM's proficiency in VCD,
particularly outperforming GPT-4V in implicit commonsense discovery. The value
of VCD is further demonstrated by its application to two downstream tasks,
including visual commonsense evaluation and visual question answering. The data
and code will be made available on GitHub.

本研究通过借鉴自然语言处理中常识知识库 ConceptNet 的方法，系统定义了视觉常识的各种类型，并提出了一种新的任务 - 视觉常识发现（VCD），旨在提取图像中不同对象包含的细粒度常识。通过构建包括超过 10 万张图像和 1400 万个对象 - 常识对的数据集（VCDD），并提出了一种将视觉 - 语言模型与指令调整相结合的生成模型（VCDM），其在 VCD 中表现出色，尤其在隐含常识发现方面优于 GPT-4V。VCD 的价值进一步得到了两个下游任务的应用验证，包括视觉常识评估和视觉问答。数据和代码在 GitHub 上可获得。

基于知识库的图像视觉常识发现

VCD: Knowledge Base Guided Visual Commonsense Discovery in Images

Weird, unusual, and uncanny images pique the curiosity of observers because
they challenge commonsense. For example, an image released during the 2022
world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo
playing chess, which playfully violates our expectation that their competition
should occur on the football field. Humans can easily recognize and interpret
these unconventional images, but can AI models do the same? We introduce
WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is
comprised of purposefully commonsense-defying images created by designers using
publicly-available image generation tools like Midjourney. We consider several
tasks posed over the dataset. In addition to image captioning, cross-modal
matching, and visual question answering, we introduce a difficult explanation
generation task, where models must identify and explain why a given image is
unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2
still lag behind human performance on WHOOPS!. We hope our dataset will inspire
the development of AI models with stronger visual commonsense reasoning
abilities. Data, models and code are available at the project website:
this http URL

介绍了一种名为 WHOOPS！的新视觉常识数据集和基准，其中包括几种面向该数据集的任务，包括图像字幕，跨模式匹配，视觉问答和解释生成任务。结果表明，目前最先进的 AI 模型仍然落后于人类在 WHOOPS！上的表现，希望这个数据集能够激发开发更强的视觉常识推理能力的 AI 模型的灵感。