Spurious bias, a tendency to use spurious correlations between non-essential
input attributes and target variables for predictions, has revealed a severe
robustness pitfall in deep learning models trained on single-modality data.
Multimodal Large Language Models (MLLMs), which integrate both vision and
language models, have demonstrated strong capability in joint vision-language
understanding. However, whether spurious biases are prevalent in MLLMs remains
under-explored. We address this gap by analyzing spurious biases in a
multimodal setting, uncovering the specific test data patterns that can
manifest this problem when biases in the vision model cascade into the
alignment between visual and text tokens in MLLMs. To better understand this
problem, we introduce MM-SpuBench, a comprehensive visual question-answering
(VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct
categories of spurious correlations from five open-source image datasets. The
VQA dataset is built from human-understandable concept information
(attributes). Leveraging this benchmark, we conduct a thorough evaluation of
current state-of-the-art MLLMs. Our findings reveal that these models
persistently rely on spurious correlations and underscore the urgent need for
new methodologies to mitigate spurious biases. To support research on MLLM
robustness, we release our VQA benchmark at
this https URL
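
As a rough illustration of how reliance on each category of spurious correlation could be summarized, the sketch below computes accuracy per category over multiple-choice VQA records. The record fields (`category`, `answer`, `predicted`) and the category names are hypothetical placeholders, not the benchmark's actual schema or taxonomy.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["predicted"] == rec["answer"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Hypothetical records: the category names are illustrative only.
records = [
    {"category": "background", "answer": "A", "predicted": "A"},
    {"category": "background", "answer": "B", "predicted": "C"},
    {"category": "co-occurrence", "answer": "D", "predicted": "D"},
]

for category, acc in accuracy_by_category(records).items():
    print(f"{category}: {acc:.2f}")
```

A low accuracy in one category would suggest, under these assumptions, that the model leans on that kind of spurious cue more than on the others.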

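Training deep learning models on single-modality data readily leads to spurious biases, while multimodal large language models (MLLMs) have shown strong capability in integrating vision and language models. We analyze spurious biases in MLLMs and reveal the specific test data patterns that manifest this problem when biases in the vision model affect the alignment between visual and text tokens. We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark built from five open-source image datasets, and use it to evaluate current state-of-the-art MLLMs. Our results show that these models persistently rely on spurious correlations and underscore the urgent need for new methods to mitigate spurious biases. To support research on MLLM robustness, we release our VQA benchmark at the linked URL.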

MM-SpuBench: A Better Understanding of Spurious Biases in Multimodal LLMs

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

We introduce a novel visual question answering (VQA) task in the context of
autonomous driving, aiming to answer natural language questions based on
street-view clues. Compared to traditional VQA tasks, VQA in the autonomous
driving scenario presents more challenges. Firstly, the raw visual data are
multi-modal, including images and point clouds captured by camera and LiDAR,
respectively. Secondly, the data are multi-frame due to the continuous,
real-time acquisition. Thirdly, the outdoor scenes exhibit both moving
foreground and static background. Existing VQA benchmarks fail to adequately
address these complexities. To bridge this gap, we propose NuScenes-QA, the
first benchmark for VQA in the autonomous driving scenario, encompassing 34K
visual scenes and 460K question-answer pairs. Specifically, we leverage
existing 3D detection annotations to generate scene graphs and design question
templates manually. Subsequently, the question-answer pairs are generated
programmatically based on these templates. Comprehensive statistics show that
NuScenes-QA is a balanced, large-scale benchmark with diverse question
formats. Built upon it, we develop a series of baselines that employ advanced
3D detection and VQA techniques. Our extensive experiments highlight the
challenges posed by this new task. The code and dataset are available at
this https URL
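
The template-based generation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy scene, its fields, and the two templates are hypothetical, and a real scene graph would also encode spatial relations between objects.

```python
from collections import Counter

# Toy scene: each dict stands in for one 3D detection annotation (assumed fields).
SCENE = [
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "parked"},
    {"category": "pedestrian", "status": "moving"},
]


def generate_qa(scene, categories):
    """Fill two hand-written templates for every category of interest."""
    counts = Counter(obj["category"] for obj in scene)
    qa_pairs = []
    for cat in categories:
        n = counts[cat]  # Counter returns 0 for categories absent from the scene
        qa_pairs.append((f"Are there any {cat}s?", "yes" if n > 0 else "no"))
        qa_pairs.append((f"How many {cat}s are there?", str(n)))
    return qa_pairs


for question, answer in generate_qa(SCENE, ["car", "pedestrian", "truck"]):
    print(question, "->", answer)
```

Because the answers are derived from the annotations rather than written by hand, the question set can scale with the number of annotated scenes, which is consistent with the benchmark size reported above.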

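We introduce a novel visual question answering (VQA) task in the context of autonomous driving, which aims to answer natural language questions based on street-view clues. We propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, comprising 34K visual scenes and 460K question-answer pairs. We use existing 3D detection annotations to generate scene graphs and design the question templates manually. The result is a balanced, large-scale benchmark with diverse question formats.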

NuScenes-QA: A Multimodal Visual Question Answering Benchmark for Autonomous Driving Scenarios

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

This paper presents a new baseline for the visual question answering task. Given
an image and a question in natural language, our model produces accurate
answers according to the content of the image. Our model, while being
architecturally simple and relatively small in terms of trainable parameters,
sets a new state of the art on both unbalanced and balanced VQA benchmarks. On
the VQA 1.0 open-ended challenge, our model achieves 64.6% accuracy on the
test-standard set without using additional data, an improvement of 0.4% over
the state of the art, and on the newly released VQA 2.0, our model scores 59.7%
on the validation set, outperforming the best previously reported results by
0.5%. The
results presented in this paper are especially interesting because very similar
models have been tried before but significantly lower performance was
reported. In light of the new results, we hope to see more meaningful research
on visual question answering in the future.
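
The abstract does not state how accuracy is computed. Open-ended VQA benchmarks are commonly scored with a consensus metric over ten human answers per question, in which a predicted answer earns min(number of matching human answers / 3, 1). The sketch below assumes that metric and omits the answer-string normalization and other details that official evaluation scripts apply; the function name and example answers are illustrative.

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy: full credit once at least three annotators agree."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten hypothetical human answers for one question.
humans = ["red"] * 2 + ["maroon"] * 8

print(vqa_accuracy("red", humans))     # 2 matches, partial credit of 2/3
print(vqa_accuracy("maroon", humans))  # 8 matches, capped at 1.0
print(vqa_accuracy("blue", humans))    # no matches, 0.0
```

A benchmark-level score under this assumption is the mean of the per-question values, which is why gains of a few tenths of a percent can be meaningful.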

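This paper presents a new baseline model for the visual question answering task. It produces accurate answers from the content of an image and a natural language question, and sets a new state of the art on both unbalanced and balanced VQA benchmarks.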