Counterfactual reasoning, as a crucial manifestation of human intelligence,
refers to making presuppositions based on established facts and extrapolating
potential outcomes. Existing multimodal large language models (MLLMs) have
exhibited impressive cognitive and reasoning capabilities, which have been
examined across a wide range of Visual Question Answering (VQA) benchmarks.
Nevertheless, how will existing MLLMs perform when faced with counterfactual
questions? To answer this question, we first curate a novel
\textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning
benchmark, abbreviated as \textbf{CFMM}, to systematically assess the
counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six
challenging tasks, each including hundreds of carefully human-labeled
counterfactual questions, to evaluate MLLMs' counterfactual reasoning
capabilities across diverse aspects. Interestingly, our experiments show that
existing MLLMs tend to believe what they see and ignore the counterfactual
presuppositions presented in the question, which leads to inaccurate
responses. Furthermore, we evaluate a wide range of prevalent MLLMs
on our proposed CFMM. The significant gap between their performance on our CFMM
and that on several VQA benchmarks indicates that there is still considerable
room for improvement in existing MLLMs toward approaching human-level
intelligence. Conversely, improving MLLMs' performance on CFMM in future work
may open potential avenues toward developing MLLMs with advanced
intelligence.

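Building on the cognitive and reasoning capabilities that existing multimodal
large language models (MLLMs) have shown in visual question answering
evaluations, we propose a new CFMM (Counterfactual MultiModal) benchmark to
systematically assess the counterfactual reasoning capabilities of MLLMs. We
find that existing MLLMs tend to believe what they see while ignoring the
counterfactual presuppositions stated in the question, which leads to
inaccurate responses. This also indicates that existing MLLMs still have
considerable room for improvement in approaching human-level intelligence. We
further explore potential avenues toward developing MLLMs with advanced
intelligence by improving their performance on CFMM in the future.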