The ability to reason organically over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, and so fail to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks; even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.

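This work addresses the weakness of multimodal large language models in integrated reasoning over text and images by proposing the EMMA benchmark, which evaluates organic multimodal reasoning in mathematics, physics, chemistry, and coding. The results show that existing models have significant limitations on complex multimodal and multi-step reasoning tasks, underscoring the need for better multimodal architectures and training methods to bring model reasoning closer to human reasoning.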

Can Multimodal Large Language Models Reason? EMMA: An Enhanced Multimodal Reasoning Benchmark