We introduce MMMU, a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields and comprise 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source large multimodal models (LMMs) and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU: even the advanced GPT-4V achieves only 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI