Large Vision-Language Models (LVLMs) have demonstrated strong abilities in image perception and language understanding. However, existing multimodal benchmarks focus on basic perception abilities and commonsense knowledge, which is insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps, and photos. GAOKAO-MM is drawn from a native Chinese context and sets human-level requirements for a model's abilities, including perception, understanding, knowledge, and reasoning. We evaluate 10 LVLMs and find that all of them achieve accuracies below 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%), and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimensional analysis indicate that LVLMs remain a moderate distance from Artificial General Intelligence (AGI) and provide insights to facilitate the development of multilingual LVLMs.

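Summary: This work proposes GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination, and evaluates 10 Large Vision-Language Models (LVLMs). All of them score below 50% accuracy, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%), and Gemini-Pro-Vision (35.1%) ranking in the top three. The multi-dimensional analysis indicates that LVLMs remain a moderate distance from Artificial General Intelligence (AGI) and offers insights for the development of multilingual LVLMs.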

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Model Evaluation