The advancement of large language models (LLMs) has significantly broadened
the scope of applications in natural language processing, with multi-modal LLMs
extending these capabilities to integrate and interpret visual data. However,
existing benchmarks for visual language models (VLMs) predominantly focus on
single-image inputs, neglecting the crucial aspect of multi-image
understanding. In this paper, we introduce the Multi-Image Relational Benchmark
(MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across
multiple images. Our benchmark encompasses four categories: perception, visual
world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive
evaluation of a wide range of open-source and closed-source models, we
demonstrate that while open-source VLMs have been shown to approach the performance
of GPT-4V in single-image tasks, a significant performance gap remains in
multi-image reasoning tasks. Our findings also reveal that even the
state-of-the-art GPT-4V model struggles with our benchmark, underscoring the
need for further research and development in this area. We believe MIRB can
serve as a testbed for developing next-generation multi-modal models.

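By introducing the Multi-Image Relational Benchmark (MIRB), we evaluate the ability of vision-language models (VLMs) to compare, analyze, and reason across multiple images, and find that open-source VLMs approach GPT-4V's performance on single-image tasks but show a significant performance gap on multi-image reasoning tasks. Our findings show that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe MIRB can serve as a testbed for developing next-generation multi-modal models.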

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

We introduce MuirBench, a comprehensive benchmark that focuses on robust
multi-image understanding capabilities of multimodal LLMs. MuirBench consists
of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that
involve 10 categories of multi-image relations (e.g., multiview, temporal
relations). Comprising 11,264 images and 2,600 multiple-choice questions,
MuirBench is created in a pairwise manner, where each standard instance is
paired with an unanswerable variant that has minimal semantic differences, to
enable reliable assessment. Our evaluation of 20 recent multi-modal LLMs
reveals that even the best-performing models, such as GPT-4o and Gemini Pro,
find MuirBench challenging, achieving 68.0% and 49.3% accuracy, respectively.
Open-source multimodal LLMs trained on single images can hardly generalize to
multi-image questions, hovering below 33.3% in accuracy. These results
highlight the importance of MuirBench in encouraging the community to develop
multimodal LLMs that can look beyond a single image, suggesting potential
pathways for future improvements.

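MuirBench is a comprehensive benchmark focused on the robust multi-image understanding capabilities of multimodal LLMs. It consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) involving 10 categories of multi-image relations (e.g., multiview, temporal relations). An evaluation of 20 recent multimodal LLMs shows that even the best-performing models, GPT-4o and Gemini Pro, find MuirBench challenging, achieving 68.0% and 49.3% accuracy, respectively. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, with accuracy below 33.3%. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, and suggest potential pathways for future improvements.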
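
The pairwise design described above, in which each standard multiple-choice instance is paired with an unanswerable variant, can be illustrated with a small scoring sketch. This is not MuirBench's official evaluation code. The `Instance` fields, the way an unanswerable item's gold label is represented, and the `pairwise_accuracy` metric are all assumptions introduced here for illustration. Only plain multiple-choice accuracy is taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # All field names are hypothetical, not taken from the benchmark release.
    pair_id: str      # id shared by a standard item and its unanswerable variant
    answerable: bool  # False for the unanswerable variant
    gold: str         # gold option label; for unanswerable items we assume it
                      # is whichever option means "cannot be answered"
    predicted: str    # option label chosen by the model


def accuracy(instances):
    """Fraction of instances whose predicted option matches the gold option."""
    if not instances:
        return 0.0
    return sum(i.predicted == i.gold for i in instances) / len(instances)


def pairwise_accuracy(instances):
    """Illustrative stricter metric: a pair counts only if both the standard
    item and its unanswerable variant are answered correctly."""
    pairs = {}
    for inst in instances:
        pairs.setdefault(inst.pair_id, []).append(inst)
    complete = [p for p in pairs.values() if len(p) == 2]
    if not complete:
        return 0.0
    return sum(all(i.predicted == i.gold for i in p) for p in complete) / len(complete)


# Toy predictions: the model repeats "B" on the unanswerable variant of q1.
results = [
    Instance("q1", True, "B", "B"),
    Instance("q1", False, "D", "B"),
    Instance("q2", True, "A", "A"),
    Instance("q2", False, "C", "C"),
]

print(accuracy(results))           # 0.75
print(pairwise_accuracy(results))  # 0.5
```

In this toy data the model gets three of four items right but only one of two pairs, which shows how pairing can expose a model that picks the same option whether or not the question is answerable.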