The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.

通过引入多图像关系基准（MIRB），我们评估了视觉语言模型（VLMs）在比较、分析和推理多个图像时的能力，并发现开源VLMs在单图像任务中接近GPT-4V的性能，但在多图像推理任务中存在显著的性能差距。我们的发现表明，即使是最先进的GPT-4V模型在我们的基准测试中也存在困难，强调了该领域进一步研究和开发的必要性。我们相信我们的MIRB可以作为开发下一代多模态模型的测试平台。

视觉与语言模型中的多图像理解基准测试：感知、知识、推理和多跳推理