Multipanel images, commonly seen as web screenshots, posters, etc., pervade
our daily lives. These images, characterized by their composition of multiple
subfigures in distinct layouts, effectively convey information to people.
Toward building advanced multimodal AI applications, such as agents that
understand complex scenes and navigate through webpages, the skill of
multipanel visual reasoning is essential, and a comprehensive evaluation of
models in this regard is important. Therefore, our paper introduces Multipanel
Visual Question Answering (MultipanelVQA), a novel benchmark that specifically
challenges models in comprehending multipanel images. The benchmark comprises
6,600 questions and answers related to multipanel images. While these questions
are straightforward for average humans, who achieve nearly perfect correctness,
they pose significant challenges to the state-of-the-art Large Vision Language
Models (LVLMs) we tested. In our study, we utilized synthetically curated
multipanel images specifically designed to isolate and evaluate the impact of
diverse factors on model performance, revealing the sensitivity of LVLMs to
various interferences in multipanel images, such as adjacent subfigures and
layout complexity. As a result, MultipanelVQA highlights the need and direction
for improving LVLMs' ability to understand complex visual-language contexts.
Code and data have been released.
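As an illustration of how layout can be varied as a controlled factor in synthetic multipanel images, the following minimal sketch computes where each subfigure would sit on a regular grid canvas. It is not the benchmark's released generation code; the names `Panel`, `grid_layout`, and `canvas_size`, the regular-grid assumption, and the `gap` parameter are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Panel:
    """Pixel placement of one subfigure on the canvas (hypothetical type)."""
    index: int
    left: int
    top: int
    width: int
    height: int


def grid_layout(rows: int, cols: int, panel_w: int, panel_h: int,
                gap: int = 0) -> List[Panel]:
    """Return panel placements for a rows x cols grid, in row-major order.

    Assumption: every panel has the same size and panels are separated
    by a uniform gap. More irregular layouts are not modeled here.
    """
    panels = []
    for r in range(rows):
        for c in range(cols):
            panels.append(Panel(
                index=r * cols + c,
                left=c * (panel_w + gap),
                top=r * (panel_h + gap),
                width=panel_w,
                height=panel_h,
            ))
    return panels


def canvas_size(rows: int, cols: int, panel_w: int, panel_h: int,
                gap: int = 0) -> Tuple[int, int]:
    """Return (width, height) of the canvas needed to hold the whole grid."""
    width = cols * panel_w + (cols - 1) * gap
    height = rows * panel_h + (rows - 1) * gap
    return width, height
```

Holding the subfigure content fixed while changing only `rows`, `cols`, or `gap` is one way to vary layout complexity without changing anything else about the image.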

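By introducing the Multipanel Visual Question Answering (MultipanelVQA) benchmark, this study reveals the challenges that Large Vision Language Models (LVLMs) face in understanding multipanel images, and highlights the need and direction for improving LVLMs' understanding of complex visual contexts.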

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

Replicating the innate human ability to detect all objects based on free-form
texts at any granularity remains a formidable challenge for Vision-Language
models. Current Large Vision Language Models (LVLMs) are predominantly
constrained to grounding a single, pre-existing object, relying solely on data
from Referring Expression Comprehension tasks. This limitation leads to a
compromise in model design, necessitating the introduction of visual expert
models or the integration of customized head structures. Beyond these
constraints, our research delves into the untapped potential of LVLMs and
uncovers their inherent capability for basic object perception, allowing them to
accurately identify and locate objects of interest. Building on this insight,
we introduce a novel language-prompted localization dataset designed to fully
unleash the capabilities of LVLMs in integrating fine-grained object perception
with precise location awareness. More importantly, we present
$\textbf{Griffon}$, a purely LVLM-based baseline, which does not require the
introduction of any special tokens, expert models, or additional detection
modules. It simply maintains a consistent structure with popular LVLMs by
unifying data formats across various localization-related scenarios and is
trained end-to-end through a well-designed pipeline. Comprehensive experiments
demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art
performance on the fine-grained RefCOCO series but also approaches the
capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.
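To make the idea of localization without special tokens or detection heads more concrete, the sketch below writes a bounding box as ordinary text that a language model could emit, and parses it back. The exact data format used by Griffon is not given in this abstract, so the `label: [x1, y1, x2, y2]` layout, the normalization to the range 0 to 1, and the function names `box_to_text` and `text_to_box` are assumptions made purely for illustration.

```python
import re
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Hypothetical plain-text pattern: "label: [x1, y1, x2, y2]"
_BOX_PATTERN = re.compile(
    r"(.+): \[([0-9.]+), ([0-9.]+), ([0-9.]+), ([0-9.]+)\]"
)


def box_to_text(label: str, box: Box, width: int, height: int,
                precision: int = 3) -> str:
    """Serialize a pixel-space box as plain text with normalized coordinates.

    Assumption: coordinates are divided by the image size so the text
    does not depend on resolution. No special tokens are introduced.
    """
    x1, y1, x2, y2 = box
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    body = ", ".join(f"{v:.{precision}f}" for v in coords)
    return f"{label}: [{body}]"


def text_to_box(text: str, width: int, height: int) -> Tuple[str, Box]:
    """Parse the plain-text form back into a label and a pixel-space box."""
    match = _BOX_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a box string: {text!r}")
    label = match.group(1)
    nx1, ny1, nx2, ny2 = (float(match.group(i)) for i in range(2, 6))
    return label, (nx1 * width, ny1 * height, nx2 * width, ny2 * height)
```

Because the output is plain text, the same format could in principle serve both single-object referring expressions and multi-object detection by emitting one such string per object.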

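Building on the object perception and localization capabilities of large vision language models, we introduce a novel language-prompted localization dataset and propose Griffon, a purely LVLM-based baseline that achieves state-of-the-art performance on the fine-grained RefCOCO series and approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.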