The rapid development of Multi-modality Large Language Models (MLLMs) has
driven a paradigm shift in computer vision towards versatile foundation
models. However, the evaluation of MLLMs on low-level visual perception and
understanding remains largely unexplored. To this end, we design
benchmark settings to emulate human language responses related to low-level
vision: low-level visual perception (A1), tested via visual question answering
on low-level attributes (e.g. clarity, lighting); and low-level visual
description (A2), which evaluates the low-level text descriptions that MLLMs
produce.
Furthermore, since pairwise comparison reduces ambiguity in responses and has
been widely adopted in human studies, we extend the
low-level perception-related question-answering and description evaluations of
MLLMs from single images to image pairs. Specifically, for perception (A1), we
construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999
image pairs, each accompanied by an open-ended question about its low-level
features; for description (A2), we propose the LLDescribe+ dataset, evaluating
MLLMs for low-level descriptions on 499 single images and 450 pairs.
Additionally, we evaluate the assessment ability (A3) of MLLMs, i.e. predicting
quality scores, by employing a softmax-based approach that enables all MLLMs to
generate quantifiable quality ratings, which are tested against human opinions
on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation,
we demonstrate that
several MLLMs have decent low-level visual competencies on single images, but
only GPT-4V exhibits higher accuracy on pairwise comparisons than on
single-image evaluations (as humans do). We hope that our benchmark will
motivate further
research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at this https URL
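
The softmax-based approach for assessment (A3) can be illustrated with a
minimal sketch. It assumes the rating comes from a two-way softmax over the
logits that a model assigns to one positive and one negative rating token. The
token pair (for example "good" and "poor"), the function name, and the way the
logits are obtained are assumptions made for illustration; the abstract does
not specify them.

```python
import math


def softmax_quality_score(logit_positive: float, logit_negative: float) -> float:
    """Return a rating in (0, 1) from two token logits.

    Assumption: logit_positive and logit_negative are the model's logits for
    a positive and a negative rating token (hypothetical choice of tokens).
    """
    # Subtract the larger logit before exponentiating for numerical stability.
    largest = max(logit_positive, logit_negative)
    exp_positive = math.exp(logit_positive - largest)
    exp_negative = math.exp(logit_negative - largest)
    # Two-way softmax: probability mass assigned to the positive token.
    return exp_positive / (exp_positive + exp_negative)


# Hypothetical logits for one image; a higher value means higher rated quality.
score = softmax_quality_score(2.0, 0.5)
```

Because the result is a continuous value rather than a discrete word, it can be
compared directly with human opinion scores.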

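This work designs a benchmark to evaluate the abilities of Multi-modality Large Language Models (MLLMs) in low-level visual perception and understanding, and extends the evaluation of low-level visual perception and description from single images to image pairs. The study finds that several MLLMs show decent low-level visual abilities on single images, but only GPT-4V achieves higher accuracy on pairwise comparisons than on single-image evaluations (similar to humans). The authors hope that this benchmark will stimulate further research into uncovering and enhancing these emerging capabilities of MLLMs.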

A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs

The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, the abilities of MLLMs in low-level visual
perception and understanding remain inadequately assessed. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate the potential abilities of MLLMs in three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 images from diverse sources, each paired
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs in answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset, consisting of long expert-labelled golden low-level text
descriptions of 499 images, together with a GPT-assisted pipeline that compares
the outputs of MLLMs with the golden descriptions. c) Besides these two tasks,
we further measure how well their visual quality assessment aligns with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess fundamental low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for targeted enhancement of these abilities in MLLMs.
We hope that our benchmark can encourage the research community to delve deeper
into discovering and enhancing this untapped potential of MLLMs.
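
Alignment between predicted quality scores and human opinion scores can be
quantified in several ways. The sketch below assumes rank correlation
(Spearman) and linear correlation (Pearson), which are common choices for IQA
datasets; the abstract does not name the metrics, so this choice, the function
names, and the sample values are all assumptions for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return covariance / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = shared_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical model scores and human opinion scores for four images.
predicted = [0.2, 0.5, 0.9, 0.7]
human = [1.0, 2.5, 4.8, 3.0]
rank_agreement = spearman(predicted, human)
linear_agreement = pearson(predicted, human)
```

Rank correlation only checks that images are ordered the same way as by human
raters, so it does not require the model's score range to match the human
rating scale.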

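By constructing a comprehensive benchmark covering three realms (low-level visual perception, low-level visual description, and visual quality assessment), this work evaluates the abilities of Multi-modality Large Language Models in low-level visual perception and understanding. It finds that they possess basic low-level visual skills, but that these skills are still unstable and relatively imprecise and require targeted enhancement.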