The rapid development of Multi-modality Large Language Models (MLLMs) has
driven a paradigm shift in computer vision, moving towards versatile
foundational models. However, evaluating MLLMs in low-level visual perception
and understanding remains largely unexplored. To this end, we design
benchmark settings to emulate human language responses related to low-level
vision: the low-level visual perception (A1) via visual question answering
related to low-level attributes (e.g. clarity, lighting); and the low-level
visual description (A2), which evaluates MLLMs on low-level text descriptions.
Given that pairwise comparison can better avoid ambiguity of responses and has
been adopted by many human experiments, we further extend the low-level
perception-related question-answering and description evaluations of MLLMs
from single images to image pairs. Specifically, for perception (A1), we
construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999
image pairs, each accompanied by an open-ended question about its low-level
features; for description (A2), we propose the LLDescribe+ dataset, evaluating
MLLMs for low-level descriptions on 499 single images and 450 pairs.
Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting
scores, by employing a softmax-based approach to enable all MLLMs to generate
quantifiable quality ratings, tested against human opinions in 7 image quality
assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that
several MLLMs have decent low-level visual competencies on single images, but
only GPT-4V exhibits higher accuracy on pairwise comparisons than on
single-image evaluations (as humans do). We hope that our benchmark will
motivate further
research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at this https URL
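
The abstract says predicted ratings are "tested against human opinions" but does not name the agreement metric. As a minimal sketch, assuming a rank correlation (Spearman's coefficient, a common choice in IQA work) is used, the comparison could look like the following. The function names and the sample values are illustrative assumptions, not taken from the benchmark.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(predicted), average_ranks(human))


# Made-up example values: model ratings vs. human mean opinion scores.
predicted_scores = [0.81, 0.35, 0.62, 0.10]
human_mos = [4.2, 2.9, 3.1, 1.5]
srcc = spearman(predicted_scores, human_mos)
```

A rank-based metric only asks that the model order images the way humans do, so the model's rating scale does not need to match the scale of the human scores.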

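By designing benchmark settings, this work evaluates the abilities of Multi-modality Large Language Models (MLLMs) in low-level visual perception and understanding, and extends the evaluation of low-level visual perception and description from single images to image pairs. The study finds that several MLLMs show decent low-level visual abilities on single images, but only GPT-4V achieves higher accuracy on pairwise comparisons than on single-image evaluations (similar to humans). The authors hope this benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.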

A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs

Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks, in which
a single model can respond to a broad range of natural human instructions. While
existing foundation models have shown exciting potential on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
In order to enhance these models, we conduct a large-scale subjective
experiment collecting a large amount of real human feedback on low-level
vision. Each piece of feedback follows a pathway that starts with a detailed
description of the low-level visual appearance (*e.g. clarity, color,
brightness*) of an image and ends with an overall conclusion, with an average
length of 45 words.
The constructed **Q-Pathway** dataset includes 58K detailed pieces of human
feedback on 18,973 images with diverse low-level appearance. Moreover, to
enable foundation models to robustly respond to diverse types of questions, we
design a GPT-participated conversion that processes this feedback into 200K
instruction-response pairs in diverse formats. Experimental results indicate
that **Q-Instruct** consistently elevates low-level perception and
understanding abilities across several foundational models. We anticipate that
our datasets can pave the way for a future in which general intelligence can
perceive and understand low-level visual appearance and evaluate visual quality
like a human. Our dataset, model zoo, and demo are published at:
this https URL
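
The pathway structure described above (a detailed description followed by an overall conclusion) and its conversion into instruction-response pairs can be pictured with a small data-structure sketch. Everything here is hypothetical: the class names, field names, and instruction wording are illustrative, and the fixed rule-based conversion is only a stand-in for the GPT-participated conversion, whose prompts and output formats are not given in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathwayFeedback:
    """One human feedback entry: a description, then an overall conclusion."""
    image_id: str
    description: str  # low-level appearance, e.g. clarity, color, brightness
    conclusion: str   # overall quality conclusion


@dataclass
class InstructionPair:
    """One instruction-response training pair tied to an image."""
    image_id: str
    instruction: str
    response: str


def to_instruction_pairs(feedback: PathwayFeedback) -> List[InstructionPair]:
    """Rule-based stand-in: emit two fixed-format pairs per feedback entry."""
    return [
        InstructionPair(
            feedback.image_id,
            "Describe the low-level visual appearance of this image and rate its quality.",
            f"{feedback.description} {feedback.conclusion}",
        ),
        InstructionPair(
            feedback.image_id,
            "How would you rate the overall quality of this image?",
            feedback.conclusion,
        ),
    ]


# Made-up example entry.
example = PathwayFeedback(
    image_id="img_0001",
    description="The image is blurry and underexposed.",
    conclusion="Overall, the quality is poor.",
)
pairs = to_instruction_pairs(example)
```

Keeping the description and the conclusion as separate fields is what lets a converter produce several question formats from a single entry.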

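Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm to low-level visual perception and understanding tasks, responding to a broad range of natural human instructions. Through a large-scale subjective experiment, a large amount of real human feedback on low-level vision was collected to build the Q-Pathway dataset, which contains 58K detailed pieces of feedback. Experimental results show that Q-Instruct improves the low-level perception and understanding abilities of several foundation models. The datasets and model demo are available at the published URL.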

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, the abilities of MLLMs on low-level visual
perception and understanding remain inadequately assessed. To address this
gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate potential abilities of MLLMs in three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs in answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure how well their visual quality assessment aligns with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess fundamental low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements of MLLMs for these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance this untapped potential of MLLMs.
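
The abstract does not spell out the softmax-based strategy. One plausible reading, given here as a minimal sketch under stated assumptions, takes the model's logits for two opposing candidate answer tokens and applies a softmax over just those two values, so the probability of the positive token serves as a score in [0, 1]. The choice of a "good" and a "poor" token, and the function name, are assumptions made for illustration.

```python
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax over candidate-token logits; returns P('good').

    Assumption: the caller has already read these two logits from the
    model's next-token distribution after a quality-rating prompt.
    """
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logit_good, logit_poor)
    exp_good = math.exp(logit_good - peak)
    exp_poor = math.exp(logit_poor - peak)
    return exp_good / (exp_good + exp_poor)


# Made-up logits: the model leans towards the positive token.
score = softmax_quality_score(2.0, 0.5)
```

Reading a probability from the logits yields a continuous score even when the model's text output would only ever be one of a few words, which is what makes comparison against human opinion scores possible.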

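By building a comprehensive benchmark spanning three realms (low-level visual perception, low-level visual description, and visual quality assessment), this work evaluates the abilities of Multi-modality Large Language Models on low-level visual perception and understanding. It finds that they possess fundamental low-level visual skills, but that these skills are still unstable and relatively imprecise and call for specific enhancements.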