The remarkable success of Large Language Models (LLMs) and instruction tuning
drives the evolution of Vision Language Models (VLMs) towards a versatile
general-purpose model. Yet, it remains unexplored whether current VLMs
genuinely possess quality object-level image understanding capabilities
determined from 'what objects are in the image?' or 'which object corresponds
to a specified bounding box?'. Our findings reveal that the image understanding
capabilities of current VLMs are strongly correlated with their zero-shot
performance on Vision Language (VL) tasks. This suggests that prioritizing
basic image understanding is crucial for VLMs to excel at VL tasks. To enhance
object-level image understanding, we propose Crayon Large Language and Vision
mOdel (CoLLaVO), which incorporates instruction tuning with crayon prompt as a
new visual prompt tuning scheme based on panoptic color maps. Furthermore, we
present a learning strategy of Dual QLoRA to preserve object-level image
understanding without forgetting it during visual instruction tuning, thereby
achieving a significant leap in zero-shot numerous VL benchmarks.

当前的视觉语言模型 (VLMs) 的图像理解能力与其在零样本视觉语言任务上的表现强相关。我们提出了一个新的视觉提示调整方案，即使用蜡笔提示进行指导调整，以提高对象级图像理解能力。此外，我们还提出了双重 QLoRA 学习策略，以在视觉指导调整过程中保持对象级图像理解能力，从而在零样本的多个视觉语言基准测试中取得了显著的进展。

CoLLaVO: 蜡笔大规模语言与视觉模型

CoLLaVO: Crayon Large Language and Vision mOdel

Vision language tasks, such as answering questions about or generating
captions that describe an image, are difficult tasks for computers to perform.
A relatively recent body of research has adapted the pretrained transformer
architecture introduced in \citet{vaswani2017attention} to vision language
modeling. Transformer models have greatly improved performance and versatility
over previous vision language models. They do so by pretraining models on a
large generic datasets and transferring their learning to new tasks with minor
changes in architecture and parameter values. This type of transfer learning
has become the standard modeling practice in both natural language processing
and computer vision. Vision language transformers offer the promise of
producing similar advancements in tasks which require both vision and language.
In this paper, we provide a broad synthesis of the currently available research
on vision language transformer models and offer some analysis of their
strengths, limitations and some open questions that remain.

视觉语言任务中，基于预训练的变压器架构在视觉语言建模方面表现出色，为视觉和语言结合的任务带来了类似的进展。