Humans possess the remarkable skill of Visual Perception, the ability to see
and understand the seen, helping them make sense of the visual world and, in
turn, reason. Multimodal Large Language Models (MLLMs) have recently achieved
impressive performance on vision-language tasks ranging from visual
question-answering and image captioning to visual reasoning and image
generation. However, when prompted to identify or count (perceive) the entities
in a given image, existing MLLM systems fail. Working towards developing an
accurate MLLM system for perception and reasoning, we propose using Versatile
vision enCoders (VCoder) as perception eyes for Multimodal LLMs. Firstly, we
feed the VCoder with perception modalities such as segmentation or depth maps,
improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and
outputs from off-the-shelf vision perception models to create our COCO
Segmentation Text (COST) dataset for training and evaluating MLLMs on the
object perception task. Thirdly, we introduce metrics to assess the object
perception abilities of MLLMs on our COST dataset. Lastly, we provide extensive
experimental evidence demonstrating the VCoder's improved object-level
perception skills over existing Multimodal LLMs, including GPT-4V. We
open-source our dataset, code, and models to promote research.

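Humans possess the remarkable skill of visual perception. Multimodal Large
Language Models (MLLMs) have recently achieved impressive performance on
vision-language tasks, but they have problems identifying or counting the
entities in an image. To improve the accuracy of Multimodal LLMs in perception
and reasoning, we propose using VCoder as a perception tool for Multimodal
LLMs; it improves their perception abilities by taking perception modalities
such as segmentation or depth maps as input. In addition, we use COCO images
and the outputs of off-the-shelf vision perception models to create the COST
dataset for training and evaluating MLLMs on the object perception task.
Finally, we provide extensive experimental evidence demonstrating VCoder's
improvement in object-level perception over other Multimodal LLMs, including
GPT-4V. We publicly release our dataset, code, and models to promote related
research.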