Recent Multimodal Large Language Models (MLLMs) exhibit impressive abilities
to perceive images and follow open-ended instructions. The capabilities of
MLLMs depend on two crucial factors: the model architecture, which facilitates
feature alignment between visual modules and large language models, and the
multimodal instruction tuning datasets used to teach human instruction
following. (i) For the model
architecture, most existing models introduce an external bridge module to
connect vision encoders with language models, which requires an additional
feature-alignment pre-training stage. In this work, we discover that compact
pre-trained vision-language models can inherently serve as "out-of-the-box"
bridges between vision and language. Based on this, we propose the Muffin
framework, which directly employs pre-trained vision-language models as
providers of visual signals. (ii) For the multimodal instruction tuning
datasets, existing methods overlook the complementary relationships between
different datasets and simply mix datasets from different tasks. Instead, we
propose the UniMM-Chat dataset, which exploits the complementarities of
datasets to generate 1.1M high-quality and diverse multimodal instructions. We
merge information describing the same image from diverse datasets and transform
it into more knowledge-intensive conversation data. Experimental results
demonstrate the effectiveness of the Muffin framework and the UniMM-Chat dataset.
Muffin achieves state-of-the-art performance on a wide range of vision-language
tasks, significantly surpassing prior state-of-the-art models such as LLaVA and
InstructBLIP. Our model and dataset are both accessible at
this https URL
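
The merging step described for UniMM-Chat, where information about the same image is collected from several datasets before being turned into conversation data, can be illustrated with a minimal sketch. The abstract does not specify the actual pipeline, so the function names, record layout, and context format below are hypothetical, and the toy records are made up for illustration only.

```python
from collections import defaultdict


def merge_annotations_by_image(datasets):
    """Group annotations from several datasets by the image they describe.

    `datasets` maps a dataset name to a list of (image_id, annotation) pairs
    (an assumed layout, not the format used by the authors).
    Returns {image_id: {dataset_name: [annotation, ...]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for name, records in datasets.items():
        for image_id, annotation in records:
            merged[image_id][name].append(annotation)
    # Convert the nested defaultdicts into plain dicts for predictable lookups.
    return {img: dict(by_source) for img, by_source in merged.items()}


def build_context(image_id, merged):
    """Flatten every annotation of one image into a single text context.

    The "[source] text" format is hypothetical; such a context could then be
    used as the basis for generating a knowledge-intensive conversation.
    """
    lines = []
    for name in sorted(merged[image_id]):
        for annotation in merged[image_id][name]:
            lines.append(f"[{name}] {annotation}")
    return "\n".join(lines)


# Toy, made-up records; not taken from UniMM-Chat.
toy = {
    "captions": [("img_1", "A dog runs on a beach."), ("img_2", "A red bus.")],
    "vqa": [("img_1", "Q: What animal is shown? A: A dog.")],
}
merged = merge_annotations_by_image(toy)
context = build_context("img_1", merged)
```

Keying on the image rather than on the source dataset is what lets annotations from different tasks complement each other: an image with both a caption and a question-answer pair yields a richer context than either source alone.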
