Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilities of Large Language Models (LLMs) through the incorporation of visual perception interfaces. Despite the emergence of exciting applications and the availability of diverse instruction tuning data, existing approaches often rely on CLIP or its variants as the visual branch and extract features only from the deep layers. A comprehensive analysis of the visual encoders in MLLMs is still lacking. In this paper, we conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs. Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. Surprisingly, the vision-only model DINO, which is not pretrained with text-image alignment, demonstrates promising performance as a visual branch within MLLMs. Simply equipped with an MLP layer for alignment, DINO surpasses CLIP on fine-grained perception tasks. Building upon these observations, we propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging to enhance the visual capabilities of MLLMs. We evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination. Experimental results demonstrate the superior performance of COMM compared to existing methods, showcasing its enhanced visual capabilities within MLLMs. Code will be made available at https://github.com/YuchenLiu98/COMM.

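Through an in-depth investigation of the effectiveness of different visual encoders in multi-modal large language models (MLLMs), this study finds that the shallow-layer features of CLIP have particular advantages for fine-grained tasks such as grounding and region understanding. It also finds that DINO, a vision model that is not pretrained with text-image alignment, shows promising performance as the visual branch of an MLLM: equipped with only an MLP layer for alignment, DINO surpasses CLIP on fine-grained perception tasks. Based on these observations, the study proposes a simple yet effective feature merging strategy, called COMM, which combines CLIP and DINO through multi-level feature merging to enhance the visual capabilities of MLLMs. Comprehensive experiments demonstrate the superior performance of COMM compared to existing methods and showcase its enhanced visual capabilities within MLLMs.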
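To make the idea of multi-level feature merging more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that features from several layers of each encoder are layer-normalized and combined with softmax-normalized scalar weights, that the merged DINO features pass through a two-layer MLP for alignment, and that the two branches are concatenated along the channel dimension. The function names (`merge_levels`, `merge_clip_dino`), the ReLU MLP, the layer counts, and the dimensions are all hypothetical choices for illustration; the abstract does not specify them.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def merge_levels(layer_feats, scale_logits):
    # Assumed merging rule: weighted sum of layer-normalized per-layer features.
    # layer_feats: list of (num_tokens, dim) arrays; scale_logits: (num_layers,).
    weights = softmax(np.asarray(scale_logits, dtype=float))
    normed = np.stack([layer_norm(f) for f in layer_feats])  # (L, N, D)
    return np.tensordot(weights, normed, axes=1)  # (N, D)


def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU, standing in for the alignment layer (assumption).
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def merge_clip_dino(clip_layers, dino_layers, clip_logits, dino_logits, mlp_params):
    # Merge each encoder's levels, align DINO with the MLP, then concatenate.
    clip_feat = merge_levels(clip_layers, clip_logits)
    dino_feat = mlp(merge_levels(dino_layers, dino_logits), *mlp_params)
    return np.concatenate([clip_feat, dino_feat], axis=-1)


# Toy usage with random arrays in place of real encoder outputs.
rng = np.random.default_rng(0)
num_tokens, clip_dim, dino_dim = 4, 8, 6
clip_layers = [rng.normal(size=(num_tokens, clip_dim)) for _ in range(3)]
dino_layers = [rng.normal(size=(num_tokens, dino_dim)) for _ in range(2)]
mlp_params = (
    rng.normal(size=(dino_dim, 16)), np.zeros(16),
    rng.normal(size=(16, clip_dim)), np.zeros(clip_dim),
)
fused = merge_clip_dino(clip_layers, dino_layers, np.zeros(3), np.zeros(2), mlp_params)
print(fused.shape)  # (4, 16)
```

In a trained model the scalar weights and MLP parameters would be learned, and the fused tokens would be passed to the language model; here they are fixed only to show the data flow.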

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models