Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose \modelname, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, \modelname~achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
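The query-based fusion described above can be illustrated with a minimal sketch. It assumes plain scaled dot-product cross-attention with no learned projections, followed by a residual addition. The function names (`cross_attention`, `fuse_layers`) and the toy feature shapes are hypothetical, chosen only for illustration; the released implementation may use a different attention variant and additional components.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends over all keys.

    queries: list of N tokens, each a list of D floats
    keys, values: lists of M tokens, each a list of D floats
    Returns N tokens of dimension D.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        token = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(token)
    return outputs


def fuse_layers(deep, shallow_layers):
    """Hypothetical fuser: deep features query the shallow features.

    deep: last-layer tokens (N x D), assumed semantically aligned.
    shallow_layers: list of earlier-layer token maps, each (N_i x D).
    Returns deep + attended shallow detail, so the deep features are
    kept and only enriched (residual connection).
    """
    # Pool the tokens of all shallow layers into one key/value set.
    shallow = [token for layer in shallow_layers for token in layer]
    details = cross_attention(deep, shallow, shallow)
    return [
        [d + x for d, x in zip(deep_tok, detail_tok)]
        for deep_tok, detail_tok in zip(deep, details)
    ]


# Toy example: 2 deep tokens, 2 shallow layers with 1 token each, D = 2.
deep = [[1.0, 0.0], [0.0, 1.0]]
shallow_layers = [[[0.5, 0.5]], [[0.5, 0.5]]]
fused = fuse_layers(deep, shallow_layers)
```

Because the deep features enter the output through the residual path, the fused tokens keep the same count and dimension as the last-layer feature map, which is why such a fuser could replace the single-layer visual representation without changing downstream shapes.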

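This work addresses the challenge of capturing intricate image details in multimodal large language models, noting that existing methods introduce redundancy and computational overhead. It proposes a multi-layer feature fuser that dynamically extracts details from shallow features and aligns them with deep features, significantly improving visual representation and benchmark performance and offering a more flexible, lightweight solution for fine-grained vision-language understanding.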

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding