Evaluating and rethinking the current landscape of Large Multimodal Models
(LMMs), we observe that widely used visual-language projection approaches
(e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet
ignore the visual knowledge-dimension alignment, i.e., connecting visuals to
their relevant knowledge. Visual knowledge plays a significant role in
analyzing, inferring, and interpreting information from visuals, helping
improve the accuracy of answers to knowledge-based visual questions. In this
paper, we explore improving LMMs through visual-language knowledge
alignment, targeting in particular challenging knowledge-based visual question
answering (VQA). To this end, we present a Cognitive Visual-Language Mapper
(CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a
Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning
stage. Specifically, we design the VKA based on the interaction between a small
language model and a visual encoder, training it on collected image-knowledge
pairs to achieve visual knowledge acquisition and projection. The FKA is employed
to distill the fine-grained visual knowledge of an image and inject it into
Large Language Models (LLMs). We conduct extensive experiments on
knowledge-based VQA benchmarks, and the results show that CVLM
significantly improves the performance of LMMs on knowledge-based VQA (an
average gain of 5.0%). Ablation studies further verify the effectiveness of
both the VKA and the FKA.
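
To make the notion of a visual-language projection concrete, the following is a minimal, dependency-free sketch of a generic projector: a few learnable query vectors cross-attend over visual patch features (loosely in the spirit of a Q-former), and a linear layer maps the pooled result into the LLM embedding space. It is not the VKA or the FKA, whose internals are not specified here. All names, dimensions, and the random initialization (`knowledge_queries`, `projection`, `map_to_llm_space`) are hypothetical and chosen only for illustration.

```python
import math
import random

Matrix = list[list[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (r x k) and b (k x c)."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, features: Matrix) -> Matrix:
    """Each query (q x d) takes a softmax-weighted average of features (n x d)."""
    dim = len(queries[0])
    scores = matmul(queries, transpose(features))  # q x n
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, features)  # q x d


def map_to_llm_space(patches: Matrix, knowledge_queries: Matrix,
                     projection: Matrix) -> Matrix:
    """Pool patch features with the queries, then project to the LLM width."""
    pooled = cross_attend(knowledge_queries, patches)
    return matmul(pooled, projection)


# Hypothetical sizes: 16 image patches of width 8, 4 queries, LLM width 12.
rng = random.Random(0)
patches = rand_matrix(16, 8, rng)
knowledge_queries = rand_matrix(4, 8, rng)
projection = rand_matrix(8, 12, rng)

visual_tokens = map_to_llm_space(patches, knowledge_queries, projection)
text_tokens = rand_matrix(5, 12, rng)  # stand-in for embedded prompt tokens
llm_input = visual_tokens + text_tokens  # concatenate along the sequence axis
```

In a real system the queries and the projection would be trained; what the training pairs contain (image-text descriptions versus image-knowledge pairs) is the distinction the abstract draws, and it is not captured by this toy forward pass.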

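In current research on large multimodal models, we evaluate and rethink widely used visual-language projection approaches (e.g., Q-former or MLP) and find that they focus on aligning image-text descriptions but neglect alignment along the visual knowledge dimension, i.e., connecting visual elements to their relevant knowledge. This paper mainly explores improving large multimodal models through visual-language knowledge alignment, with a particular focus on knowledge-based visual question answering. To this end, we propose a Cognitive Visual-Language Mapper (CVLM), comprising a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Extensive experiments on knowledge-based visual question answering benchmarks show that CVLM significantly improves LMM performance on knowledge-based visual question answering (an average gain of 5%), and ablation studies also verify the effectiveness of the VKA and the FKA.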