Training a Large Visual Language Model (LVLM) from scratch, like GPT-4, is resource-intensive. Our paper proposes an alternative called LMEye, a plug-and-play Interactive Perception Network for Large Language Models (LLMs), which aims to improve the accuracy of image understanding in LVLMs. Previous methods that infuse visual information into LLMs rely on a static visual mapping network and lack dynamic interaction between the LLM and the visual information. LMEye addresses this issue by allowing the LLM to incorporate visual information that is aligned with the human instruction. Specifically, the LMEye network consists of a static visual mapping network that provides the LLM with a basic perception of an image, plus additional linear layers responsible for acquiring requests from the LLM, decomposing image features, and transmitting the interleaved information back to the LLM, respectively. In this way, the LLM is in charge of understanding human instructions, sending them to the interactive perception network, and generating a response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal question answering and reasoning tasks, demonstrating that it significantly improves the zero-shot performance of LLMs on multimodal tasks compared to previous methods.
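The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `InteractivePerceptionSketch`, the layer names, the dimensions, and the random weights are all hypothetical, and the attention-style pooling used to pick out request-relevant image features is an assumed stand-in for the "decomposing image features" step, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Return a randomly initialised (weight, bias) pair for a linear layer."""
    return rng.normal(scale=0.02, size=(d_in, d_out)), np.zeros(d_out)

def apply(layer, x):
    """Apply a linear layer to a vector or a batch of row vectors."""
    weight, bias = layer
    return x @ weight + bias

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class InteractivePerceptionSketch:
    """Hypothetical sketch: static mapping plus a request/response path."""

    def __init__(self, d_img, d_llm):
        self.static_map = linear(d_img, d_llm)   # basic perception for the LLM
        self.request_in = linear(d_llm, d_img)   # acquires the LLM's request
        self.respond_out = linear(d_img, d_llm)  # transmits the result back

    def basic_perception(self, img_feats):
        # (n_patches, d_img) -> (n_patches, d_llm)
        return apply(self.static_map, img_feats)

    def interact(self, request_state, img_feats):
        # Project the LLM request into the image feature space.
        query = apply(self.request_in, request_state)             # (d_img,)
        # Assumed stand-in: weight image patches by relevance to the request.
        scores = img_feats @ query / np.sqrt(img_feats.shape[1])  # (n_patches,)
        weights = softmax(scores)
        selected = weights @ img_feats                            # (d_img,)
        # Map the request-specific features back to the LLM space.
        return apply(self.respond_out, selected)                  # (d_llm,)

net = InteractivePerceptionSketch(d_img=8, d_llm=16)
img_feats = rng.normal(size=(4, 8))          # stand-in for frozen image features
prefix = net.basic_perception(img_feats)     # static visual prefix for the LLM
request = rng.normal(size=16)                # stand-in for an LLM hidden state
extra = net.interact(request, img_feats)     # request-conditioned visual vector
```

In this sketch, `prefix` plays the role of the static visual input, while `extra` is the additional, instruction-dependent vector that the LLM would receive before generating its response.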

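This paper proposes LMEye, an interactive perception network that aims to improve the image understanding accuracy of Large Visual Language Models. The LMEye network comprises a static visual mapping network and several linear layers responsible for acquiring requests, decomposing image features, and transmitting the interleaved information. Through extensive experiments on multimodal question answering and reasoning tasks, we show that LMEye significantly improves the zero-shot performance of LLMs on multimodal tasks.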

LMEye: An Interactive Perception Network for Large Language Models