Existing Multimodal Large Language Models (MLLMs) perceive visual information by aligning visual features with the input space of Large Language Models (LLMs) and concatenating visual tokens with text tokens to form a unified input sequence. These methods show promising results on various vision-language tasks, but the visual tokens lengthen the input sequence and make computation expensive. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert the features into perceptual weights, and merge the perceptual weights with the LLM's weights. The LLM's input therefore requires no visual tokens, which shortens the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA, built around a perceptual weights generator. The generator converts visual features into low-rank perceptual weights, whose form is similar to LoRA. Experimental results show that VLoRA achieves comparable performance on various MLLM benchmarks while significantly reducing the computational cost of both training and inference. The code and models will be made open-source.

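Through parameter space alignment, we propose a novel way to represent visual information as model weights, merging the resulting perceptual weights with the LLM's weights. This approach does not require visual tokens as LLM input, which shortens the input sequence and greatly improves efficiency. Our VLoRA builds on this approach: a perceptual weights generator converts visual features into low-rank perceptual weights. Experiments on various benchmarks show that VLoRA achieves comparable performance for MLLMs while significantly reducing the computational cost of training and inference.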
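The parameter space alignment idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the VLoRA implementation: the abstract does not specify the generator's architecture, so this sketch mean-pools the visual features and uses two hypothetical linear maps (`to_b`, `to_a`) to produce low-rank factors. The additive merge follows the usual LoRA form, which the abstract suggests but does not state explicitly. All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mean_pool(features):
    """Average a list of feature vectors into a single vector (an assumed pooling step)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def make_generator(feat_dim, d_out, d_in, rank, seed=0):
    """Create hypothetical generator parameters: two random linear maps."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "to_b": rand(feat_dim, d_out * rank),  # produces the d_out x rank factor
        "to_a": rand(feat_dim, rank * d_in),   # produces the rank x d_in factor
        "shape": (d_out, d_in, rank),
    }


def generate_perceptual_weights(features, gen):
    """Map visual features to a d_out x d_in update of rank at most `rank`."""
    d_out, d_in, rank = gen["shape"]
    pooled = [mean_pool(features)]                # 1 x feat_dim
    flat_b = matmul(pooled, gen["to_b"])[0]
    flat_a = matmul(pooled, gen["to_a"])[0]
    b = [flat_b[i * rank:(i + 1) * rank] for i in range(d_out)]  # d_out x rank
    a = [flat_a[i * d_in:(i + 1) * d_in] for i in range(rank)]   # rank x d_in
    return matmul(b, a)


def merge(weight, delta):
    """Add the perceptual weights to a weight matrix (assumed LoRA-style merge)."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# Toy usage: 5 visual feature vectors of size 8 update a 4 x 6 weight matrix.
rng = random.Random(1)
features = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
weight = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
gen = make_generator(feat_dim=8, d_out=4, d_in=6, rank=2)
delta = generate_perceptual_weights(features, gen)
merged = merge(weight, delta)
print(len(merged), len(merged[0]))
```

In this sketch the image influences the model only through `delta`, so the text sequence fed to the model keeps its original length regardless of how many visual feature vectors the encoder produces.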

Visual Perception by Large Language Model's Weights