In this work, we introduce Libra, a prototype model that builds a decoupled vision system on a large language model (LLM). The vision system separates inner-modal modeling from cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM. This routes the vision and language flows during attention computation and enables different attention patterns for inner-modal modeling and cross-modal interaction. Experimental results demonstrate that Libra's dedicated design yields a strong multimodal LLM (MLLM) baseline that rivals existing works in the image-to-text scenario with only 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at https://github.com/YifanXu74/Libra.
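To make the idea of routing vision and language flows during attention more concrete, the following is a minimal NumPy sketch, not the Libra implementation. It assumes a single attention head, a causal mask, and three hypothetical weight sets: `lang_w` (standing in for the pretrained LLM projections), `vis_w` (standing in for a routed visual expert), and `bridge_w` (standing in for a cross-modal bridge that re-projects vision tokens when language queries attend to them). The function name `routed_attention` and all shapes are illustrative assumptions; the actual architecture may differ in every detail.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def routed_attention(x, is_vision, lang_w, vis_w, bridge_w):
    """Hypothetical single-head attention with modality routing.

    x:         (T, d) token states
    is_vision: (T,) boolean mask, True for vision tokens
    lang_w, vis_w: dicts with 'q', 'k', 'v' matrices of shape (d, d)
    bridge_w:  dict with 'k', 'v' matrices of shape (d, d)
    """
    t, d = x.shape
    m = is_vision[:, None]
    # Inner-modal routing: each token is projected by its own modality's weights.
    q = np.where(m, x @ vis_w["q"], x @ lang_w["q"])
    k = np.where(m, x @ vis_w["k"], x @ lang_w["k"])
    v = np.where(m, x @ vis_w["v"], x @ lang_w["v"])
    # Assumed bridge: separate keys/values used only for language -> vision pairs.
    k_b, v_b = x @ bridge_w["k"], x @ bridge_w["v"]
    cross = (~is_vision)[:, None] & is_vision[None, :]
    scores = np.where(cross, q @ k_b.T, q @ k.T) / np.sqrt(d)
    # Causal mask: each token attends to itself and earlier tokens only.
    causal = np.tril(np.ones((t, t), dtype=bool))
    attn = softmax(np.where(causal, scores, -np.inf))
    # Mix values per pair: bridge values for cross-modal pairs, routed values otherwise.
    return (attn * ~cross) @ v + (attn * cross) @ v_b

rng = np.random.default_rng(0)
d = 8

def make_weights(names):
    # Random placeholder weights; a real model would load trained parameters.
    return {n: rng.normal(size=(d, d)) for n in names}

lang_w, vis_w = make_weights("qkv"), make_weights("qkv")
bridge_w = make_weights("kv")
x = rng.normal(size=(5, d))
is_vision = np.array([True, True, True, False, False])  # 3 vision tokens, then 2 text tokens
out = routed_attention(x, is_vision, lang_w, vis_w, bridge_w)
print(out.shape)  # (5, 8)
```

In this sketch, vision tokens never touch the bridge weights, so their outputs depend only on the visual expert, while language tokens see vision content only through the bridge. That separation is one plausible reading of "different attention patterns in inner-modal modeling and cross-modal interaction scenarios"; the repository linked above is the authoritative source for the actual design.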

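This study introduces Libra, a prototype large language model with a decoupled vision system. Libra is trained on vision and language inputs through discrete auto-regressive modeling, enabling cross-modal interaction. Experiments show that Libra's dedicated design provides a strong MLLM baseline in the image-to-text scenario using only 50 million training samples, offering a new perspective for future multimodal foundation models.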

Libra: Building Decoupled Vision System on Large Language Models