Multimodal Large Language Models (MLLMs) have made significant strides in integrating visual and textual information, yet they often struggle to align these modalities effectively. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the tokenization strategies used in text-only Large Language Models. This enables Transformer models to learn and reason across modalities more effectively. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Our method improves performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
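To make the BPE principle concrete, the following is a minimal sketch of the generic BPE merge loop applied to a sequence of integer token IDs. It assumes the image has already been quantized into discrete codebook indices and flattened into a 1D sequence; that preprocessing, the function names, and the stopping rule (stop when no pair occurs at least twice) are illustrative assumptions, not the tokenizer described in this paper, which may treat 2D adjacency differently.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return (pair, count) for the most common adjacent pair, or None."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(seq, num_merges, first_new_id):
    """Greedily merge frequent adjacent pairs (hypothetical helper).

    `first_new_id` should be the size of the base codebook so that merged
    tokens receive fresh IDs. Stops early when no pair repeats.
    """
    seq = list(seq)
    merges = {}
    for _ in range(num_merges):
        best = most_frequent_pair(seq)
        if best is None or best[1] < 2:
            break
        pair, _count = best
        new_id = first_new_id + len(merges)
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges
```

For example, with a base codebook of 8 entries, `train_bpe([3, 7, 3, 7, 5, 3, 7], num_merges=5, first_new_id=8)` merges the pair `(3, 7)` into the new token `8` and returns the shortened sequence `[8, 8, 5, 8]`; no remaining pair repeats, so merging stops there.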

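This paper addresses the alignment problem that arises when multimodal large language models integrate visual and textual information. We propose a novel image tokenizer that applies the Byte-Pair Encoding (BPE) principle to visual data, directly incorporating structural prior information into image tokens and enabling more effective multimodal learning and reasoning. Experiments show that the method significantly improves the model's multimodal understanding capabilities and exhibits good scalability.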

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities