Recently, the remarkable advance of the Large Language Model (LLM) has
inspired researchers to transfer its extraordinary reasoning capability to data
across several modalities. Prevailing approaches primarily treat the visual
input as a prompt and focus exclusively on optimizing text generation by a
frozen LLM conditioned on the visual content. Such inequitable treatment of
vision and language heavily constrains the model's potential. In
this paper, we break through this limitation by representing both vision and
language in a unified form. To this end, we craft a visual tokenizer
that translates the non-linguistic image into a sequence of discrete tokens
like a foreign language that an LLM can read. The resulting visual tokens
carry high-level semantics comparable to those of a word and support a dynamic
sequence length that varies with the image content. Coupled with this visual
tokenizer, the presented foundation model, called LaVIT (Language-VIsion
Transformer), can handle both image and text indiscriminately under a unified
generative learning paradigm. Pre-trained on a web-scale image-text corpus,
LaVIT is empowered with impressive multi-modal comprehension capability.
Extensive experiments show that it outperforms existing models by a large
margin on downstream tasks. Our code and models will be available at
this https URL

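Recently, the remarkable progress of large language models has inspired
researchers to transfer their extraordinary reasoning capability to data across
multiple modalities. This paper breaks through the limitation of treating
visual content only as a prompt and focusing on optimizing the text generation
process, by expressing vision and language in a unified representation. With a
visual tokenizer that translates the non-linguistic image into a sequence of
discrete tokens that an LLM can read, LaVIT (Language-VIsion Transformer) can
handle images and text indiscriminately under a unified generative learning
paradigm. Pre-trained on a web-scale image-text corpus, LaVIT has impressive
multi-modal comprehension capability. Extensive experiments show that it
outperforms existing models by a large margin on downstream tasks. Our code and
models will be available at this https URL.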