Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.

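Recent image tokenizers such as VQ-VAE have made auto-regressive text-to-image generation possible, but these methods have not yet taken advantage of pre-trained language models and their adaptability. This work explores the question by adapting a pre-trained language model for auto-regressive text-to-image generation and finds that the pre-trained model offers limited help. Two explanations are given: image tokens differ significantly in semantics from text tokens, so a pre-trained language model models them no better than a randomly initialized one; and the text tokens in image-text datasets are too simple compared with typical language model pre-training data, which causes catastrophic degradation of the language model's capabilities.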

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation