There has been significant progress in text-conditional image generation
models. Recent advancements in this field depend not only on improvements in
model architectures, but also on vast quantities of text-image paired data.
However, creating these kinds of datasets is very costly and requires a
substantial amount of labor. Well-known face datasets lack corresponding text
captions, making it difficult to develop text-conditional image generation
models on these datasets. Some research has focused on developing text-to-image
generation models using only images without text captions. Here, we propose
CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide
multimodal text-image representations and strong image generation capabilities.
On the FFHQ dataset, our model outperformed previous state-of-the-art methods
by 4.4% in CLIP score and generated very realistic images for both
in-distribution and out-of-distribution text. The pretrained models and code
will soon be available at this https URL

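This paper proposes CLIP-VQDiffusion, a model that leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, the model exceeds the previous state-of-the-art methods by 4.4% in CLIP score and generates very realistic images for both in-distribution and out-of-distribution text.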
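
As a rough illustration of the metric referred to above, a CLIP score is commonly computed as the cosine similarity between CLIP's image embedding and its text embedding, clipped at zero. The sketch below assumes this common definition; the exact scaling used in the evaluation is not stated here, and the function names and toy vectors (stand-ins for real CLIP embeddings) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb, text_emb):
    """Assumed definition: cosine similarity clipped at zero.

    image_emb and text_emb stand in for the outputs of a CLIP image
    encoder and text encoder; no real model is loaded in this sketch.
    """
    return max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy 3-dimensional vectors in place of real CLIP embeddings.
image_emb = [0.2, 0.9, 0.1]
text_emb = [0.25, 0.8, 0.0]
print(f"CLIP score (toy example): {clip_score(image_emb, text_emb):.3f}")
```

A higher value indicates that the generated image and the prompt lie closer together in CLIP's shared embedding space, which is why the metric is used to judge text-image agreement.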