We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN. It enables text-driven sampling with an existing generative model without any external data or fine-tuning. To achieve this, we train a diffusion model conditioned on CLIP embeddings to sample latent vectors of a pre-trained StyleGAN, an approach we call clip2latent. We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data when training the conditional diffusion model. We demonstrate that clip2latent can generate high-resolution (1024×1024 pixel) images from text prompts with fast sampling, high image quality, and low training compute and data requirements. We also show that using the well-studied StyleGAN architecture, without further fine-tuning, lets us directly apply existing methods to control and modify the generated images, adding a further layer of control to our text-to-image pipeline.
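
The training setup described above can be illustrated with a minimal sketch. It only shows how (CLIP embedding, latent) pairs could be produced from the generator itself, with no external data, and how a latent is noised for diffusion training. Everything concrete here is an assumption made for illustration: `toy_generator` and `toy_clip_image_encoder` are hypothetical stand-ins for the frozen StyleGAN and CLIP image encoder, the dimensions are tiny, latents are drawn from a plain Gaussian, and the noising rule is the standard DDPM form rather than anything stated in the abstract. At sampling time the image embedding would be replaced by a CLIP text embedding, relying on the alignment between the two.

```python
import math
import random

LATENT_DIM = 8  # assumption: toy size, real StyleGAN latents are much larger
EMBED_DIM = 4   # assumption: toy size, real CLIP embeddings are much larger


def toy_generator(w):
    """Hypothetical stand-in for a frozen StyleGAN: latent -> 'image'."""
    return [math.tanh(x) for x in w]


def toy_clip_image_encoder(image):
    """Hypothetical stand-in for CLIP's image encoder: 'image' -> unit vector."""
    chunk = len(image) // EMBED_DIM
    e = [sum(image[i * chunk:(i + 1) * chunk]) for i in range(EMBED_DIM)]
    norm = math.sqrt(sum(x * x for x in e)) or 1.0  # guard against zero norm
    return [x / norm for x in e]


def make_training_pair(rng):
    """Sample a latent, render it, embed the render: no external data used."""
    w = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    e = toy_clip_image_encoder(toy_generator(w))
    return e, w


def add_noise(w, alpha_bar, rng):
    """Standard DDPM forward step: w_t = sqrt(a) * w + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in w]
    w_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
        for x, n in zip(w, eps)
    ]
    return w_t, eps


rng = random.Random(0)
pairs = [make_training_pair(rng) for _ in range(16)]

# One training example for a conditional denoiser: it would see
# (w_t, noise level, cond) and be trained to recover eps (or w0).
cond, w0 = pairs[0]
alpha_bar = 0.5
w_t, eps = add_noise(w0, alpha_bar, rng)
```

In a real pipeline the denoiser would be a neural network and the conditioning vector `cond` would come from CLIP; the sketch stops before that point because the abstract does not specify the network or the noise schedule.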

Text-driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP