Text-to-image diffusion models have demonstrated an impressive ability to produce high-quality outputs. However, they often struggle to accurately follow fine-grained spatial information in an input text. To address this, we propose a two-stage compositional approach to text-to-image generation. In the first stage, a diffusion-based generative model produces one or more aligned intermediate representations (such as depth or segmentation maps) conditioned on the text. In the second stage, a separate diffusion-based generative model maps these representations, together with the text, to the final output image. Our findings indicate that such a compositional approach improves image generation, yielding a notable improvement in FID score and a comparable CLIP score relative to the standard non-compositional baseline.
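
The two-stage data flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the models used in this work: `toy_denoiser`, `text_target`, `image_target`, and `generate` are hypothetical names, the "denoiser" is a hand-written update that simply pulls a sample toward a target rather than a trained network, and a short list of floats stands in for both the depth map and the image. Only the conditioning structure (text to intermediate representation, then text plus representation to image) reflects the approach.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
# A denoiser maps (noisy sample, step index, conditioning) to a refined sample.
Denoiser = Callable[[Vector, int, Dict[str, object]], Vector]


def sample(denoiser: Denoiser, cond: Dict[str, object], size: int,
           steps: int, rng: random.Random) -> Vector:
    """Generic iterative sampler: start from Gaussian noise, refine step by step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(steps)):
        x = denoiser(x, t, cond)
    return x


def toy_denoiser(target_fn: Callable[[Dict[str, object], int], Vector]) -> Denoiser:
    """Hypothetical stand-in for a trained network (assumption, not a real model).

    Each step moves the sample halfway toward a target computed from the conditioning.
    """
    def step(x: Vector, t: int, cond: Dict[str, object]) -> Vector:
        target = target_fn(cond, len(x))
        return [0.5 * xi + 0.5 * ti for xi, ti in zip(x, target)]
    return step


def text_target(cond: Dict[str, object], size: int) -> Vector:
    # Placeholder "text encoding": a constant derived from the prompt length.
    return [len(str(cond["text"])) / 100.0] * size


def image_target(cond: Dict[str, object], size: int) -> Vector:
    # Stage 2 is conditioned on both the text and the stage-1 representation.
    base = len(str(cond["text"])) / 100.0
    return [base + d for d in cond["depth"]]


def generate(text: str, size: int = 4, steps: int = 20, seed: int = 0):
    rng = random.Random(seed)
    # Stage 1: text -> intermediate representation (a toy "depth map").
    depth = sample(toy_denoiser(text_target), {"text": text}, size, steps, rng)
    # Stage 2: (text, intermediate representation) -> final "image".
    image = sample(toy_denoiser(image_target),
                   {"text": text, "depth": depth}, size, steps, rng)
    return depth, image
```

Calling `generate("a cat left of a dog")` returns the intermediate representation and the final sample. The point of the sketch is that the second sampler never sees the text alone: its conditioning dictionary always carries the stage-1 output as well.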

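This work addresses the difficulty that existing text-to-image diffusion models have with fine-grained spatial information by proposing a two-stage compositional approach to image generation. In the first stage, a diffusion-based generative model produces intermediate representations conditioned on the text; in the second stage, these representations are combined with the text to generate the final image. The results indicate that the approach improves image generation quality, with a notable improvement in FID score and a comparable CLIP score relative to the non-compositional baseline.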

Intermediate Representations for Enhanced Text-to-Image Generation Using Diffusion Models