Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
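To make the idea of masked cross-attention concrete, the following is a minimal NumPy sketch and not the BlobGEN implementation; the abstract does not specify the module's details. It assumes that each blob is a tilted ellipse with normalized center, semi-axes, and angle, that each blob carries a single embedding used as both key and value source, and that image positions outside every blob receive a zero update. The function names `ellipse_mask` and `masked_cross_attention` are hypothetical.

```python
import numpy as np


def ellipse_mask(h, w, cx, cy, a, b, theta):
    """Boolean (h, w) mask of a tilted ellipse.

    Assumption: center (cx, cy) and semi-axes (a, b) are normalized to [0, 1];
    theta is the rotation angle in radians.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    # Pixel centers in normalized coordinates, relative to the ellipse center.
    x = (xs + 0.5) / w - cx
    y = (ys + 0.5) / h - cy
    # Rotate into the ellipse's own axes.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (xr / a) ** 2 + (yr / b) ** 2 <= 1.0


def masked_cross_attention(queries, keys, values, masks):
    """Cross-attention in which each pixel attends only to blobs covering it.

    queries: (P, d) flattened visual features, one row per pixel.
    keys, values: (B, d) one row per blob (assumed: one embedding per blob).
    masks: (B, P) boolean, masks[i, p] is True if pixel p lies inside blob i.
    Returns a (P, d) array; pixels outside all blobs get zeros (an assumption).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (P, B)
    allowed = masks.T                               # (P, B)
    scores = np.where(allowed, scores, -np.inf)     # block pixel-blob pairs outside the mask
    covered = allowed.any(axis=1)
    scores[~covered] = 0.0                          # placeholder rows, zeroed below
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over allowed blobs only
    out = weights @ values
    out[~covered] = 0.0
    return out
```

With non-overlapping blobs, every covered pixel receives exactly the value vector of its own blob, which illustrates how masking keeps each blob's description from leaking into other regions.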


Compositional Text-to-Image Generation with Dense Blob Representations