Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.

利用大型语言模型 (LLM) 从文本提示中提取关键组件，包括前景对象的边界框坐标、各个对象的详细文本描述和简洁的背景上下文。这些组件构成了布局到图像生成模型的基础，该模型通过两个阶段的操作实现，初步生成全局场景后，使用迭代细化方案对内容进行评估和修正，以确保与文本描述的一致性，从而在生成复杂的场景时展现出比传统扩散模型更好的召回率，经由用户研究进一步验证了我们的方法在从错综复杂的文本输入中生成连贯详细场景方面的功效。

LLM蓝图：通过复杂和详细的提示实现文本生成图像