Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better-aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.

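We introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address incorrect attribute binding and/or catastrophic object neglect in text-guided image generation models. An object-centric attribute binding loss emerges naturally from approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the negative sampling technique. We further propose an object-centric intensity regularizer to prevent an object's attention from shifting excessively towards its attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better-aligned attention maps, our method shows great promise in further enhancing the text-controlled image editing ability of diffusion models.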

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models