This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after a few inference steps. The time point of convergence therefore naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from the previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computational complexity but also maintains model performance. This yields a simple, training-free method for efficient generation called TGATE, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
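The caching idea described above can be illustrated with a minimal sketch. It is not the TGATE implementation: the names `GatedCrossAttention`, `gate_step`, `toy_cross_attention`, and `run_inference` are hypothetical, the attention function is a trivial stand-in, and the update rule in the loop is only a placeholder for a real denoising step. The sketch assumes the cross-attention output has effectively converged by `gate_step`, so the last computed output is reused for all later steps.

```python
from typing import Callable, List, Optional

Vector = List[float]


def toy_cross_attention(hidden: Vector, text: Vector) -> Vector:
    """Stand-in for a real cross-attention layer (assumption, not the real op)."""
    bias = 0.1 * sum(text)
    return [h + bias for h in hidden]


class GatedCrossAttention:
    """Computes attention before `gate_step`, then returns a frozen cached output."""

    def __init__(self, attn_fn: Callable[[Vector, Vector], Vector], gate_step: int) -> None:
        self.attn_fn = attn_fn
        self.gate_step = gate_step
        self.cache: Optional[Vector] = None
        self.calls = 0  # number of times the real attention was evaluated

    def __call__(self, hidden: Vector, text: Vector, step: int) -> Vector:
        # Semantics-planning stage: compute attention and remember the latest output.
        if step < self.gate_step or self.cache is None:
            self.cache = self.attn_fn(hidden, text)
            self.calls += 1
        # Fidelity-improving stage: the cached output is returned unchanged.
        return self.cache


def run_inference(num_steps: int, gate_step: int) -> int:
    """Runs a toy denoising loop and returns how often attention was computed."""
    attn = GatedCrossAttention(toy_cross_attention, gate_step)
    hidden: Vector = [0.0, 1.0]
    text: Vector = [0.5, 0.5]
    for step in range(num_steps):
        attn_out = attn(hidden, text, step)
        # Placeholder update; a real model would apply its full denoising block here.
        hidden = [0.9 * h + 0.1 * a for h, a in zip(hidden, attn_out)]
    return attn.calls
```

With 25 steps and a gate at step 10, this toy loop evaluates the attention function only during the first 10 steps and reuses the cached output for the remaining 15, which is where the computational saving would come from.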

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models