Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the use of class priors, which provide only coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), a multi-stage generation framework that first extracts visual priors from previously generated samples and then uses these priors, taken from the early stages of diffusion sampling, to provide rich guidance for the diffusion model. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet-$256 \times 256$ dataset, where it reduces training cost by 7$\times$ compared to SiT and DiT while achieving better FID-50K scores. Our largest model, DoD-XL, achieves an FID-50K score of 1.83 with only 1 million training steps, surpassing other state-of-the-art methods without bells and whistles during inference.
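
The multi-stage control flow described above can be illustrated with a toy sketch. This is not the authors' implementation: `compress`, `toy_denoise`, and `dod_sample` are hypothetical names, the "sampler" is a simple relaxation toward a target rather than a diffusion model, average pooling stands in for the latent embedding module, and the assumption that the first stage is conditioned on the class label alone is ours. It only shows how each stage might consume a compressed visual prior extracted from the previous stage's sample.

```python
import random


def compress(sample, keep=4):
    """Hypothetical stand-in for the latent embedding module:
    average-pool the sample into `keep` values, discarding fine detail."""
    chunk = len(sample) // keep
    return [sum(sample[i * chunk:(i + 1) * chunk]) / chunk for i in range(keep)]


def toy_denoise(class_id, visual_prior, size, steps, rng):
    """Hypothetical sampler: start from Gaussian noise and relax toward a
    target built from the class label and, when given, the visual prior."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        for i in range(size):
            target = float(class_id)
            if visual_prior is not None:
                # Map position i onto the coarser prior and blend it in.
                coarse = visual_prior[i * len(visual_prior) // size]
                target = 0.5 * (target + coarse)
            x[i] += 0.5 * (target - x[i])
    return x


def dod_sample(class_id, num_stages=3, size=16, steps=10, seed=0):
    """Run several stages; each stage after the first is guided by a
    compressed prior taken from the previous stage's output."""
    rng = random.Random(seed)
    prior = None  # assumption: the first stage uses the class label only
    samples = []
    for _ in range(num_stages):
        sample = toy_denoise(class_id, prior, size, steps, rng)
        samples.append(sample)
        prior = compress(sample)
    return samples
```

Calling `dod_sample(1)` returns one sample per stage; in a real system each call to the sampler would be a full diffusion sampling run and the compression step would be a learned module.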

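This work addresses the weakness of conventional class-guided diffusion models in generating fine texture details, attributing it to their reliance on coarse class priors, which limits model performance. The proposed Diffusion on Diffusion (DoD) framework extracts visual priors from previously generated samples to provide rich guidance, significantly reducing training cost while improving the quality and detail of generated images. The results show that the DoD-XL model, with a limited number of training steps, achieves an FID-50K score significantly better than other state-of-the-art methods.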

Diffusion Models Need Visual Priors for Image Generation