Diffusion models have emerged as the new state-of-the-art family of deep
generative models, and their promising potential for text generation has
recently attracted increasing attention. Existing studies mostly
Diffusion models have the potential to enhance image-to-text generation and to surpass auto-regressive models. LaDiC pursues this by incorporating context modeling, a dedicated latent space for captions, a regularization module, a diffuser for semantic conversion, and a Back&Refine technique. It achieves state-of-the-art performance on the MS COCO dataset without pre-training or ancillary modules.