AbstractRecently,
diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop
→