Diffusion models have recently been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling and overlook both the resulting inefficiency and the interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model in which text and image inputs pass through dedicated layers and skip-connections that preserve modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that enables new scenarios for conditional generation while requiring only a simple joint distribution to be learned. Our experiments on the MS-COCO dataset show that our method generates multimodal text and image data of higher quality than existing multimodal diffusion models, with a comparable model size, faster training, faster multimodal sampling, and more flexible generation.
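To make the inpainting analogy concrete, the following is a minimal sketch of generic inpainting-style conditioning on top of a jointly trained model. It is not the sampling algorithm of this work, whose details the abstract does not give. The known modality (here, text) is re-noised to the current step and clamped, while the unknown modality (image) is denoised with a standard DDPM ancestral update. All names (`make_schedule`, `toy_joint_denoiser`, `sample_image_given_text`), the linear noise schedule, and the use of flat Python lists as latents are assumptions made for illustration. The denoiser is a hypothetical stand-in for a trained joint noise predictor.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption); returns (betas, alphas, alpha_bars)."""
    if num_steps == 1:
        betas = [beta_start]
    else:
        step = (beta_end - beta_start) / (num_steps - 1)
        betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_joint_denoiser(x_img, x_txt, t):
    """Hypothetical stand-in for a trained joint noise predictor.

    A real model would take both noisy modalities and predict the noise in
    each; this placeholder predicts zero noise so the sketch is runnable.
    """
    return [0.0] * len(x_img), [0.0] * len(x_txt)


def sample_image_given_text(txt_known, img_dim, denoiser, num_steps=50, seed=0):
    """Inpainting-style conditional sampling from a joint (image, text) model."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # The unknown modality starts from pure Gaussian noise.
    x_img = [rng.gauss(0.0, 1.0) for _ in range(img_dim)]

    for t in reversed(range(num_steps)):
        a_bar = alpha_bars[t]

        # Clamp the known modality: forward-noise the clean text to level t.
        x_txt = [
            math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for v in txt_known
        ]

        # The joint model sees both modalities; only the image is updated.
        eps_img, _ = denoiser(x_img, x_txt, t)

        # Standard DDPM posterior mean for the image latent.
        coef = betas[t] / math.sqrt(1.0 - a_bar)
        mean = [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(x_img, eps_img)]

        if t > 0:
            sigma = math.sqrt(betas[t])
            x_img = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x_img = mean  # no noise is added at the final step

    return x_img, list(txt_known)
```

Swapping the roles of the two modalities gives text-from-image generation from the same joint model, and clamping neither gives unconditional joint generation. This illustrates why, under this kind of scheme, only a single joint distribution needs to be learned.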

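Using the Partially Shared U-Net (PS-U-Net) architecture and a new efficient multimodal sampling method, this work develops a model that generates high-quality multimodal text and image data, with a size comparable to existing models, faster training, faster multimodal sampling, and more flexible generation.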

Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net