We introduce a multi-modal diffusion model tailored for the bi-directional
conditional generation of video and audio. Recognizing the importance of
accurate alignment between video and audio events in multi-modal generation
tasks, we propose a joint contrastive training loss to enhance the
synchronization between visual and auditory events. We conduct experiments on
multiple datasets to evaluate the proposed model, assessing generation quality
and alignment performance with both objective and subjective metrics. Our findings
demonstrate that the proposed model outperforms the baseline, substantiating
its effectiveness and efficiency. Notably, the contrastive loss improves
audio-visual alignment, particularly in the high-correlation video-to-audio
generation task. These results indicate the
potential of our proposed model as a robust solution for improving the quality
and alignment of multi-modal generation, thereby contributing to the
advancement of video and audio conditional generation systems.
