We introduce a multi-modal diffusion model tailored for the bi-directional
conditional generation of video and audio. Recognizing the importance of
accurate alignment between video and audio events in multi-modal generation
tasks, we propose a joint contrastive training loss to enhance the
synchronization between visual and auditory occurrences. We conduct
comprehensive experiments on multiple datasets to evaluate the efficacy of
our proposed model, assessing generation quality and alignment performance
with both objective and subjective metrics. Our findings
demonstrate that the proposed model outperforms the baseline, substantiating
its effectiveness and efficiency. Notably, the incorporation of the contrastive
loss results in improvements in audio-visual alignment, particularly in the
high-correlation video-to-audio generation task. These results indicate the
potential of our proposed model as a robust solution for improving the quality
and alignment of multi-modal generation, thereby contributing to the
advancement of video and audio conditional generation systems.
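
The abstract does not give the form of the joint contrastive loss, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss over a batch of paired video and audio embeddings. The function names, the temperature value, and the assumption that pair i of the video batch matches pair i of the audio batch are all illustrative and are not taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_row(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_alignment_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE sketch (assumed form, not the paper's exact loss).

    video_emb[i] and audio_emb[i] are treated as a positive pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(v[i], a[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-audio direction: each video should pick out its own audio.
    v2a = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Audio-to-video direction: same computation on the transposed logits.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    a2v = -sum(log_softmax_row(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2a + a2v)
```

With correctly paired embeddings the loss is close to zero, and it grows when the audio batch is shuffled relative to the video batch, which is the behavior a synchronization-oriented training signal would need.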

CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

We propose the first joint audio-video generation framework that brings
engaging watching and listening experiences simultaneously, towards
high-quality realistic videos. To generate joint audio-video pairs, we propose
a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two coupled
denoising autoencoders. In contrast to existing single-modal diffusion models,
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising
process by design. Two subnets for audio and video learn to gradually generate
aligned audio-video pairs from Gaussian noise. To ensure semantic consistency
across modalities, we propose a novel random-shift based attention block
bridging over the two subnets, which enables efficient cross-modal alignment,
and thus reinforces the audio-video fidelity for each other. Extensive
experiments show superior results in unconditional audio-video generation, and
zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve
the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of
10k votes further demonstrate dominant preferences for our model. The code and
pre-trained models can be downloaded at
this https URL
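
The abstract names a random-shift based attention block but does not spell out its mechanics. The sketch below is a simplified illustration of the general idea, under stated assumptions: the audio sequence is split into one equal-length segment per video frame, a single random shift is drawn per call, and each frame attends only to the audio segment at its shifted position. The windowing scheme, the function names, and the use of raw features as queries, keys, and values (no learned projections) are all hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def random_shift_window(num_frames, num_audio_steps, frame_idx, shift):
    """Audio step indices that frame `frame_idx` attends to (assumed scheme).

    The audio is cut into `num_frames` equal segments; trailing steps that
    do not fill a whole segment are ignored.
    """
    if num_audio_steps < num_frames:
        raise ValueError("need at least one audio step per video frame")
    seg = num_audio_steps // num_frames
    start = ((frame_idx + shift) % num_frames) * seg
    return list(range(start, start + seg))

def random_shift_cross_attention(video, audio, rng):
    """Each video frame attends to one randomly shifted audio segment.

    video: list of per-frame feature vectors; audio: list of per-step
    feature vectors of the same dimension. Returns (contexts, shift).
    """
    num_frames = len(video)
    dim = len(video[0])
    shift = rng.randrange(num_frames)  # one shift shared by all frames
    contexts = []
    for f, query in enumerate(video):
        idx = random_shift_window(num_frames, len(audio), f, shift)
        # Scaled dot-product scores against the windowed audio steps only.
        scores = [
            sum(q * k for q, k in zip(query, audio[t])) / math.sqrt(dim)
            for t in idx
        ]
        weights = softmax(scores)
        # Weighted sum of the windowed audio features.
        ctx = [sum(w * audio[t][d] for w, t in zip(weights, idx)) for d in range(dim)]
        contexts.append(ctx)
    return contexts, shift
```

Restricting each frame to one window keeps the attention cost proportional to the window length rather than the full audio length, which is one plausible reading of the efficiency claim; drawing a new shift on every call lets different frame and segment pairings interact over the course of training.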

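This paper presents a model based on Multi-Modal Diffusion that uses two coupled autoencoders for sequential multi-modal denoising, and proposes a random-shift attention block for cross-modal alignment, in order to generate audio-video frames and improve audio-video quality.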