In this work, we systematically study music generation conditioned solely on
video. First, we present a large-scale dataset of 190K video-music pairs
spanning genres such as movie trailers, advertisements, and documentaries.
Furthermore, we propose VidMuse, a simple framework for
generating music aligned with video inputs. VidMuse stands out by producing
high-fidelity music that is both acoustically and semantically aligned with the
video. By incorporating local and global visual cues through Long-Short-Term
modeling, VidMuse creates musically coherent audio tracks that consistently
match the video content. In extensive experiments, VidMuse outperforms
existing models in terms of audio quality, diversity, and audio-visual
alignment. The code and datasets will be available at
this https URL
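
The abstract does not say how the local and global visual cues are selected. As a rough illustration only, one way to picture the idea is to pair a dense window of nearby frames (short-term) with a few frames spread over the whole video (long-term). The function name, its parameters, and the uniform sampling rule below are assumptions made for this sketch, not VidMuse's actual procedure.

```python
def sample_frame_indices(num_frames, window_start, window_len, num_global):
    """Return (local, global) frame indices for one generation step.

    Hypothetical helper: the sampling rule is an assumption for
    illustration, not taken from the VidMuse paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    # Local (short-term) cue: a dense, contiguous window clipped to the video.
    start = max(0, min(window_start, num_frames - 1))
    end = min(num_frames, start + window_len)
    local_idx = list(range(start, end))

    # Global (long-term) cue: frames spread evenly over the whole video.
    if num_global <= 0:
        global_idx = []
    elif num_global == 1:
        global_idx = [num_frames // 2]
    else:
        step = (num_frames - 1) / (num_global - 1)
        global_idx = [round(i * step) for i in range(num_global)]

    return local_idx, global_idx


if __name__ == "__main__":
    local_idx, global_idx = sample_frame_indices(100, 40, 8, 4)
    print("local:", local_idx)
    print("global:", global_idx)
```

In a setup like this, a model would encode both index sets and fuse them, so that each music segment is conditioned on what is happening right now as well as on the overall video.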

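This paper systematically studies music generation conditioned solely on video and presents a large-scale dataset along with a simple framework named VidMuse. Guided by local and global visual cues, the framework uses long-short-term modeling to create audio tracks consistent with the video content, achieving high-fidelity music generation with audio-visual alignment. Extensive experiments show that VidMuse outperforms existing models in audio quality, diversity, and audio-visual alignment.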

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Self-supervised sound source localization is usually challenged by modality
inconsistency. In recent studies, strategies based on contrastive learning have
shown promise in establishing a consistent correspondence between audio and
sound sources in visual scenes. Unfortunately, insufficient attention to the
influence of heterogeneity across modality features still limits further
improvement of this scheme, which motivates our work. In this study, an
Induction Network is proposed to bridge the modality gap more effectively. By
decoupling the gradients of the visual and audio modalities, discriminative
visual representations of sound sources can be learned with the designed
Induction Vector in a bootstrap manner, which also enables the audio modality
to be aligned consistently with the visual modality. In addition to a visual
weighted contrastive loss, an adaptive threshold selection strategy is
introduced to enhance the robustness of the Induction Network. Extensive
experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate
superior performance compared to other state-of-the-art works in different
challenging scenarios. The code is
available at this https URL
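
For readers unfamiliar with the contrastive strategies mentioned above, the following is a minimal sketch of a generic audio-to-visual InfoNCE loss, in which matched audio and visual embeddings are pulled together and mismatched ones pushed apart. It is a textbook formulation, not the paper's visual weighted contrastive loss, and it includes neither the Induction Vector nor gradient decoupling. The function names and the temperature value of 0.07 are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio, visual, temperature=0.07):
    """Mean audio-to-visual InfoNCE loss.

    audio[i] and visual[i] are assumed to be a matched (positive) pair;
    every other visual[j] in the batch acts as a negative for audio[i].
    """
    n = len(audio)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio[i], visual[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(audio, aligned))
    print("shuffled loss:", info_nce(audio, shuffled))
```

The loss is near zero when each audio embedding is most similar to its own visual embedding, and grows when the pairing is wrong, which is the signal that drives the correspondence learning described in the abstract.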

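By introducing an Induction Network and an adaptive threshold selection strategy, this study proposes a self-supervised sound source localization method that addresses modality inconsistency and achieves consistent audio-visual alignment. Experiments on the SoundNet-Flickr and VGG-Sound Source datasets confirm its superior performance over other state-of-the-art methods in different challenging scenarios.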