BriefGPT.xyz
Jan, 2025
Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
Darius Petermann, Mahdi M. Kalayeh
TL;DR
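This work addresses the scarcity of paired audio-visual data needed to train audio-to-image generation models. We propose a scalable image sonification framework that uses the reasoning capabilities of modern vision-language models to artificially pair data from different modalities. Models trained with this approach perform on par with state-of-the-art methods and exhibit several interesting auditory capabilities, such as semantic mixing and sound field modeling.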
Abstract
Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the