Deep learning based visual to sound generation systems essentially need to be
developed particularly considering the synchronicity aspects of visual and
audio features with time. In this research we introduce a novel task of guiding
a class conditioned generative adversarial network with the temporal visual
information of a video input for visual to sound generation task adapting the
synchronicity traits between audio-visual modalities. Our proposed FoleyGAN
model is capable of conditioning action sequences of visual events leading
towards generating visually aligned realistic sound tracks. We expand our
previously proposed Automatic Foley dataset to train with FoleyGAN and evaluate
our synthesized sound through human survey that shows noteworthy (on average
81\%) audio-visual synchronicity performance. Our approach also outperforms in
statistical experiments compared with other baseline models and audio-visual
datasets.

本研究提出了一种基于深度学习的视听生成模型，通过使用时间上的视觉信息来引导生成模型输出音频，以适应视听模态之间的同步性，该模型能够生成逼真的视听同步音轨，并且在人员调查和统计实验中的表现优于其他基线模型和已有的视听数据集。