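TL;DR: This study aims to identify sound components in a given sound mixture using visual cues. It proposes two models, each using a single video frame, that take the sound source category as information for the separation process. In experiments on the MUSIC dataset, both models achieve comparable or better performance than several baseline methods.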
Abstract
Visual sound source separation aims at identifying sound components from a
given sound mixture in the presence of visual cues. Prior works have
demonstrated impressive results, but at the expense of large multi-stage
architectures and complex data representations (e.g. optical flow