Recognizing sounds is a key aspect of computational audio scene analysis and
machine perception. In this paper, we argue that sound recognition is
inherently a multi-modal audiovisual task: sounds are easier to differentiate
using both the audio and visual modalities than using either one
alone. We present an audiovisual fusion model that learns to recognize sounds
from weakly labeled video recordings. The proposed fusion model utilizes an
attention mechanism to dynamically combine the outputs of the individual audio
and visual models. Experiments on the large-scale sound events dataset,
AudioSet, demonstrate the efficacy of the proposed model, which outperforms
single-modal models as well as state-of-the-art fusion and multi-modal models.
We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming
the prior state of the art by +4.35 mAP (approximately 10.4% relative).
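
The reported gain and relative improvement imply a prior state-of-the-art score that is not stated directly. As a quick consistency check, assuming the relative figure is computed against that prior score:

\[
\mathrm{mAP}_{\text{prior}} = 46.16 - 4.35 = 41.81,
\qquad
\frac{4.35}{41.81} \approx 0.104 = 10.4\%.
\]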

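This paper proposes an audiovisual fusion model that uses an attention mechanism to dynamically combine the outputs of separate audio and visual models in order to recognize sounds. Experiments show that the model outperforms single-modal and multi-modal fusion models on audio scene analysis and machine perception.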
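
The attention-based fusion described above can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: each single-modal model is taken to emit per-class probabilities, a single linear layer scores the two modalities from the concatenated outputs, and a softmax over those two scores gives input-dependent mixing weights. The names `attention_fusion`, `w_att`, and `b_att` are hypothetical, and the random inputs stand in for real model outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(audio_probs, visual_probs, w_att, b_att):
    """Combine two per-class prediction vectors with input-dependent weights.

    audio_probs, visual_probs: lists of length n_classes.
    w_att: 2 rows of length 2 * n_classes (hypothetical attention weights).
    b_att: 2 biases, one per modality.
    """
    # Attention input: concatenation of both models' outputs (an assumption).
    features = audio_probs + visual_probs
    # One score per modality from a linear layer.
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(w_att, b_att)
    ]
    # Softmax across the two modalities, so the weights sum to 1.
    alpha_audio, alpha_visual = softmax(scores)
    # Weighted combination of the single-modal predictions.
    fused = [
        alpha_audio * a + alpha_visual * v
        for a, v in zip(audio_probs, visual_probs)
    ]
    return fused, (alpha_audio, alpha_visual)


# Toy stand-ins for model outputs; not real data.
random.seed(0)
n_classes = 5
audio_probs = [random.random() for _ in range(n_classes)]
visual_probs = [random.random() for _ in range(n_classes)]
w_att = [[random.gauss(0.0, 0.1) for _ in range(2 * n_classes)] for _ in range(2)]
b_att = [0.0, 0.0]

fused, weights = attention_fusion(audio_probs, visual_probs, w_att, b_att)
print("modality weights (audio, visual):", weights)
print("fused predictions:", fused)
```

Because the weights depend on the inputs, the mix between audio and visual predictions can change from one recording to the next, which is the sense in which such a combination is dynamic. In a trained system the attention parameters would be learned jointly with the weak clip-level labels rather than drawn at random as they are here.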