自监督多感官特征的音频-视觉场景分析

Apr, 2018

自监督多感官特征的音频-视觉场景分析

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Andrew Owens, Alexei A. Efros

TL;DR本文提出了一种融合多感官表征的方法，通过神经网络自动预测视频帧和音频的时间对齐情况，实现声音定位、视听行为识别和音频源分离等三个应用。

Abstract

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused →