audio-visual event localization (AVEL) is the task of temporally localizing
and classifying \emph{audio-visual events}, i.e., events simultaneously visible
and audible in a video. In this paper, we solve AVEL in a weakly-supervised
setting, where only video-level event labels (their pr