State-of-the-art architectures for Temporal Action Localization (TAL) in
untrimmed videos have considered only RGB and Flow modalities, leaving the
information-rich audio modality unexploited. Audio fusion has been
explored for the related but arguably easier problem of trimmed (clip-level)
action recognition. However, TAL poses a unique set of challenges. In this
paper, we propose simple but effective fusion-based approaches for TAL. To the
best of our knowledge, our work is the first to jointly consider audio and
video modalities for supervised TAL. We experimentally show that our schemes
consistently improve performance for state-of-the-art video-only TAL
approaches. Specifically, they help achieve new state-of-the-art performance on
the large-scale benchmark datasets ActivityNet-1.3 (54.34 mAP@0.5) and THUMOS14
(57.18 mAP@0.5). Our experiments include ablations involving multiple fusion
schemes, modality combinations and TAL architectures. Our code, models and
associated data are available at this https URL

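This paper proposes simple but effective fusion-based methods and is the first to jointly consider audio and video modalities for supervised Temporal Action Localization (TAL) in untrimmed videos. Across ablations over multiple fusion schemes, modality combinations, and TAL architectures, experiments on the large-scale benchmark datasets ActivityNet-1.3 and THUMOS14 show that our schemes consistently improve the performance of leading video-only TAL methods and, in particular, reach a new state of the art in mAP@0.5.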