We address action temporal localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in action temporal localization via multi-stage segment-based 3D ConvNets: (1) a proposal stage identifies candidate segments in a long video that may contain actions; (2) a classification stage learns one-vs-all action classification model to serve as initialization for the localization stage; and (3) a localization stage fine-tunes on the model learnt in the classification stage to localize each action instance. We propose a novel loss function for the localization stage to explicitly consider temporal overlap and therefore achieve high temporal localization accuracy. On two large-scale benchmarks, our approach achieves significantly superior performances compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and increased from 15.0% to 19.0% on THUMOS 2014, when the overlap threshold for evaluation is set to 0.5.

本研究提出了一种基于三种分段3D卷积神经网络的方法，用于解决未经修剪的长视频中的时间动作定位问题，其中提出网络用于识别可能包含动作的候选段，分类网络以一对多动作分类模型进行学习以作为定位网络的初始化，用于定位每个动作实例。

使用多阶段CNN在未修剪的视频中进行时间动作定位