Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos. In particular, StNet stacks N successive video frames into a \emph{super-image} with 3N channels and applies 2D convolution to super-images to capture local spatial-temporal relationships. To model global spatial-temporal relationships, we apply temporal convolution to the local spatial-temporal feature maps. Specifically, we propose a novel temporal Xception block, which applies separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
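
To make the two ideas above concrete, the following is a minimal NumPy sketch and not the authors' implementation. It shows (a) how N successive RGB frames can be stacked into super-images with 3N channels, and (b) one possible reading of a separable temporal convolution: a temporal-wise (per-channel) 1D convolution followed by a channel-wise (pointwise) mixing. The function names, the tensor layouts, the zero padding, the order of the two convolutions, and the omission of normalization and nonlinearities are all assumptions made for illustration.

```python
import numpy as np


def make_super_images(frames, n):
    """Stack every n successive frames into one super-image.

    frames: array of shape (T * n, 3, H, W), RGB frames in temporal order.
    Returns an array of shape (T, 3 * n, H, W); channels 3j..3j+2 of
    super-image t hold frame t * n + j. (Hypothetical helper.)
    """
    total, channels, height, width = frames.shape
    if total % n != 0:
        raise ValueError("number of frames must be a multiple of n")
    return frames.reshape(total // n, n * channels, height, width)


def separable_temporal_conv(features, temporal_kernels, channel_weights):
    """Temporal-wise convolution followed by channel-wise mixing.

    features:         (T, C) sequence of per-super-image feature vectors.
    temporal_kernels: (C, K) one 1D kernel per channel, K odd (assumed).
    channel_weights:  (C, C_out) pointwise mixing across channels.
    Zero padding keeps the temporal length T unchanged (an assumption).
    """
    steps, _ = features.shape
    kernel_size = temporal_kernels.shape[1]
    if kernel_size % 2 != 1:
        raise ValueError("kernel size must be odd")
    pad = kernel_size // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))

    # Temporal-wise step: each channel is filtered along time independently.
    temporal_out = np.zeros(features.shape, dtype=float)
    for offset in range(kernel_size):
        temporal_out += padded[offset:offset + steps] * temporal_kernels[:, offset]

    # Channel-wise step: mix channels at every time step.
    return temporal_out @ channel_weights


if __name__ == "__main__":
    clip = np.random.rand(35, 3, 8, 8)          # 7 groups of N = 5 frames
    supers = make_super_images(clip, n=5)
    print(supers.shape)                         # (7, 15, 8, 8)

    feats = np.random.rand(7, 16)               # stand-in for 2D-CNN features
    out = separable_temporal_conv(
        feats, np.random.rand(16, 3), np.random.rand(16, 32)
    )
    print(out.shape)                            # (7, 32)
```

Splitting the temporal and channel operations keeps the parameter count at C·K + C·C_out rather than C·C_out·K for a full temporal convolution, which is the usual motivation for separable designs of this kind.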

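This paper proposes a new spatial-temporal network (StNet) architecture for local and global spatial-temporal modeling. It stacks N consecutive video frames into a super-image, applies 2D convolution to the super-images to capture local spatial-temporal relationships, and then applies temporal convolution to the local spatial-temporal feature maps to model global spatial-temporal relationships. The method outperforms existing approaches in action recognition and strikes a favorable balance between model complexity and accuracy. The experimental results indicate that it can be widely applied to learning video representations.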

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition