We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs).Our method relies on percepts that are extracted from all level of a deep convolutional network trained on the large ImageNet dataset. While high-level percepts contain highly discriminative information, they tend to have a low-spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. Using low-level percepts can leads to high-dimensionality video representations. To mitigate this effect and control the model number of parameters, we introduce a variant of the GRU model that leverages the convolution operations to enforce sparse connectivity of the model units and share parameters across the input spatial locations. We empirically validate our approach on both Human Action Recognition and Video Captioning tasks. In particular, we achieve results equivalent to state-of-art on the YouTube2Text dataset using a simpler text-decoder model and without extra 3D CNN features.

使用门控循环单元递归网络学习视频的感知特征，从深度卷积网络的各个层级提取感知特征，包括高级和低级特征，结合前者的特殊信息和后者的空间信息，使用改进的GRU模型控制模型参数的数量，并在人类动作识别和视频字幕生成任务中证明其有效性。

深入探究卷积网络用于学习视频表示