Long-range and short-range temporal modeling are two complementary and
crucial aspects of video recognition. Most state-of-the-art methods focus on
short-range spatio-temporal modeling and then average multiple snippet-level
predictions to yield the final video-level prediction. As a result, their
video-level prediction ignores how the video's spatio-temporal features evolve
along the temporal dimension. In this paper, we introduce a novel Dynamic
Segment Aggregation (DSA) module to capture the relationships among snippets.
More specifically, we attempt to generate a dynamic kernel for a convolutional
operation that adaptively aggregates long-range temporal information among
adjacent snippets. The DSA module is an efficient plug-and-play module and can
be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform
powerful long-range modeling with minimal overhead. We call the resulting
video architecture DSANet. We conduct extensive experiments on several
video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400,
Something-Something V1 and ActivityNet) to show its superiority. Our proposed
DSA module is shown to benefit various video recognition models significantly.
For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is
improved from 74.9% to 78.2% on Kinetics-400. Code is available at
this https URL
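
The dynamic-kernel aggregation described above can be sketched roughly as
follows. This is a minimal NumPy illustration, not the authors' implementation:
the function name `dynamic_segment_aggregation`, the projection matrix `w_gen`,
the softmax normalization, and the per-channel (depthwise) kernel are all
assumptions made for illustration. It only shows the idea of generating
convolution weights from the snippet features themselves and applying them
along the snippet axis, in contrast to plain averaging of snippet predictions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_segment_aggregation(feats, w_gen, kernel_size=3):
    """Aggregate snippet features with an input-dependent temporal kernel.

    feats: (U, C) array, one C-dimensional feature per snippet (U snippets).
    w_gen: (U, kernel_size) hypothetical projection that maps each channel's
           temporal profile to kernel logits (an assumption of this sketch).
    Returns the aggregated features (U, C) and the generated kernel (C, K).
    """
    feats = np.asarray(feats, dtype=float)
    num_snippets, _ = feats.shape
    assert kernel_size % 2 == 1, "odd kernel size keeps the output aligned"
    assert w_gen.shape == (num_snippets, kernel_size)

    # 1. Generate one kernel per channel from the input itself.
    logits = feats.T @ w_gen              # (C, K)
    kernel = softmax(logits, axis=1)      # each channel's weights sum to 1

    # 2. Depthwise convolution along the snippet axis with zero padding.
    pad = kernel_size // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)))
    out = np.zeros_like(feats)
    for k in range(kernel_size):
        out += padded[k:k + num_snippets] * kernel[:, k]
    return out, kernel


rng = np.random.default_rng(0)
snippet_feats = rng.normal(size=(8, 4))   # 8 snippets, 4 channels
w = rng.normal(size=(8, 3))
aggregated, dyn_kernel = dynamic_segment_aggregation(snippet_feats, w)

# Baseline mentioned in the abstract: plain averaging over snippets.
averaged = snippet_feats.mean(axis=0)
```

With an all-zero `w_gen` the generated kernel is uniform, so the sketch
reduces to a fixed local average; a non-trivial `w_gen` makes the weights
depend on the input, which is the "dynamic" part.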

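This paper introduces a novel Dynamic Segment Aggregation (DSA) module that generates a dynamic kernel for a convolutional operation to adaptively aggregate long-range temporal information among adjacent snippets. Combined with off-the-shelf clip-based models such as TSM and I3D, it yields DSANet, an efficient video recognition architecture with strong performance.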

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Temporal modeling remains challenging for action recognition in videos.
To mitigate this issue, this paper presents a new video architecture, termed
Temporal Difference Network (TDN), with a focus on capturing multi-scale
temporal information for efficient action recognition. The core of TDN is an
efficient temporal module (TDM) that explicitly leverages a temporal
difference operator; we systematically assess its effect on short-term and
long-term motion modeling. To fully capture temporal information
over the entire video, our TDN is established with a two-level difference
modeling paradigm. Specifically, for local motion modeling, temporal difference
over consecutive frames is used to supply 2D CNNs with finer motion patterns,
while for global motion modeling, temporal difference across segments is
incorporated to capture long-range structure for motion feature excitation. TDN
provides a simple and principled temporal modeling framework and can be
instantiated with existing CNNs at a small extra computational cost. TDN sets
a new state of the art on the Something-Something V1 & V2 datasets and is on
par with the best performance on the Kinetics-400 dataset. In addition, we
conduct in-depth ablation studies and visualize the results of TDN, aiming to
provide insightful analysis of temporal difference modeling. We release the
code at this https URL
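
The two-level difference idea described above can be illustrated with a small
NumPy sketch. It is not the released TDN code: the function names, the
zero-padding of the last segment, and the sigmoid gate used for "excitation"
are assumptions chosen to keep the example short. It shows a local difference
over consecutive frames and a global difference across segment-level features.

```python
import numpy as np


def local_temporal_difference(frames):
    """Difference of consecutive frames.

    frames: (T, H, W, C) array. Returns (T - 1, H, W, C).
    """
    frames = np.asarray(frames, dtype=float)
    return frames[1:] - frames[:-1]


def cross_segment_difference(segment_feats):
    """Difference between each segment feature and the next one.

    segment_feats: (S, C) array. Returns (S, C); the last row is zero
    because it has no successor (a padding choice made for this sketch).
    """
    feats = np.asarray(segment_feats, dtype=float)
    diff = np.zeros_like(feats)
    diff[:-1] = feats[1:] - feats[:-1]
    return diff


def excite_with_segment_difference(segment_feats):
    """Re-weight segment features with a gate computed from the difference.

    The sigmoid gate and the residual form are assumptions of this sketch.
    """
    feats = np.asarray(segment_feats, dtype=float)
    gate = 1.0 / (1.0 + np.exp(-cross_segment_difference(feats)))
    return feats + feats * gate


clip = np.arange(4, dtype=float).reshape(4, 1, 1, 1)   # 4 tiny frames
local_motion = local_temporal_difference(clip)

segments = np.array([[0.0], [2.0], [5.0]])             # 3 segment features
global_motion = cross_segment_difference(segments)
excited = excite_with_segment_difference(segments)
```

In a full model the local difference would be fed to a 2D CNN alongside the
RGB frame, and the cross-segment difference would operate on intermediate
feature maps; here both act on raw arrays only to make the operator concrete.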

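This paper proposes a new video architecture, the Temporal Difference Network (TDN), whose core is an efficient Temporal Difference Module (TDM) that captures multi-scale temporal information for efficient action recognition. TDN sets a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset. We also conduct in-depth ablation studies and visualize the results of TDN to provide insightful analysis of temporal difference modeling.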

TDN: Temporal Difference Networks for Efficient Action Recognition

The paucity of videos in current action classification datasets (UCF-101 and
HMDB-51) has made it difficult to identify good video architectures, as most
methods obtain similar performance on existing small-scale benchmarks. This
paper re-evaluates state-of-the-art architectures in light of the new Kinetics
Human Action Video dataset. Kinetics has two orders of magnitude more data,
with 400 human action classes and over 400 clips per class, and is collected
from realistic, challenging YouTube videos. We provide an analysis of how
current architectures fare on the task of action classification on this dataset
and how much performance improves on the smaller benchmark datasets after
pre-training on Kinetics.
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on
2D ConvNet inflation: filters and pooling kernels of very deep image
classification ConvNets are expanded into 3D, making it possible to learn
seamless spatio-temporal feature extractors from video while leveraging
successful ImageNet architecture designs and even their parameters. We show
that, after pre-training on Kinetics, I3D models considerably improve upon the
state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0%
on UCF-101.
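
The filter inflation described above can be sketched in a few lines of NumPy.
The repeat-and-rescale initialization used here (copy the 2D filter along a
new time axis and divide by the temporal length) follows the bootstrapping
scheme described in the I3D paper rather than this abstract, and the function
name `inflate_2d_filter` and the weight layout are assumptions of this sketch.

```python
import numpy as np


def inflate_2d_filter(w2d, time_dim):
    """Inflate a 2D conv filter into a 3D one.

    w2d: (kh, kw, c_in, c_out) array of 2D filter weights.
    Returns (time_dim, kh, kw, c_in, c_out): the 2D weights repeated along
    a new leading time axis and rescaled by 1 / time_dim.
    """
    w2d = np.asarray(w2d, dtype=float)
    w3d = np.repeat(w2d[np.newaxis], time_dim, axis=0)
    return w3d / time_dim


rng = np.random.default_rng(0)
w2d = rng.normal(size=(3, 3, 2, 4))        # 3x3 filter, 2 -> 4 channels
w3d = inflate_2d_filter(w2d, time_dim=5)

# A static image patch repeated in time (a "boring" video) should give the
# same filter response in 3D as the original patch gives in 2D.
patch = rng.normal(size=(3, 3, 2))
video_patch = np.repeat(patch[np.newaxis], 5, axis=0)
response_2d = np.tensordot(patch, w2d, axes=3)         # shape (4,)
response_3d = np.tensordot(video_patch, w3d, axes=4)   # shape (4,)
```

The rescaling is what lets the inflated network start from the behavior of
the pre-trained image model on static inputs, so ImageNet parameters remain
useful as an initialization.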

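This study re-evaluates state-of-the-art architectures on the Kinetics dataset and introduces a new Two-Stream Inflated 3D ConvNet (I3D) that can learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters. After pre-training on Kinetics, I3D models considerably improve action classification performance.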