Video understanding tasks have traditionally been modeled by two separate
architectures, specially tailored for two distinct tasks. Sequence-based video
tasks, such as action recognition, use a video backbone to directly extract
spatiotemporal features, while frame-based video tasks, such as multiple object
tracking (MOT), rely on a single fixed-image backbone to extract spatial
features. In contrast, we propose to unify video understanding tasks into one
novel streaming video architecture, referred to as Streaming Vision Transformer
(S-ViT). S-ViT first produces frame-level features with a memory-enabled
temporally-aware spatial encoder to serve the frame-based video tasks. Then the
frame features are input into a task-related temporal decoder to obtain
spatiotemporal features for sequence-based tasks. The efficiency and efficacy
of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based
action recognition task and a competitive advantage over conventional
architectures on the frame-based MOT task. We believe that the concept of a
streaming video model and the implementation of S-ViT are solid steps towards a
unified deep learning architecture for video understanding. Code will be
available at this https URL
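
The two-stage data flow described above can be illustrated with a toy sketch. This is not the S-ViT implementation: the class and function names (StreamingSpatialEncoder, temporal_decoder, run_stream), the bounded memory size, and the use of simple averaging in place of attention are all assumptions made for illustration. It only shows how per-frame features (for frame-based tasks) and a clip-level feature (for sequence-based tasks) could come out of one streaming pass.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Vector = List[float]


class StreamingSpatialEncoder:
    """Toy stand-in for a memory-enabled, temporally-aware spatial encoder."""

    def __init__(self, memory_size: int = 2) -> None:
        # Hypothetical design: keep features of the most recent frames only.
        self.memory: Deque[Vector] = deque(maxlen=memory_size)

    def encode(self, frame: Sequence[float]) -> Vector:
        # Placeholder "spatial" feature: the raw frame values.
        spatial = [float(x) for x in frame]
        # Temporal awareness (assumed): average current feature with memory.
        history = list(self.memory) + [spatial]
        feature = [sum(col) / len(history) for col in zip(*history)]
        # Store the current frame's spatial feature for later frames.
        self.memory.append(spatial)
        return feature


def temporal_decoder(frame_features: Sequence[Vector]) -> Vector:
    """Toy task-related temporal decoder: mean-pool features over time."""
    return [sum(col) / len(frame_features) for col in zip(*frame_features)]


def run_stream(frames: Sequence[Sequence[float]]) -> Tuple[List[Vector], Vector]:
    encoder = StreamingSpatialEncoder(memory_size=2)
    # Frame-level features, usable by frame-based tasks such as MOT.
    frame_features = [encoder.encode(f) for f in frames]
    # Clip-level feature, usable by sequence-based tasks such as action recognition.
    clip_feature = temporal_decoder(frame_features)
    return frame_features, clip_feature


if __name__ == "__main__":
    feats, clip = run_stream([[0, 0], [2, 4], [4, 8]])
    print(feats, clip)
```

In this sketch the encoder is called once per incoming frame, so frame-level outputs are available immediately, while the decoder runs only when a sequence-level prediction is needed.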

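This paper proposes a streaming video architecture named "Streaming Vision Transformer". It uses a memory-enabled, temporally-aware spatial encoder to produce frame-level features for frame-based video tasks; the frame-level features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The model achieves state-of-the-art accuracy on action recognition and a competitive advantage on frame-based multiple object tracking.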