Video-based person re-identification (re-ID) aims to match the same person across video clips. Efficiently exploiting multi-scale fine-grained features while building the structural interaction among them is pivotal for its success. In this paper, we propose a hybrid framework, Dense Interaction Learning (DenseIL), that combines the principal advantages of CNN-based and attention-based architectures to tackle the difficulties of video-based person re-ID. DenseIL contains a CNN Encoder and a Transformer Decoder. The CNN Encoder efficiently extracts discriminative spatial features, while the Transformer Decoder is designed to model the inherent spatial-temporal interaction across frames. Unlike the vanilla Transformer, we additionally let the Transformer Decoder densely attend to intermediate fine-grained CNN features, which naturally yields a multi-scale spatial-temporal feature representation for each video clip. Moreover, we introduce Spatio-TEmporal Positional Embedding (STEP-Emb) into the Transformer Decoder to investigate the positional relation among the spatial-temporal inputs. In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
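
To make the idea of a decoder that densely attends to intermediate CNN features more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that tokens from every CNN stage have already been projected to a common width (`DIM`), that one decoder step is applied per stage, and that learned projections, multi-head attention, and feed-forward layers are omitted. The names `attend`, `dense_decoder`, and `random_tokens` are hypothetical, and random vectors stand in for real CNN features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention with memory used as both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, memory)) for i in range(dim)])
    return out


def dense_decoder(queries, multi_scale_memory):
    """Assumed reading of dense attention: one residual attention step per CNN stage."""
    for memory in multi_scale_memory:
        attended = attend(queries, memory)
        queries = [[qi + ai for qi, ai in zip(q, a)] for q, a in zip(queries, attended)]
    return queries


random.seed(0)
DIM, FRAMES = 8, 4  # hypothetical token width and clip length


def random_tokens(n):
    """Random stand-ins for CNN feature tokens."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


# Three hypothetical CNN stages with 2, 4, and 8 spatial positions per frame,
# each flattened over frames and positions into one token list.
stages = [random_tokens(FRAMES * positions) for positions in (2, 4, 8)]
queries = random_tokens(FRAMES * 2)  # stand-in for the decoder input tokens

clip_tokens = dense_decoder(queries, stages)
# Average the tokens into a single clip-level feature vector.
clip_feature = [sum(col) / len(clip_tokens) for col in zip(*clip_tokens)]
```

Because each stage's token list spans all frames, every attention step in this sketch mixes information across both space and time, which is the property the abstract attributes to the decoder.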

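This paper proposes a hybrid framework based on CNNs and attention (DenseIL), in which the CNN encoder extracts discriminative spatial features, while the Dense Interaction decoder is designed to densely model the inherent spatial-temporal interaction across frames. Unlike previous approaches, we also let the Dense Interaction decoder densely attend to intermediate fine-grained CNN features, which naturally yields a multi-granularity spatial-temporal representation for each video clip. In addition, we introduce Spatio-Temporal Positional Embedding into the Dense Interaction decoder to investigate the positional relation among the spatial-temporal inputs. On multiple standard video-based person re-identification datasets, our method consistently and significantly outperforms all state-of-the-art methods.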
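
As an illustration of what a spatio-temporal positional embedding can look like, the sketch below assigns each token a code that depends on both its frame index and its spatial position. The text above does not specify the exact form of the embedding, so this is only an assumption: it uses the fixed sinusoidal encoding of the original Transformer as a stand-in for whatever (possibly learned) embedding the authors use, and it concatenates a temporal half and a spatial half so that every (frame, position) pair receives a distinct code. The names `sinusoid` and `step_embedding` are hypothetical.

```python
import math


def sinusoid(index, dim):
    """Fixed sinusoidal code for one integer index (stand-in for a learned table)."""
    return [
        math.sin(index / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(index / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def step_embedding(num_frames, num_positions, dim):
    """Map (frame, position) to a vector: temporal half followed by spatial half."""
    half = dim // 2
    table = {}
    for t in range(num_frames):
        temporal = sinusoid(t, half)
        for p in range(num_positions):
            table[(t, p)] = temporal + sinusoid(p, half)  # list concatenation
    return table


# Hypothetical clip: 4 frames, 6 spatial positions per frame, 8-dimensional tokens.
table = step_embedding(num_frames=4, num_positions=6, dim=8)

# Adding the embedding to a token before it enters the decoder.
token = [0.1] * 8  # placeholder feature for frame 1, position 2
token_with_position = [x + e for x, e in zip(token, table[(1, 2)])]
```

Concatenating rather than summing the two codes is a design choice made for this sketch: summing identical sinusoidal codes would give the tokens at (frame 1, position 2) and (frame 2, position 1) the same embedding.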

Dense Interaction Learning for Video-Based Person Re-Identification