Video-based person recognition is challenging due to persons being blocked
and blurred, and the variation of shooting angle. Previous research always
focused on person recognition on still images, ignoring similarity and
continuity between video frames. To tackle the challenges above, we propose a
novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based
person recognition, which aggregates face features and incorporates them with
multi-modal information to identify persons in videos. For frame aggregation,
we propose a novel trainable layer based on NetVLAD (named AttentionVLAD),
which takes arbitrary number of features as input and computes a fixed-length
aggregation feature based on feature quality. We show that introducing an
attention mechanism to NetVLAD can effectively decrease the impact of
low-quality frames. For the multi-model information of videos, we propose a
Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation of
multi-modality by adaptively updating Gram matrix. Experimental results on
iQIYI-VID-2019 dataset show that our framework outperforms other
state-of-the-art methods.

通过 Frame Aggregation and Multi-Modal Fusion framework 框架，将注意力机制引入 NetVLAD 模块进行视频内的人物识别，同时引入了 Multi-Layer Multi-Modal Attention 模块进行多模态相关性建模，实验结果表明该方法优于其他现有方法。