Contrastive language-image pretraining has shown great success in learning
visual-textual joint representation from web-scale data, demonstrating
remarkable "zero-shot" generalization ability for various image tasks. However,
how to effectively extend these language-image pretraining methods to the video
domain is still an open problem. In this work, we present a simple yet
effective approach that adapts the pretrained language-image models to video
recognition directly, instead of pretraining a new model from scratch. More
concretely, to capture the long-range dependencies of frames along the temporal
dimension, we propose a cross-frame attention mechanism that explicitly
exchanges information across frames. This module is lightweight and can be
plugged into pretrained language-image models seamlessly. Moreover, we propose
a video-specific prompting scheme, which leverages video content information
for generating discriminative textual prompts. Extensive experiments
demonstrate that our approach is effective and can be generalized to different
video recognition scenarios. In particular, under fully-supervised settings,
our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using
12 times fewer FLOPs than Swin-L and ViViT-H. In zero-shot
experiments, our approach surpasses the current state-of-the-art methods by
+7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In
few-shot scenarios, our approach outperforms previous best methods by +32.1%
and +23.1% when the labeled data is extremely limited. Code and models are
available at this https URL
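
The cross-frame attention idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each frame is summarized by a single feature vector, uses identity query/key/value projections instead of learned ones, and adds a residual connection. The function names `softmax` and `cross_frame_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_feats):
    """Let every frame attend to every frame (including itself).

    frame_feats: list of T per-frame vectors, each of length D.
    Assumption: queries, keys and values are the features themselves
    (identity projections); a trained model would learn these projections.
    """
    dim = len(frame_feats[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in frame_feats:
        # Scaled dot-product score between this frame and every frame.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in frame_feats
        ]
        weights = softmax(scores)
        # Attention-weighted sum of all frames' features.
        context = [
            sum(w * frame[d] for w, frame in zip(weights, frame_feats))
            for d in range(dim)
        ]
        # Residual connection keeps the frame's own information.
        mixed.append([x + c for x, c in zip(query, context)])
    return mixed


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = cross_frame_attention(frames)
```

Because the sketch only mixes one vector per frame, its cost grows with the number of frames rather than with the number of patches, which is one plausible reading of why such a module can stay lightweight.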

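This paper proposes a simple yet effective approach that adapts pretrained language-image models directly to video recognition. It uses a cross-frame attention mechanism and a video-specific prompting scheme to capture long-range temporal dependencies, and it improves zero-shot accuracy.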

Expanding Language-Image Pretrained Models for General Video Recognition

Vision-Language Navigation (VLN) is a challenging task that requires an
embodied agent to perform action-level modality alignment, i.e., make
instruction-asked actions sequentially in complex visual environments. Most
existing VLN agents learn directly from instruction-path data and cannot
sufficiently explore the action-level alignment knowledge inside the multi-modal
inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT),
which provides the VLN agent with action prompts to enable the explicit
learning of action-level modality alignment to pursue successful navigation.
Specifically, an action prompt is defined as a modality-aligned pair of an
image sub-prompt and a text sub-prompt, where the former is a single-view
observation and the latter is a phrase like "walk past the chair". When
starting navigation, the instruction-related action prompt set is retrieved
from a pre-built action prompt base and passed through a prompt encoder to
obtain the prompt feature. Then the prompt feature is concatenated with the
original instruction feature and fed to a multi-layer transformer for action
prediction. To build the prompt base from high-quality action prompts, we use
the Contrastive Language-Image Pretraining (CLIP) model, which has strong
cross-modal alignment ability. A modality alignment loss and a sequential
consistency loss are further introduced to enhance the alignment of the action
prompt and encourage the agent to focus on the related prompt sequentially.
Experimental results on both R2R and RxR show the superiority of ADAPT over
state-of-the-art methods.
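
The retrieve, encode, and concatenate steps described above can be sketched as follows. This is a toy illustration under stated assumptions, not the ADAPT implementation: retrieval is done by plain substring matching, the prompt encoder simply joins the two sub-prompt features, and all names (`retrieve_action_prompts`, `encode_prompt`, `build_input_sequence`, the dictionary keys) and feature values are hypothetical.

```python
def retrieve_action_prompts(instruction, prompt_base):
    """Return prompts whose text sub-prompt occurs in the instruction.

    Assumption: retrieval is plain substring matching, and the results
    are ordered by where each phrase appears in the instruction.
    """
    text = instruction.lower()
    hits = [(text.find(p["text"]), p) for p in prompt_base if p["text"] in text]
    return [p for _, p in sorted(hits, key=lambda hit: hit[0])]


def encode_prompt(prompt):
    """Toy prompt encoder: joins the image and text sub-prompt features."""
    return prompt["image_feat"] + prompt["text_feat"]


def build_input_sequence(instruction_feats, prompts):
    """Append prompt features to the instruction features (sequence axis)."""
    return instruction_feats + [encode_prompt(p) for p in prompts]


# Hypothetical prompt base: each entry pairs an image sub-prompt feature
# with a text sub-prompt and its feature.
prompt_base = [
    {"text": "walk past the chair", "image_feat": [0.1, 0.2], "text_feat": [0.3, 0.4]},
    {"text": "turn left", "image_feat": [0.5, 0.6], "text_feat": [0.7, 0.8]},
    {"text": "go up the stairs", "image_feat": [0.9, 1.0], "text_feat": [1.1, 1.2]},
]

instruction = "Turn left and walk past the chair."
prompts = retrieve_action_prompts(instruction, prompt_base)

# Two made-up instruction token features of the same width as a prompt feature.
instruction_feats = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
sequence = build_input_sequence(instruction_feats, prompts)
```

In this sketch `sequence` is what would be fed to the multi-layer transformer for action prediction; keeping the prompts in instruction order is a design choice made here for illustration.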

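This paper proposes modAlity-aligneD Action PrompTs (ADAPT), which enables instruction-following navigation in visual environments through explicit learning of action-level modality alignment, and which collects high-quality action prompts to improve alignment with the relevant prompts.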