TL;DR: A new mechanism is designed to adapt a pre-trained ViT model into a unified long-video Transformer, capturing fine-grained relations across clips while keeping computational and memory costs low, enabling efficient temporal action detection.
Abstract
Vision Transformer (ViT) has shown high potential in video recognition, owing
to its flexible design, adaptable self-attention mechanisms, and the efficacy
of masked pre-training. Yet, it remains unclear ho