Do video-text transformers learn to model temporal relationships across
frames? Despite their immense capacity and the abundance of multimodal training
data, recent work has revealed the strong tendency of video-text models towards
frame-based spatial representations, while temporal reasoning remains largely
unsolved. In this work, we identify several key challenges in temporal learning
of video-text transformers: the spatiotemporal trade-off from limited network
size; the curse of dimensionality in multi-frame modeling; and the diminishing
returns in semantic information from extending clip length. Guided by these
findings, we propose SViTT, a sparse video-text architecture that performs
multi-frame reasoning with significantly lower cost than naive transformers
with dense attention. Analogous to graph-based networks, SViTT employs two
forms of sparsity: edge sparsity that limits the query-key communications
between tokens in self-attention, and node sparsity that discards uninformative
visual tokens. Trained with a curriculum that increases model sparsity with
the clip length, SViTT outperforms dense transformer baselines on multiple
video-text retrieval and question answering benchmarks, at a fraction of
the computational cost. Project page: this http URL

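Summary: By introducing edge sparsity and node sparsity, the sparse video-text architecture SViTT performs multi-frame reasoning at lower cost than dense transformers and outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks. It is trained with a curriculum that increases model sparsity as the clip length grows.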
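
The two forms of sparsity named in the abstract can be illustrated with a small sketch. It is not the SViTT implementation. The local-window mask, the "total attention received" pruning score, and all function names below are assumptions chosen for illustration; the abstract does not specify which query-key pairs are kept or how uninformative tokens are scored.

```python
import math

def local_edge_mask(num_tokens, window):
    """Edge sparsity (assumed pattern): query i may attend to key j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(num_tokens)]
            for i in range(num_tokens)]

def count_edges(mask):
    """Number of query-key pairs that are actually computed under the mask."""
    return sum(sum(row) for row in mask)

def masked_softmax(scores, mask_row):
    """Softmax over the allowed keys only; masked keys receive zero weight."""
    peak = max(s for s, allowed in zip(scores, mask_row) if allowed)
    exps = [math.exp(s - peak) if allowed else 0.0
            for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention_weights(queries, keys, mask):
    """Scaled dot-product attention weights restricted to the edges in the mask."""
    dim = len(queries[0])
    weights = []
    for query, mask_row in zip(queries, mask):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights.append(masked_softmax(scores, mask_row))
    return weights

def prune_tokens(weights, keep):
    """Node sparsity (assumed score): keep the tokens that receive the most total attention."""
    num_tokens = len(weights)
    received = [sum(weights[i][j] for i in range(num_tokens))
                for j in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda j: received[j], reverse=True)
    return sorted(ranked[:keep])  # indices of surviving tokens, in original order

# Toy example: six 2-dimensional visual tokens (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
          [0.5, -0.5], [2.0, 0.0], [0.0, 2.0]]
mask = local_edge_mask(len(tokens), window=1)
weights = sparse_attention_weights(tokens, tokens, mask)
kept = prune_tokens(weights, keep=3)

print("edges computed:", count_edges(mask), "of", len(tokens) ** 2)
print("tokens kept:", kept)
```

Dense self-attention over N tokens computes N^2 query-key pairs, so the cost grows quadratically as frames are added. The mask lowers the number of pairs per token, and pruning lowers N itself for later layers, which is why the two forms of sparsity compound.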