BriefGPT.xyz
Apr, 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li, Kyle Min, Subarna Tripathi, Nuno Vasconcelos
TL;DR
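By introducing edge sparsity and node sparsity, the SViTT sparse video-text architecture performs multi-frame reasoning at lower cost, outperforms naive transformer baselines on multiple video-text retrieval and question-answering benchmarks, and is trained with model sparsity adapted to longer clip lengths.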
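To make the two kinds of sparsity named above concrete, here is a minimal Python sketch. It is not SViTT's implementation. It assumes that edge sparsity means restricting which query-key pairs may attend to each other (illustrated with a fixed local window) and that node sparsity means dropping low-scoring tokens. The function names, the window rule, and the token scores are all hypothetical.

```python
# Illustrative sketch only: names, the window rule, and the scores are
# hypothetical and are not taken from the SViTT paper or code.

def local_edge_mask(num_tokens, window):
    """Edge sparsity: token i may attend to token j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(num_tokens)]
            for i in range(num_tokens)]

def prune_nodes(scores, keep):
    """Node sparsity: return indices of the `keep` highest-scoring tokens,
    preserving their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Dense attention over 6 tokens uses 6 * 6 query-key pairs.
edges_dense = 6 * 6

# A local window of 1 keeps only each token's immediate neighbours and itself.
mask = local_edge_mask(num_tokens=6, window=1)
edges_sparse = sum(sum(row) for row in mask)

# Keep the 3 tokens with the highest (made-up) importance scores.
kept = prune_nodes([0.1, 0.9, 0.3, 0.7, 0.05, 0.5], keep=3)

print(edges_dense, edges_sparse, kept)
```

In this toy setting the mask removes more than half of the query-key pairs and the pruning step halves the token count, which is the kind of saving that would make attending over more frames affordable.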
Abstract
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-