Pretraining egocentric vision-language models has become essential to
improving downstream egocentric video-text tasks. These egocentric foundation
models commonly use the transformer architecture. The memory footprint of these
models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego,
the first sparse egocentric video-text transformer model integrating edge and
node sparsification. We pretrain on the EgoClip dataset and incorporate the
egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE.
Most notably, SViTT-Ego obtains a +2.8% gain in EgoMCQ (intra-video) accuracy
over LAVILA large, using no data augmentation beyond standard image
augmentations, while remaining pretrainable on memory-limited devices.
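
For reference, the InfoNCE objective named above can be sketched as follows.
This is a minimal pure-Python illustration of the standard symmetric video-text
InfoNCE loss, not the SViTT-Ego or EgoNCE implementation; the function names,
the temperature value, and the toy embeddings are assumptions made for the
example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_row(logits, pos):
    """Negative log-softmax of the positive entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[pos]


def info_nce(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch where pair i is (video_emb[i], text_emb[i]).

    Every other item in the batch is treated as a negative. The temperature
    default is an illustrative assumption, not a value from the paper.
    """
    n = len(video_emb)
    sim = [[cosine(v, t) / temperature for t in text_emb] for v in video_emb]
    # Video-to-text direction: each row of the similarity matrix.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column of the similarity matrix.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return v2t + t2v


# Toy batch: matched pairs give a much lower loss than mismatched pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
mismatched_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(videos, matched_texts)
loss_mismatched = info_nce(videos, mismatched_texts)
```

EgoNCE differs from this baseline in how positives and negatives are selected
for egocentric video; that selection is not reproduced in the sketch.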

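By pretraining on the EgoClip dataset, the sparse egocentric video-text transformer SViTT-Ego, which integrates edge and node sparsification with the egocentric-friendly EgoNCE objective, achieves a +2.8% accuracy gain on intra-video EgoMCQ relative to LAVILA large, without additional data augmentation techniques, and can be pretrained on memory-limited devices.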