Masked visual modeling (MVM) has recently proven effective for visual
pre-training. While similar reconstructive objectives on video inputs (e.g.,
masked frame modeling) have been explored in video-language (VidL)
pre-training, previous studies have failed to find a truly effective MVM strategy
that substantially benefits downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model VIOLETv2. Empirically,
we show that VIOLETv2 pre-trained with the MVM objective achieves notable
improvements on 13 VidL benchmarks, ranging from video question answering and
video captioning to text-to-video retrieval.
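
As a rough illustration of the simplest reconstructive target listed above (raw pixel values), the sketch below masks a random subset of video patches and scores a prediction only on the masked ones. It is a minimal sketch and not the VIOLET implementation: the function names, the L1 distance, the patch layout `(frames, patches, pixels_per_patch)`, and the masking ratio are all assumptions made for illustration.

```python
import numpy as np


def random_patch_mask(num_frames, num_patches, ratio, rng):
    """Return a boolean (num_frames, num_patches) mask; True marks a masked patch.

    Hypothetical helper: each patch is masked independently with probability `ratio`.
    """
    return rng.random((num_frames, num_patches)) < ratio


def masked_l1_loss(pred, target, mask):
    """Mean absolute error over masked patches only.

    pred, target: float arrays of shape (frames, patches, pixels_per_patch).
    mask: boolean array of shape (frames, patches).
    The L1 distance is an assumption; other regression losses would fit the same shape.
    """
    if not mask.any():
        return 0.0  # nothing was masked, so there is nothing to reconstruct
    per_patch_error = np.abs(pred - target).mean(axis=-1)  # shape (frames, patches)
    return float(per_patch_error[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((4, 16, 48))    # 4 frames, 16 patches, 48 pixel values each
    pred = np.zeros_like(target)        # stand-in for a model's reconstruction
    mask = random_patch_mask(4, 16, ratio=0.15, rng=rng)
    print(masked_l1_loss(pred, target, mask))
```

Swapping `target` for another quantity computed from the same patches (for example oriented gradients or a depth map) would leave the loss unchanged in form, which is one way to read the comparison of targets described in the abstract.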

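This paper systematically studies masked visual modeling (MVM) for video-language (VidL) pre-training. Building on the fully end-to-end VIdeO-LanguagE Transformer (VIOLET), it explores eight different MVM reconstruction targets, from low-level pixel values to high-level depth maps, optical flow, and latent visual features. Experiments show that pre-training with the MVM objective significantly improves the performance of the VIOLETv2 model.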

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Mainstream Video-Language Pre-training models \cite{actbert,clipbert,violet}
consist of three parts: a video encoder, a text encoder, and a video-text
fusion Transformer. They pursue better performance by using heavier
unimodal encoders or multimodal fusion Transformers, resulting in more
parameters and lower efficiency in downstream tasks. In this work, we
introduce, for the first time, an end-to-end video-language model, namely
\textit{all-in-one Transformer}, that embeds raw video and textual signals into
joint representations using a unified backbone architecture. We argue that the
unique temporal information of video data is a key barrier to the design of a
modality-agnostic Transformer. To overcome the
challenge, we introduce a novel and effective token rolling operation to encode
temporal representations from video clips in a non-parametric manner. The
careful design enables the representation learning of both video-text
multimodal inputs and unimodal inputs using a unified backbone model. Our
pre-trained all-in-one Transformer is transferred to various downstream
video-text tasks after fine-tuning, including text-video retrieval,
video question answering, multiple choice, and visual commonsense reasoning.
State-of-the-art performance with minimal model FLOPs on nine datasets
demonstrates the superiority of our method over competitive
counterparts. The code and pretrained model have been released at
this https URL
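
One way to picture a non-parametric token rolling operation is sketched below: a fraction of each frame's tokens is shifted along the temporal axis, so every frame receives some tokens from a neighbouring frame without any learned weights. This is a minimal sketch under stated assumptions, not the released implementation: the function name `roll_tokens`, the choice to roll the first `fraction` of tokens, the shift of one frame, and the wrap-around at clip boundaries are all illustrative choices.

```python
import numpy as np


def roll_tokens(tokens, fraction=0.25, shift=1):
    """Shift a fraction of the tokens of every frame along the time axis.

    tokens: array of shape (frames, tokens_per_frame, dim).
    fraction: share of tokens per frame that is rolled (assumed value).
    shift: number of frames to roll by; np.roll wraps around at the clip ends.
    Returns a new array of the same shape; the input is left unchanged.
    """
    num_tokens = tokens.shape[1]
    k = int(num_tokens * fraction)  # how many leading tokens are rolled
    rolled = tokens.copy()
    # The first k tokens of frame t now come from frame t - shift (cyclically).
    rolled[:, :k, :] = np.roll(tokens[:, :k, :], shift=shift, axis=0)
    return rolled


if __name__ == "__main__":
    clip = np.arange(3 * 4 * 2, dtype=float).reshape(3, 4, 2)  # 3 frames, 4 tokens
    mixed = roll_tokens(clip, fraction=0.5)
    print(mixed.shape)
```

Because the operation only rearranges existing tokens, it adds no parameters, which is consistent with the "non-parametric" description in the abstract; a subsequent self-attention layer can then mix the borrowed tokens with the frame's own tokens.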

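Introduces an end-to-end video-language model based on the all-in-one Transformer. A novel token rolling operation encodes the temporal representations of video data while enabling the model to handle multimodal inputs. After fine-tuning, the model achieves state-of-the-art performance on multiple datasets covering text-video retrieval, video question answering, multiple choice, and visual commonsense reasoning.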