Video-and-language understanding has a variety of industrial applications,
such as video question answering, text-video retrieval and
multi-label classification. Existing video-and-language understanding methods
generally adopt heavy multi-modal encoders and feature fusion modules, which
consume large amounts of GPU memory. In particular, they struggle
with the dense video frames or long text that are prevalent in industrial
applications. In this paper, we propose MuLTI, a highly accurate and
memory-efficient video-and-language understanding model that achieves efficient
and effective feature fusion through feature sampling and attention modules.
As a result, MuLTI can handle longer sequences with limited GPU memory. We also
introduce an attention-based adapter to the encoders, which finetunes the
shallow features to improve the model's performance with low GPU memory
consumption. Finally, to further improve the model's performance, we introduce
a new pretraining task named Multiple Choice Modeling to bridge the task gap
between pretraining and downstream tasks and enhance the model's ability to
align the video and the text. Benefiting from the efficient feature fusion
module, the attention-based adapter and the new pretraining task, MuLTI
achieves state-of-the-art performance on multiple datasets. Implementation and
pretrained models will be released.
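
To make the memory argument concrete, here is a minimal sketch of attention-based feature condensing: a small, fixed set of query vectors attends over a long feature sequence and returns one pooled vector per query, so whatever consumes the output sees a fixed length no matter how many frames or tokens went in. This is a generic illustration in plain Python, not MuLTI's actual sampler. The names `softmax` and `attention_pool`, the scaled dot-product scoring, and the notion that the queries would be learned parameters are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, features):
    """Condense len(features) vectors into len(queries) vectors.

    Hypothetical helper: each query scores every feature with a scaled
    dot product, and the output is the attention-weighted average of
    the features. In a trained model the queries would be learned.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    pooled = []
    for query in queries:
        scores = [
            sum(q * f for q, f in zip(query, feature)) / scale
            for feature in features
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * feature[d] for w, feature in zip(weights, features))
             for d in range(dim)]
        )
    return pooled


# Four input vectors are condensed into two output vectors.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
sampler_queries = [[0.0, 0.0], [5.0, 0.0]]
condensed = attention_pool(sampler_queries, frame_features)
```

An all-zero query weights every input equally and so returns the mean of the inputs, while the second query favors inputs with a large first component. The output length depends only on the number of queries, which is what keeps the downstream fusion cost independent of the input length in this sketch.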

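This paper proposes MuLTI, a highly accurate and memory-efficient video-and-language understanding model that achieves efficient and effective feature fusion through feature sampling and attention modules. It introduces an attention-based adapter that finetunes the encoders' shallow features to improve performance, and a new pretraining task, Multiple Choice Modeling, that strengthens the model's ability to align video and text. The model achieves state-of-the-art performance on multiple datasets; the implementation and pretrained models will be released.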
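
The abstract does not spell out how Multiple Choice Modeling is formulated. As a rough illustration only, the sketch below assumes a conventional multiple-choice setup: the model produces one matching score per candidate text for a given video, and training minimizes the cross-entropy of the correct candidate under a softmax over those scores. The function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def multiple_choice_loss(scores, correct_index):
    """Cross-entropy over candidates: -log softmax(scores)[correct_index].

    `scores` holds one video-text matching score per candidate answer
    (assumed to come from some model); higher means a better match.
    """
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[correct_index]


def predict_choice(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Illustrative scores for four candidate texts; index 1 is the correct one.
candidate_scores = [0.1, 2.0, -1.0, 0.3]
loss = multiple_choice_loss(candidate_scores, correct_index=1)
choice = predict_choice(candidate_scores)
```

Under this assumed formulation, the loss falls as the correct candidate's score rises relative to the distractors, which is one way a pretraining objective can resemble downstream multiple-choice question answering.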

MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling

The last several years have witnessed remarkable progress in
video-and-language (VidL) understanding. However, most modern VidL approaches
use complex and specialized model architectures and sophisticated pretraining
protocols, making the reproducibility, analysis and comparisons of these
frameworks difficult. Hence, instead of proposing yet another new VidL model,
this paper conducts a thorough empirical study demystifying the most important
factors in the VidL model design. Among the factors that we investigate are (i)
the spatiotemporal architecture design, (ii) the multimodal fusion schemes,
(iii) the pretraining objectives, (iv) the choice of pretraining data, (v)
pretraining and finetuning protocols, and (vi) dataset and model scaling. Our
empirical study reveals that the most important design factors include:
temporal modeling, video-to-text multimodal fusion, masked modeling objectives,
and joint training on images and videos. Using these empirical insights, we
then develop a step-by-step recipe, dubbed VindLU, for effective VidL
pretraining. Our final model, trained using this recipe, achieves results
comparable to or better than the state of the art on several VidL tasks without relying on
external CLIP pretraining. In particular, on the text-to-video retrieval task,
our approach obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming
the current state of the art by 7.8% and 6.1%, respectively. Furthermore, our model also obtains
state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA,
MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at:
this https URL
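
The abstract does not name the retrieval metric behind these percentages. Text-to-video retrieval is commonly scored with Recall@K, the fraction of text queries whose matching video appears among the top K ranked results. The following is a minimal sketch of that computation, assuming a square similarity matrix in which text query i is paired with video i; the function name and the toy scores are illustrative only.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose paired video ranks in the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three text queries against three videos.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.8, 0.5, 0.1],  # query 1: correct video ranked second
    [0.2, 0.3, 0.1],  # query 2: correct video ranked third
]
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
r3 = recall_at_k(toy_similarity, 3)
```

On the toy matrix, one of three queries is a hit at K=1, two at K=2, and all three at K=3.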

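This paper analyzes the most important factors in the design of modern video-and-language (VidL) models, including spatiotemporal architecture, multimodal fusion, the choice of pretraining data, and finetuning. It finds that temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos matter most, and distills these findings into VindLU, a step-by-step recipe for effective VidL pretraining. The resulting model matches or exceeds existing models on several VidL tasks.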

VindLU: A Recipe for Effective Video-and-Language Pretraining

Most existing video-and-language (VidL) research focuses on a single dataset,
or multiple datasets of a single task. In reality, a truly useful VidL system
is expected to be easily generalizable to diverse tasks, domains, and datasets.
To facilitate the evaluation of such systems, we introduce the Video-And-Language
Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets
over 3 popular tasks: (i) text-to-video retrieval; (ii) video question
answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad
range of video genres, video lengths, data volumes, and task difficulty levels.
Rather than focusing on single-channel videos with visual information only,
VALUE promotes models that leverage information from both video frames and
their associated subtitles, as well as models that share knowledge across
multiple tasks. We evaluate various baseline methods with and without
large-scale VidL pre-training, and systematically investigate the impact of
video input channels, fusion methods, and different video representations. We
also study the transferability between tasks, and conduct multi-task learning
under different settings. The significant gap between our best model and human
performance calls for future work on more advanced VidL models. VALUE is available
at this https URL
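
A benchmark that spans several tasks needs some way to summarize a model's results. The abstract does not say how VALUE aggregates scores, so the sketch below shows only one generic option: average the per-dataset scores within each task, then average across tasks so that a task with more datasets does not dominate. The dataset names and numbers are placeholders, not real results, and the sketch assumes every score has already been put on a comparable scale.

```python
def task_averages(scores_by_task):
    """Mean score per task from a {task: {dataset: score}} mapping."""
    return {
        task: sum(dataset_scores.values()) / len(dataset_scores)
        for task, dataset_scores in scores_by_task.items()
    }


def overall_average(scores_by_task):
    """Unweighted mean of the per-task means (a macro average)."""
    averages = task_averages(scores_by_task)
    return sum(averages.values()) / len(averages)


# Placeholder scores; dataset names are hypothetical.
example_scores = {
    "retrieval": {"dataset_a": 10.0, "dataset_b": 20.0},
    "question_answering": {"dataset_c": 60.0},
    "captioning": {"dataset_d": 30.0, "dataset_e": 50.0},
}
per_task = task_averages(example_scores)
overall = overall_average(example_scores)
```

Retrieval, question answering, and captioning are usually measured with different metrics, so any single summary number of this kind should be read alongside the per-task breakdown.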

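This study introduces the VALUE benchmark to evaluate video-and-language understanding models across multiple tasks and multiple datasets. It emphasizes coverage of diverse video genres, the use of both video frames and their associated subtitles to analyze video content, and multi-task learning, with the aim of advancing video-and-language understanding.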