Video Foundation Models (ViFMs) aim to learn a general-purpose representation
for various video understanding tasks. Leveraging large-scale datasets and
powerful models, ViFMs achieve this by capturing robust and generic features
from video data. This survey analyzes over 200 video foundation models,
offering a comprehensive overview of benchmarks and evaluation metrics across
14 distinct video tasks grouped into 3 main categories. Additionally, we
offer an in-depth performance analysis of these models for the 6 most common
video tasks. We group ViFMs into three categories: 1) image-based ViFMs,
which adapt existing image models for video tasks; 2) video-based ViFMs, which
use video-specific encoding methods; and 3) Universal Foundational Models
(UFMs), which combine multiple modalities (image, video, audio, text, etc.)
within a single framework. By comparing the performance of various ViFMs on
different tasks, this survey offers valuable insights into their strengths and
weaknesses, guiding future advancements in video understanding. Our analysis
surprisingly reveals that image-based foundation models consistently outperform
video-based models on most video understanding tasks. Additionally, UFMs, which
leverage diverse modalities, demonstrate superior performance on video tasks.
We share the comprehensive list of ViFMs studied in this work at:
https://github.com/NeeluMadan/ViFM_Survey.git

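This study analyzes over 200 video foundation models, gives a comprehensive overview of 14 distinct video tasks, and provides a performance analysis for the 6 most common of them. It finds that image-based models consistently perform strongly on most video understanding tasks, while universal foundation models that leverage multiple modalities achieve superior performance on video tasks.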

Foundation Models for Video Understanding: A Survey

Recent advancements in video saliency prediction (VSP) have shown promising
performance compared to the human visual system, whose emulation is the primary
goal of VSP. However, current state-of-the-art models employ spatio-temporal
transformers trained on limited amounts of data, hindering generalization
and adaptation to downstream tasks. The benefits of vision foundation models
present a potential solution to improve the VSP process. However, adapting
image foundation models to the video domain presents significant challenges in
modeling scene dynamics and capturing temporal information. To address these
challenges, and as the first initiative to design a VSP model based on video
foundation models, we introduce SalFoM, a novel encoder-decoder video
transformer architecture. Our model employs UnMasked Teacher (UMT) as its
feature extractor and a heterogeneous decoder that features a locality-aware
spatio-temporal transformer and integrates local and global spatio-temporal
information from various perspectives to produce the final saliency map. Our
qualitative and quantitative experiments on the challenging VSP benchmark
datasets of DHF1K, Hollywood-2 and UCF-Sports demonstrate the superiority of
our proposed model in comparison with the state-of-the-art methods.

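Using a video foundation model, we introduce SalFoM, a novel encoder-decoder video transformer architecture. It pairs the UnMasked Teacher (UMT) feature extractor with a heterogeneous decoder containing a locality-aware spatio-temporal transformer, fusing local and global spatio-temporal information from multiple perspectives to produce the final saliency map. Qualitative and quantitative experiments on the challenging VSP benchmark datasets DHF1K, Hollywood-2, and UCF-Sports demonstrate the superiority of the proposed model over state-of-the-art methods.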

SalFoM: Dynamic Saliency Prediction with Video Foundation Models

Scale is the primary factor for building a powerful foundation model that
generalizes well to a variety of downstream tasks. However, it is still
challenging to train video foundation models with billions of parameters. This
paper shows that the video masked autoencoder (VideoMAE) is a scalable and
general self-supervised pre-trainer for building video foundation models. We
scale VideoMAE in both model size and data with a core design. Specifically, we
present a
dual masking strategy for efficient pre-training, with an encoder operating on
a subset of video tokens and a decoder processing another subset of video
tokens. Although VideoMAE is already efficient thanks to the high masking ratio
in its encoder, masking the decoder as well further reduces the overall
computational cost. This enables efficient pre-training of billion-parameter
video models.
We also use a progressive training paradigm that involves an initial
pre-training on a diverse multi-sourced unlabeled dataset, followed by a
post-pre-training on a mixed labeled dataset. Finally, we successfully train a
video ViT model with a billion parameters, which achieves a new
state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and
89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In
addition, we extensively verify the pre-trained video ViT models on a variety
of downstream tasks, demonstrating their effectiveness as general video
representation learners. The code and models are available at
https://github.com/OpenGVLab/VideoMAEv2.
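
The dual masking idea described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the function name `dual_masks`,
the default ratios (0.9 for the encoder mask, 0.5 for the share of hidden tokens
the decoder reconstructs), and the use of uniform random sampling for the
decoder subset are all assumptions made for illustration. The abstract only
states that the encoder and the decoder each operate on a subset of the video
tokens.

```python
import random


def dual_masks(num_frames, tokens_per_frame,
               enc_mask_ratio=0.9, dec_keep_ratio=0.5, seed=0):
    """Return (encoder_visible, decoder_targets) as sorted token indices.

    Tokens are indexed as frame * tokens_per_frame + position.
    All names and default ratios are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)

    # Encoder side: pick the visible spatial positions once and reuse them
    # in every frame, so the same positions are hidden along the time axis.
    num_visible_pos = max(1, round(tokens_per_frame * (1 - enc_mask_ratio)))
    visible_pos = sorted(rng.sample(range(tokens_per_frame), num_visible_pos))
    encoder_visible = [f * tokens_per_frame + p
                       for f in range(num_frames) for p in visible_pos]

    # Tokens the encoder never sees are the candidates for reconstruction.
    visible_set = set(encoder_visible)
    hidden = [i for i in range(num_frames * tokens_per_frame)
              if i not in visible_set]

    # Decoder side: reconstruct only a subset of the hidden tokens.
    # Uniform sampling is a stand-in here, not the paper's scheme.
    num_targets = max(1, round(len(hidden) * dec_keep_ratio))
    decoder_targets = sorted(rng.sample(hidden, num_targets))
    return encoder_visible, decoder_targets


if __name__ == "__main__":
    enc, dec = dual_masks(num_frames=8, tokens_per_frame=196)
    print(len(enc), len(dec))
```

With 8 frames of 196 tokens each (1568 tokens in total), these assumed ratios
leave the encoder 160 visible tokens and give the decoder 704 reconstruction
targets, so neither network processes the full token sequence.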

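This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models, and reports new state-of-the-art performance across a variety of downstream tasks.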