Vision-language pre-training has significantly elevated performance across a
wide range of image-language applications. Yet, the pre-training process for
video-related tasks demands exceptionally large computational and data
resources, which hinders the progress of video-language models. This paper
investigates a straightforward, highly efficient, and resource-light approach
to adapting an existing image-language pre-trained model for dense video
understanding. Our preliminary experiments reveal that directly fine-tuning
pre-trained image-language models with multiple frames as inputs on video
datasets leads to performance saturation or even a drop. Further
investigation reveals that this is largely attributable to the bias of learned
high-norm visual features. Motivated by this finding, we propose a simple but
effective pooling strategy that smooths the feature distribution along the
temporal dimension and thus reduces the dominant impact of extreme
features. The new model is termed Pooling LLaVA, or PLLaVA for short.
PLLaVA achieves new state-of-the-art performance on modern benchmark
datasets for both video question answering and captioning tasks. Notably, on the
recently popular Video ChatGPT benchmark, PLLaVA achieves an average score of
3.48 out of 5 across the five evaluated dimensions, exceeding the previous SOTA
result from GPT4V (IG-VLM) by 9\%. On the latest multiple-choice benchmark
MVBench, PLLaVA achieves 58.1\% accuracy on average across 20 sub-tasks, 14.5\%
higher than GPT4V (IG-VLM). Code is available at
https://github.com/magic-research/PLLaVA.
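
The temporal pooling idea described above can be illustrated with a minimal sketch. It is not the released PLLaVA code: the function names `adaptive_avg_pool_time` and `l2_norm` are hypothetical, the features are plain Python lists with one vector per frame, and the window boundaries follow the usual adaptive-average-pooling convention, which is an assumption here. The sketch only shows how averaging over neighbouring frames dampens a single high-norm feature.

```python
import math


def adaptive_avg_pool_time(frames, out_len):
    """Average-pool a list of per-frame feature vectors down to out_len steps.

    frames: list of T vectors (lists of floats), all with the same length.
    Window i covers frames [floor(i*T/out_len), ceil((i+1)*T/out_len)).
    """
    t = len(frames)
    if not 1 <= out_len <= t:
        raise ValueError("out_len must be between 1 and the number of frames")
    dim = len(frames[0])
    pooled = []
    for i in range(out_len):
        start = (i * t) // out_len
        end = -((-(i + 1) * t) // out_len)  # ceiling division
        window = frames[start:end]
        # Mean of each feature dimension over the frames in this window.
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def l2_norm(vec):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in vec))


# Four frames; the third carries an extreme (high-norm) feature.
frames = [[1.0, 1.0], [1.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
pooled = adaptive_avg_pool_time(frames, 2)
print(pooled)
print(max(l2_norm(f) for f in frames), max(l2_norm(f) for f in pooled))
```

In this toy input the largest feature norm drops after pooling, because the outlier frame is averaged with its neighbour. Real visual features also have spatial dimensions, which this sketch leaves out.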

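By introducing a simple but effective pooling strategy, this paper adapts image-language pre-trained models to video understanding tasks and achieves new state-of-the-art results on benchmarks such as question answering and captioning.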

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to
perceive objects in still images, but their application in video-related tasks,
such as object tracking, remains understudied. This lack of exploration is
primarily due to two key challenges. Firstly, extensive pretraining on
large-scale video datasets is required to equip MLLMs with the capability to
perceive objects across multiple frames and understand inter-frame
relationships. Secondly, processing a large number of frames within the context
window of Large Language Models (LLMs) can impose a significant computational
burden. To address the first challenge, we introduce ElysiumTrack-1M, a
large-scale video dataset paired with novel tasks: Referring Single Object
Tracking (RSOT) and Video Referring Expression Generation (Video-REG).
ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding
object boxes and descriptions. Leveraging this dataset, we train MLLMs and
propose a token-compression model, T-Selector, to tackle the second
challenge. Our proposed approach, Elysium: Exploring Object-level Perception in
Videos via MLLM, is an end-to-end trainable MLLM that makes the first attempt
to conduct object-level tasks in videos without requiring any additional
plug-in or expert models.

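Through large-scale pre-training on a large video dataset, we propose a new multi-modal large language model (MLLM) named Elysium, which can perform object-level tasks in videos without any additional plug-in or expert models.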

Elysium: Exploring Object-level Perception in Videos via MLLM

Inspired by the fact that human eyes continue to develop tracking ability in
early and middle childhood, we propose to use tracking as a proxy task for a
computer vision system to learn visual representations. Modelled on the
game of Catch played by children, we design a Catch-the-Patch (CtP) game for a
3D-CNN model to learn visual representations that would help with video-related
tasks. In the proposed pretraining framework, we cut an image patch from a
given video and let it scale and move according to a pre-set trajectory. The
proxy task is to estimate the position and size of the image patch in a
sequence of video frames, given only the target bounding box in the first
frame. We discover that using multiple image patches simultaneously brings
clear benefits. We further increase the difficulty of the game by randomly
making patches invisible. Extensive experiments on mainstream benchmarks
demonstrate the superior performance of CtP against other video pretraining
methods. In addition, CtP-pretrained features are less sensitive to domain gaps
than those trained by a supervised action recognition task. When both are
trained on Kinetics-400, we are pleasantly surprised to find that the
CtP-pretrained representation achieves much higher action classification
accuracy than its fully supervised counterpart on the Something-Something
dataset. Code is available online: github.com/microsoft/CtP.
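
The target-generation side of the Catch-the-Patch proxy task can be sketched as follows. This is an illustration under stated assumptions, not the code from the CtP repository: the helper names `make_trajectory` and `sample_visibility` are hypothetical, the pre-set trajectory is taken to be a simple linear interpolation of a `(cx, cy, w, h)` box, and patch invisibility is modelled as an independent coin flip per frame. The abstract does not specify either choice.

```python
import random


def make_trajectory(start_box, end_box, num_frames):
    """Linearly interpolate (cx, cy, w, h) boxes from start_box to end_box.

    Returns one box per frame; these serve as regression targets for the
    position and size of the pasted patch.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    boxes = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)
        boxes.append(tuple((1 - alpha) * s + alpha * e
                           for s, e in zip(start_box, end_box)))
    return boxes


def sample_visibility(num_frames, hide_prob, rng):
    """Decide per frame whether the patch is pasted (True) or hidden (False).

    The first frame is always visible because it carries the given target box.
    """
    return [True] + [rng.random() >= hide_prob for _ in range(num_frames - 1)]


# A patch that moves right and down while doubling in size over five frames.
boxes = make_trajectory((10, 10, 4, 4), (30, 20, 8, 8), 5)
visible = sample_visibility(5, 0.5, random.Random(0))
for box, vis in zip(boxes, visible):
    print(box, "visible" if vis else "hidden")
```

In a full pipeline the patch would be cropped from the video, resized to each box, and pasted into the frames marked visible, while the model is still asked to predict the box in every frame. Using several patches at once, as the abstract reports to be beneficial, would amount to repeating this procedure per patch.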

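This paper uses tracking as a proxy task and designs a Catch-the-Patch (CtP) game that lets a 3D-CNN model learn visual representations that help with video-related tasks. Extensive experiments show that CtP-pretrained features outperform other video pretraining methods.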