Stimulated by the sophisticated reasoning capabilities of recent Large
Language Models (LLMs), a variety of strategies for bridging the video modality
have been devised. A prominent strategy involves Video Language Models
(VideoLMs), which train a learnable interface with video data to connect
advanced vision encoders with LLMs. Recently, an alternative strategy has
surfaced, employing readily available foundation models, such as VideoLMs and
LLMs, across multiple stages for modality bridging. In this study, we introduce
a simple yet novel strategy where only a single Vision Language Model (VLM) is
utilized. Our starting point is the plain insight that a video comprises a
series of images, or frames, interwoven with temporal information. The essence
of video comprehension lies in adeptly managing the temporal aspects along with
the spatial details of each frame. Initially, we transform a video into a
single composite image by arranging multiple frames in a grid layout. The
resulting single image is termed an image grid. This format, while
maintaining the appearance of a solitary image, effectively retains temporal
information within the grid structure. Therefore, the image grid approach
enables direct application of a single high-performance VLM without
necessitating any video-data training. Our extensive experimental analysis
across ten zero-shot video question answering benchmarks, including five
open-ended and five multiple-choice benchmarks, reveals that the proposed Image
Grid Vision Language Model (IG-VLM) surpasses existing methods on nine of the
ten benchmarks.
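
The frame-to-grid step described above can be sketched in a few lines of
NumPy. This is a minimal illustration, not the authors' implementation: the
abstract does not state how frames are sampled or how large the grid is, so
the uniform midpoint sampling, the 2x3 layout, and the function names
`sample_frame_indices` and `make_image_grid` are all assumptions made here.

```python
import numpy as np


def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples indices spread uniformly over [0, num_frames).

    Assumption: uniform sampling; the abstract does not specify a scheme.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Use the midpoint of each of num_samples equal-length segments.
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def make_image_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (N, H, W, C) into one (rows*H, cols*W, C) image.

    Frames are placed in row-major order, so reading the grid left to right,
    top to bottom follows the temporal order of the sampled frames.
    """
    n, h, w, c = frames.shape
    if n != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {n}")
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


# Toy video: 60 frames of 4x5 RGB, where every pixel of frame t has value t.
video = np.stack([np.full((4, 5, 3), t, dtype=np.uint8) for t in range(60)])
indices = sample_frame_indices(len(video), 6)
grid = make_image_grid(video[indices], rows=2, cols=3)
print(indices)     # [5, 15, 25, 35, 45, 55]
print(grid.shape)  # (8, 15, 3)
```

The resulting array is an ordinary image, so under this sketch it could be
passed to any VLM that accepts a single image together with the question.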

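This study proposes a simple yet novel strategy that converts a video into a single composite image in the form of an image grid, enabling direct application of a high-performance vision language model to video without any video-data training, and it surpasses existing methods on nine of ten zero-shot video question answering benchmarks.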

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

Existing long video retrieval systems are trained and tested in the
paragraph-to-video retrieval regime, where every long video is described by a
single long paragraph. This neglects the richness and variety of possible valid
descriptions of a video, which could be described in moment-by-moment detail,
or in a single phrase summary, or anything in between. To provide a more
thorough evaluation of the capabilities of long video retrieval systems, we
propose a pipeline that leverages state-of-the-art large language models to
carefully generate a diverse set of synthetic captions for long videos. We
validate this pipeline's fidelity via rigorous human inspection. We then
benchmark a representative set of video language models on these synthetic
captions using a few long video datasets, showing that they struggle with the
transformed data, especially the shortest captions. We also propose a
lightweight fine-tuning method, where we use a contrastive loss to learn a
hierarchical embedding based on the differing levels of information among
the various captions. Our method improves performance both on the downstream
paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on
the various long video retrieval metrics we compute using our synthetic data
(+3.6% R@1 for short descriptions on ActivityNet). For data access and other
details, please refer to our project website.
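
To make the contrastive loss and the R@1 metric mentioned above concrete,
here is a minimal NumPy sketch. It shows a standard symmetric InfoNCE loss
over paired caption and video embeddings, plus Recall@K computed from a
similarity matrix. It does not reproduce the paper's hierarchical component,
which the abstract does not specify; the temperature value and the function
names are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(text_emb: np.ndarray, video_emb: np.ndarray,
                       temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: row i of text_emb is paired with row i of video_emb.

    Assumption: temperature 0.07 is a common default, not a value from the paper.
    """
    logits = l2_normalize(text_emb) @ l2_normalize(video_emb).T / temperature
    diag = np.arange(len(logits))
    text_to_video = -log_softmax(logits, axis=1)[diag, diag].mean()
    video_to_text = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (text_to_video + video_to_text))


def recall_at_k(similarity: np.ndarray, k: int = 1) -> float:
    """Fraction of captions whose matching video ranks in the top k.

    similarity[i, j] scores caption i against video j; the match is j == i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())


# Toy check: perfectly aligned embeddings score better than misaligned ones.
aligned = np.eye(3)
shuffled = np.eye(3)[[1, 2, 0]]
print(symmetric_info_nce(aligned, aligned) < symmetric_info_nce(aligned, shuffled))  # True
print(recall_at_k(aligned @ aligned.T, k=1))   # 1.0
print(recall_at_k(aligned @ shuffled.T, k=1))  # 0.0
```

In this setting, R@1 for short descriptions would correspond to building the
similarity matrix from the short synthetic captions rather than from the
full paragraphs.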

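By generating diverse synthetic captions for long videos with large language models, this work evaluates the capabilities of long video retrieval systems and proposes a lightweight fine-tuning method (contrastive learning based on the differing levels of information among the captions) that yields clear improvements on both the downstream paragraph-to-video retrieval task and the various long video retrieval metrics computed on the synthetic data.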