Videos are more informative than images because they capture the dynamics of
the scene. By representing motion in videos, we can capture dynamic activities.
In this work, we introduce GPT-4 generated motion descriptions that capture
the fine-grained motion of activities and apply them to three action
datasets. We evaluated several video-text models on the task of retrieval of
motion descriptions. We found that they fall far behind human expert
performance on two action datasets, raising the question of whether video-text
models understand motion in videos. To address this, we introduce a method of
improving motion understanding in video-text models by utilizing motion
descriptions. This method proves to be effective on two action datasets for the
motion description retrieval task. The results draw attention to the need for
quality captions involving fine-grained motion information in existing datasets
and demonstrate the effectiveness of the proposed pipeline in understanding
fine-grained motion during video-text retrieval.
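
The motion description retrieval task above can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that each video and each description is already embedded as a vector, that the matching pair shares an index, and that ranking uses cosine similarity with Recall@K as the metric. The function names and toy vectors are hypothetical and serve only as illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(video_embs, text_embs, k):
    """Fraction of videos whose matching description (same index) ranks in the top k."""
    hits = 0
    for i, video in enumerate(video_embs):
        scores = [cosine(video, text) for text in text_embs]
        # Rank description indices from most to least similar to this video.
        ranked = sorted(range(len(text_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_embs)


# Toy embeddings (hypothetical): descriptions 0 and 1 are near-duplicates,
# mimicking captions that differ only in fine-grained motion detail.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print(recall_at_k(videos, texts, 1))
print(recall_at_k(videos, texts, 2))
```

In this toy setup the second video retrieves the wrong description at rank 1, so Recall@1 is lower than Recall@2; a real evaluation would use embeddings produced by the video-text model under test.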

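By applying GPT-4 generated motion descriptions to three action datasets and evaluating several video-text models on the motion description retrieval task, this study examines the difference in informativeness between videos and images. It focuses on how well video-text models understand motion in videos and on the need for fine-grained motion information in existing datasets, and it demonstrates the effectiveness of using motion descriptions to improve video-text models' understanding of fine-grained motion.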

Diving Deep into the Motion Representation of Video-Text Models

Vision-Language models have shown strong performance in the image domain,
even in zero-shot settings, thanks to the availability of large amounts of
pretraining data (i.e., paired image-text examples). However, for videos, such
paired data is not as abundant. Thus, video-text models are usually designed by
adapting pretrained image-text models to the video domain, instead of training
from scratch. All such recipes rely on augmenting visual embeddings with temporal
information (i.e., image -> video), often keeping text embeddings unchanged or
even discarding them. In this paper, we argue that such adapted video-text
models can benefit more by augmenting text rather than visual information. We
propose VicTR, which jointly optimizes text and video tokens, generating
'Video-conditioned Text' embeddings. Our method can further make use of
freely available semantic information, in the form of visually-grounded
auxiliary text (e.g., object or scene information). We conduct experiments on
multiple benchmarks including supervised (Kinetics-400, Charades), zero-shot
and few-shot (HMDB-51, UCF-101) settings, showing competitive performance on
activity recognition based on video-text models.

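This paper proposes VicTR, a method for optimizing video-text models that augments text information in addition to visual information in order to improve activity recognition. Experiments show that the method achieves competitive performance on multiple benchmarks, covering supervised, zero-shot, and few-shot settings for video-text models.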