The advent of large vision-language models (LVLMs) has spurred research into
their applications in multi-modal contexts, particularly in video
understanding. Traditional VideoQA benchmarks, despite providing quantitative
metrics, often fail to encompass the full spectrum of video content and
inadequately assess models' temporal comprehension. To address these
limitations, we introduce MMBench-Video, a quantitative benchmark designed to
rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video
incorporates lengthy videos from YouTube and employs free-form questions,
mirroring practical use cases. The benchmark is meticulously crafted to probe
the models' temporal reasoning skills, with all questions human-annotated
according to a carefully constructed ability taxonomy. We employ GPT-4 for
automated assessment, which proves more accurate and robust than earlier
LLM-based evaluations. Utilizing MMBench-Video, we have conducted
comprehensive evaluations that include both proprietary and open-source LVLMs
for images and videos. MMBench-Video stands as a valuable resource for the
research community, facilitating improved evaluation of LVLMs and catalyzing
progress in the field of video understanding. The evaluation code of
MMBench-Video will be integrated into VLMEvalKit:
this https URL

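By introducing MMBench-Video to evaluate large vision-language models on video understanding, this work provides a benchmark that covers video content more fully and properly assesses models' temporal comprehension, offering a valuable resource for improving LVLM evaluation and advancing the field of video understanding.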

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Referring video segmentation relies on natural language expressions to
identify and segment objects, often emphasizing motion clues. Previous works
treat a sentence as a whole and directly perform identification at the
video-level, mixing up static image-level cues with temporal motion cues.
However, image-level features cannot adequately capture the motion cues in a sentence,
and static cues are not crucial for temporal perception. In fact, static cues
can sometimes interfere with temporal perception by overshadowing motion cues.
In this work, we propose to decouple video-level referring expression
understanding into static and motion perception, with a specific emphasis on
enhancing temporal comprehension. First, we introduce an
expression-decoupling module that lets static cues and motion cues play their
distinct roles, alleviating the issue of sentence embeddings overlooking motion
cues. Second, we propose a hierarchical motion perception module to capture
temporal information effectively across varying timescales. Furthermore, we
employ contrastive learning to distinguish the motions of visually similar
objects. These contributions yield state-of-the-art performance across five
datasets, including a remarkable $\textbf{9.2\%}$ $\mathcal{J}\&\mathcal{F}$ improvement
on the challenging $\textbf{MeViS}$ dataset. Code is available at
this https URL
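
For readers unfamiliar with the $\mathcal{J}\&\mathcal{F}$ score reported above, the following is a minimal sketch of how it is conventionally formed in video object segmentation: $\mathcal{J}$ is the region similarity (intersection-over-union of predicted and ground-truth masks) and $\mathcal{J}\&\mathcal{F}$ is the mean of $\mathcal{J}$ and the contour accuracy $\mathcal{F}$. The function names `region_similarity` and `j_and_f` and the toy masks are illustrative assumptions, not taken from the paper's code, and $\mathcal{F}$ is passed in as a given number because its boundary-matching computation is not shown here.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j_score + f_score) / 2.0


# Toy 2x3 masks (hypothetical data): 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)  # 2 / 4 = 0.5
score = j_and_f(j, 0.7)          # F = 0.7 is a made-up value for illustration
```

In practice these quantities are computed per frame and per object and then averaged over the dataset.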

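This work decouples video-level referring expression understanding into static and motion perception, strengthens temporal perception, and uses contrastive learning to distinguish the motions of visually similar objects. It achieves state-of-the-art performance on five datasets, including a notable 9.2% J&F improvement on the challenging MeViS dataset.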