With the ever-increasing popularity of pretrained Video-Language Models
(VidLMs), there is a pressing need to develop robust evaluation methodologies
that delve deeper into their visio-linguistic capabilities. To address this
challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic
benchmark that places the assessment of fine-grained capabilities of these
models on a firm footing. Task-based evaluations, while valuable, fail to
capture the complexities and specific temporal aspects of moving images that
VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers
a controlled evaluation suite that sheds light on the true potential of these
models, as well as their performance gaps compared to human-level
understanding. ViLMA also includes proficiency tests, which assess basic
capabilities deemed essential to solving the main counterfactual tests. We show
that current VidLMs' grounding abilities are no better than those of
vision-language models that use static images. This is especially striking
once the performance on proficiency tests is factored in. Our benchmark serves
as a catalyst for future research on VidLMs, helping to highlight areas that
still need to be explored.

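By proposing ViLMA (Video Language Model Assessment) as a task-agnostic
benchmark, we develop a robust evaluation methodology for the fine-grained
capabilities of pretrained video-language models. Through carefully curated
counterfactuals, the benchmark provides a controlled evaluation suite that
reveals the true potential of these models, as well as their performance gaps
compared to human-level understanding.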