Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations can be misleading, because many questions can be answered merely from the objects and contexts in a single frame or from biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models are easily fooled by counterfactually augmented data and fall far behind human-level performance. To narrow the gap between video-text models and human performance on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learning action semantics that leverages knowledge obtained from a pretrained large language model. Experiments and analyses show that our approach learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models. Our Feint6K dataset and project page are available at https://feint6k.github.io.
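As a rough illustration of what a retrieval-style evaluation such as RCAD might compute, the following sketch scores one video against its original caption and a set of counterfactual captions, and counts a hit when the original caption ranks first. The abstract does not specify the scoring function or metric, so the use of cosine similarity, the function names (`cosine`, `rcad_hit`, `rcad_accuracy`), and the toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rcad_hit(video_emb, caption_embs, true_index=0):
    """True if the original caption gets the highest score.

    caption_embs holds the original caption plus its counterfactual
    variants; ties are resolved in favour of the earliest index.
    """
    scores = [cosine(video_emb, c) for c in caption_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best == true_index

def rcad_accuracy(samples):
    """Fraction of (video, captions, true_index) samples ranked correctly."""
    hits = [rcad_hit(v, caps, idx) for v, caps, idx in samples]
    return sum(hits) / len(hits)

# Toy, made-up 2-D embeddings (hypothetical; real models produce much larger vectors).
samples = [
    ([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]], 0),  # original caption wins
    ([0.0, 1.0], [[0.8, 0.2], [0.2, 0.8]], 0),  # counterfactual caption wins
]
accuracy = rcad_accuracy(samples)
print(accuracy)
```

In this toy setup the model is "fooled" on the second sample, which is the failure mode the abstract attributes to existing video-text models when the candidate captions differ only in ways that require cross-frame reasoning.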

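Recent video-text foundation models have shown strong performance on a variety of downstream video understanding tasks. However, standard video-text evaluations can be misleading, because many questions can be answered merely from the objects and context in a single frame or from biases inherent in the datasets. This paper aims to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, retrieval from counterfactually augmented data (RCAD), and create a new Feint6K dataset. Experiments and analyses show that our approach learns more discriminative action embeddings and improves results on Feint6K across multiple video-text models.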

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data